Very low basemeans compared to another pipeline

murphytho commented 3 months ago

Hi there,

I have a dataset that I'm running two separate pipelines on-- one is the TEtranscripts pipeline (built from the documentation here) and the other is a more standard RNA-seq pipeline that consists of STAR (with default settings), featurecounts (from the subread package), and DESeq2.

When I compared the outputs of the two pipelines (specifically the gene outputs-- I haven't touched the TE data yet), I noticed I had far fewer differentially expressed genes in the TEtranscripts pipeline. In the featurecounts pipeline, I find about 1600 DEGs (p value cutof of .05), while in the TEtranscripts pipeline, I get about 300. Digging a little deeper, I found that the basemeans from the TEtranscripts pipeline were far, far lower than in the other output. As an example, one gene has a basemean of about 564 in the featurecounts pipeline but about 6 in the TEtranscripts pipeline. Another goes from almost 26,000 to a bit under 8,000. Hence, the change between each pipeline is inconsistent.

I ran featurecounts on the BAM files produced in the TEtranscripts pipeline and the outputs look nearly identical, so it does not seem to be a problem with STAR and the alignments. The changes are large enough that I cannot imagine it being a difference in normalization.

In addition, I have run both the pipelines on other data in the past and the outputs end up very similar-- maybe a small difference in basemeans but nothing that couldn't be explained by normalization differences. This is either a new issue or is unique to this dataset.

Any ideas on what may be causing this? Let me know if there is other information that might be useful.

Thanks so much, Thomas

olivertam commented 3 months ago

Hi,

Thank you for your interest in the software.

Are you working with a stranded library, or unstranded? A most common cause of differences in baseMean could be the selection of the wrong strandedness for either of the analysis. For example, if the library is unstranded and you ran it in stranded mode, then the counts would be much lower. If you don't mind sharing the command lines for both featureCounts and TEtranscripts run, I can see if there's an obvious difference there. Another thing you could also provide would be the log output of TEtranscripts as well, which might contain hints.

Thanks.

murphytho commented 3 months ago

Thanks for your quick response!

The dataset is reverse stranded. The TEtranscripts code: TEtranscripts -t STAR_TETranscripts2/A1_CH077_TE_Aligned.sortedByCoord.out.bam STAR_TETranscripts2/X_CH077_TE_Aligned.sortedByCoord.out.bam STAR_TETranscripts2/Y_CH077_TE_Aligned.sortedByCoord.out.bam STAR_TETranscripts2/Z_CH077_TE_Aligned.sortedByCoord.out.bam -c STAR_TETranscripts2/A1_mock_TE_Aligned.sortedByCoord.out.bam STAR_TETranscripts2/X_mock_TE_Aligned.sortedByCoord.out.bam STAR_TETranscripts2/Y_mock_TE_Aligned.sortedByCoord.out.bam STAR_TETranscripts2/Z_mock_TE_Aligned.sortedByCoord.out.bam --GTF TETranscripts_refs/gencode.v44.primary_assembly.annotation.gtf --TE TETranscripts_refs/GRCh38_GENCODE_rmsk_TE.gtf --stranded reverse --sortByPos --project CH077_to_mock_sense_only --outdir TETranscripts_out2/CH077_to_mock_sense_only

With the output (I'm only showing an example of stats from one alignment-- hopefully this is enough: In library STAR_TETranscripts2/Z_mock_TE_Aligned.sortedByCoord.out.bam: Total annotated reads = 2888411 Total non-uniquely mapped reads = 8451634 Total unannotated reads = 12305

Example featurecounts code: featureCounts -T 10 -s 2 -t exon -g gene_id -a TETranscript_refs/gencode.v44.primary_assembly.annotation.gtf -o featurecounts_out_exon/X_CH077_hg38_featurecounts_exon.txt STAR_TETranscripts2/X_CH077_TE_Aligned.sortedByCoord.out.bam

So just to clarify, when I use TEtranscripts as I show above, I get super low base means and far fewer DEGs compared to when I run featurecounts and then DESeq2 separately. When I use other alignment files (made with default STAR settings) I get very similar or identical results to using the alignment files bound for TEtranscripts.

olivertam commented 3 months ago

Hi,

A couple more things: 1) Could you print out the header for the BAM that you used for TEtranscripts?

$ samtools view -H [BAM]

2) Could you provide the count tables for the two pipelines (e.g. CH077_to_mock_sense_only.cntTable and X_CH077_hg38_featurecounts_exon.txt)? I want to see if how different the counts are between samples, as that might give me a clue what could be causing the huge difference.

Thanks.

murphytho commented 3 months ago

Here is the header of an example BAM:

samtools view -H A1_CH077_TE_Aligned.sortedByCoord.out.bam @HD VN:1.4 SO:coordinate @SQ SN:chr1 LN:248956422 @SQ SN:chr2 LN:242193529 @SQ SN:chr3 LN:198295559 @SQ SN:chr4 LN:190214555 @SQ SN:chr5 LN:181538259 @SQ SN:chr6 LN:170805979 @SQ SN:chr7 LN:159345973 @SQ SN:chr8 LN:145138636 @SQ SN:chr9 LN:138394717 @SQ SN:chr10 LN:133797422 @SQ SN:chr11 LN:135086622 @SQ SN:chr12 LN:133275309 @SQ SN:chr13 LN:114364328 @SQ SN:chr14 LN:107043718 @SQ SN:chr15 LN:101991189 @SQ SN:chr16 LN:90338345 @SQ SN:chr17 LN:83257441 @SQ SN:chr18 LN:80373285 @SQ SN:chr19 LN:58617616 @SQ SN:chr20 LN:64444167 @SQ SN:chr21 LN:46709983 @SQ SN:chr22 LN:50818468 @SQ SN:chrX LN:156040895 @SQ SN:chrY LN:57227415 @SQ SN:chrM LN:16569 @SQ SN:GL000008.2 LN:209709 @SQ SN:GL000009.2 LN:201709 @SQ SN:GL000194.1 LN:191469 @SQ SN:GL000195.1 LN:182896 @SQ SN:GL000205.2 LN:185591 @SQ SN:GL000208.1 LN:92689 @SQ SN:GL000213.1 LN:164239 @SQ SN:GL000214.1 LN:137718 @SQ SN:GL000216.2 LN:176608 @SQ SN:GL000218.1 LN:161147 @SQ SN:GL000219.1 LN:179198 @SQ SN:GL000220.1 LN:161802 @SQ SN:GL000221.1 LN:155397 @SQ SN:GL000224.1 LN:179693 @SQ SN:GL000225.1 LN:211173 @SQ SN:GL000226.1 LN:15008 @SQ SN:KI270302.1 LN:2274 @SQ SN:KI270303.1 LN:1942 @SQ SN:KI270304.1 LN:2165 @SQ SN:KI270305.1 LN:1472 @SQ SN:KI270310.1 LN:1201 @SQ SN:KI270311.1 LN:12399 @SQ SN:KI270312.1 LN:998 @SQ SN:KI270315.1 LN:2276 @SQ SN:KI270316.1 LN:1444 @SQ SN:KI270317.1 LN:37690 @SQ SN:KI270320.1 LN:4416 @SQ SN:KI270322.1 LN:21476 @SQ SN:KI270329.1 LN:1040 @SQ SN:KI270330.1 LN:1652 @SQ SN:KI270333.1 LN:2699 @SQ SN:KI270334.1 LN:1368 @SQ SN:KI270335.1 LN:1048 @SQ SN:KI270336.1 LN:1026 @SQ SN:KI270337.1 LN:1121 @SQ SN:KI270338.1 LN:1428 @SQ SN:KI270340.1 LN:1428 @SQ SN:KI270362.1 LN:3530 @SQ SN:KI270363.1 LN:1803 @SQ SN:KI270364.1 LN:2855 @SQ SN:KI270366.1 LN:8320 @SQ SN:KI270371.1 LN:2805 @SQ SN:KI270372.1 LN:1650 @SQ SN:KI270373.1 LN:1451 @SQ SN:KI270374.1 LN:2656 @SQ SN:KI270375.1 LN:2378 @SQ SN:KI270376.1 LN:1136 @SQ SN:KI270378.1 LN:1048 @SQ SN:KI270379.1 LN:1045 @SQ SN:KI270381.1 LN:1930 @SQ SN:KI270382.1 LN:4215 @SQ SN:KI270383.1 LN:1750 @SQ SN:KI270384.1 LN:1658 @SQ SN:KI270385.1 LN:990 @SQ SN:KI270386.1 LN:1788 @SQ SN:KI270387.1 LN:1537 @SQ SN:KI270388.1 LN:1216 @SQ SN:KI270389.1 LN:1298 @SQ SN:KI270390.1 LN:2387 @SQ SN:KI270391.1 LN:1484 @SQ SN:KI270392.1 LN:971 @SQ SN:KI270393.1 LN:1308 @SQ SN:KI270394.1 LN:970 @SQ SN:KI270395.1 LN:1143 @SQ SN:KI270396.1 LN:1880 @SQ SN:KI270411.1 LN:2646 @SQ SN:KI270412.1 LN:1179 @SQ SN:KI270414.1 LN:2489 @SQ SN:KI270417.1 LN:2043 @SQ SN:KI270418.1 LN:2145 @SQ SN:KI270419.1 LN:1029 @SQ SN:KI270420.1 LN:2321 @SQ SN:KI270422.1 LN:1445 @SQ SN:KI270423.1 LN:981 @SQ SN:KI270424.1 LN:2140 @SQ SN:KI270425.1 LN:1884 @SQ SN:KI270429.1 LN:1361 @SQ SN:KI270435.1 LN:92983 @SQ SN:KI270438.1 LN:112505 @SQ SN:KI270442.1 LN:392061 @SQ SN:KI270448.1 LN:7992 @SQ SN:KI270465.1 LN:1774 @SQ SN:KI270466.1 LN:1233 @SQ SN:KI270467.1 LN:3920 @SQ SN:KI270468.1 LN:4055 @SQ SN:KI270507.1 LN:5353 @SQ SN:KI270508.1 LN:1951 @SQ SN:KI270509.1 LN:2318 @SQ SN:KI270510.1 LN:2415 @SQ SN:KI270511.1 LN:8127 @SQ SN:KI270512.1 LN:22689 @SQ SN:KI270515.1 LN:6361 @SQ SN:KI270516.1 LN:1300 @SQ SN:KI270517.1 LN:3253 @SQ SN:KI270518.1 LN:2186 @SQ SN:KI270519.1 LN:138126 @SQ SN:KI270521.1 LN:7642 @SQ SN:KI270522.1 LN:5674 @SQ SN:KI270528.1 LN:2983 @SQ SN:KI270529.1 LN:1899 @SQ SN:KI270530.1 LN:2168 @SQ SN:KI270538.1 LN:91309 @SQ SN:KI270539.1 LN:993 @SQ SN:KI270544.1 LN:1202 @SQ SN:KI270548.1 LN:1599 @SQ SN:KI270579.1 LN:31033 @SQ SN:KI270580.1 LN:1553 @SQ SN:KI270581.1 LN:7046 @SQ SN:KI270582.1 LN:6504 @SQ SN:KI270583.1 LN:1400 @SQ SN:KI270584.1 LN:4513 @SQ SN:KI270587.1 LN:2969 @SQ SN:KI270588.1 LN:6158 @SQ SN:KI270589.1 LN:44474 @SQ SN:KI270590.1 LN:4685 @SQ SN:KI270591.1 LN:5796 @SQ SN:KI270593.1 LN:3041 @SQ SN:KI270706.1 LN:175055 @SQ SN:KI270707.1 LN:32032 @SQ SN:KI270708.1 LN:127682 @SQ SN:KI270709.1 LN:66860 @SQ SN:KI270710.1 LN:40176 @SQ SN:KI270711.1 LN:42210 @SQ SN:KI270712.1 LN:176043 @SQ SN:KI270713.1 LN:40745 @SQ SN:KI270714.1 LN:41717 @SQ SN:KI270715.1 LN:161471 @SQ SN:KI270716.1 LN:153799 @SQ SN:KI270717.1 LN:40062 @SQ SN:KI270718.1 LN:38054 @SQ SN:KI270719.1 LN:176845 @SQ SN:KI270720.1 LN:39050 @SQ SN:KI270721.1 LN:100316 @SQ SN:KI270722.1 LN:194050 @SQ SN:KI270723.1 LN:38115 @SQ SN:KI270724.1 LN:39555 @SQ SN:KI270725.1 LN:172810 @SQ SN:KI270726.1 LN:43739 @SQ SN:KI270727.1 LN:448248 @SQ SN:KI270728.1 LN:1872759 @SQ SN:KI270729.1 LN:280839 @SQ SN:KI270730.1 LN:112551 @SQ SN:KI270731.1 LN:150754 @SQ SN:KI270732.1 LN:41543 @SQ SN:KI270733.1 LN:179772 @SQ SN:KI270734.1 LN:165050 @SQ SN:KI270735.1 LN:42811 @SQ SN:KI270736.1 LN:181920 @SQ SN:KI270737.1 LN:103838 @SQ SN:KI270738.1 LN:99375 @SQ SN:KI270739.1 LN:73985 @SQ SN:KI270740.1 LN:37240 @SQ SN:KI270741.1 LN:157432 @SQ SN:KI270742.1 LN:186739 @SQ SN:KI270743.1 LN:210658 @SQ SN:KI270744.1 LN:168472 @SQ SN:KI270745.1 LN:41891 @SQ SN:KI270746.1 LN:66486 @SQ SN:KI270747.1 LN:198735 @SQ SN:KI270748.1 LN:93321 @SQ SN:KI270749.1 LN:158759 @SQ SN:KI270750.1 LN:148850 @SQ SN:KI270751.1 LN:150742 @SQ SN:KI270752.1 LN:27745 @SQ SN:KI270753.1 LN:62944 @SQ SN:KI270754.1 LN:40191 @SQ SN:KI270755.1 LN:36723 @SQ SN:KI270756.1 LN:79590 @SQ SN:KI270757.1 LN:71251 @PG ID:STAR PN:STAR VN:2.7.10b CL:STAR --runThreadN 12 --genomeDir TETranscripts_refs/TE_index2/ --readFilesIn fastp_out/A1_CH077.fastq.gz --readFilesCommand zcat --outFileNamePrefix STAR_TETranscripts2/A1_CH077_TE_ --outSAMtype BAM SortedByCoordinate --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 @CO user command line: STAR --runThreadN 12 --genomeDir TETranscripts_refs/TE_index2/ --readFilesIn fastp_out/A1_CH077.fastq.gz --readFilesCommand zcat --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix STAR_TETranscripts2/A1_CH077_TE_

Here are those two files. I had to zip the featurecounts output. The featurecounts table is for X_CH077 if you want to compare it directly. X_CH077_hg38_featurecounts_exon_test.txt.gz CH077_to_mock_sense_only.cntTable.txt

It seems to me that it's a counting problem? The same genes has far fewer counts in the TEtranscripts cnt table?

olivertam commented 3 months ago

Hi,

You are right. The gene counts are approximately half in the TEtranscripts count table vs the featureCounts table. Just by looking at it, I can't seem to work out why there is a difference. Given your FeatureCounts command line, you should be only counting uniquely mapped reads there as well, which is even weirder. I hate to bother you further, but would it be possible to give me one of your BAM files (or a small-ish subset with at least 1 million alignments)? I just want to see whether I can reproduce the issue on my end (I just tested on a file that I have previously analyzed, and I get pretty similar counts for GAPDH and ACTB between featureCounts and TEtranscripts).

Thanks.

murphytho commented 3 months ago

Gladly! Again, I've used TEtranscripts in the past and it worked great (at least-- I'm pretty sure it did!)

I'm not sure this has any meaning whatsoever, but I am also running these files through a program called Telescope (https://github.com/mlbendall/telescope) and I am having trouble there too. It's a totally different pipeline (I used Bowtie2 for that), so it is likely to be a completely different issue, but essentially I am getting 0 differentially expressed HERVs when I expect hundreds.

Github has file size limits, so I figured I would upload to a google drive so you can have the whole BAM. Just say the word and I will upload raw read file too.

Here is the code used to make the alignment: STAR --runThreadN 12 --genomeDir TETranscripts_refs/TE_index2/ --readFilesIn fastp_out/X_CH077.fastq.gz --readFilesCommand zcat --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix STAR_TETranscripts2/X_CH077_TE_

And below is a link to the alignment. It's only 1gb, is this low? The data is single-end, 50 bp reads, so I chalked up the small size to that. https://drive.google.com/file/d/1766RNBL-tzdOeMJ1kwIHEqD1h-iiiHQN/view?usp=sharing

Thanks again for your help.

olivertam commented 3 months ago

Hi,

I'm testing your file now. I can't say whether 1Gb is a small file or not. I would definitely check the mapping logs from STAR to make sure that you're getting around 70-80% mapping rate. I'm also curious as to how you're building your index. Since we don't have splicing information on TE, we typically use a standard STAR index (with the corresponding gene GTF for splice junctions). I'm sure that shouldn't affect it, but I am just curious. I will get back to you once I run the tests.

Thanks.

murphytho commented 3 months ago

I actually had the same question about the index, So I tried it both ways— with the human gtf and the TE gtf. No difference.

As for STAR stats, I get around 70-80% uniquely mapped reads and 15-20% multimapped reads from STAR in the TEtranscripts pipeline. Pretty good from my perspective?

olivertam commented 3 months ago

Yes, that's what you would expect in most cases. And I was able to replicate your error. I'm now exploring further.

Thanks.

olivertam commented 3 months ago

Hi,

Thank you for your patience. I think I have finally worked out the issue. It appears that your sequence name (which are all numbers) are not unique. I don't know why that is the case (you might want to check your FASTQ), but because TEtranscripts assumes that each read name is unique, it groups together alignments with the same read name and treats it as a single read.

Here are the command lines I used to determine this:

# Selecting for unique mappers with the tag of 'NH:i:1'
$ samtools view X_CH077_TE_Aligned.sortedByCoord.out.bam | grep -w "NH:i:1" | wc -l
21707346
$ samtools view X_CH077_TE_Aligned.sortedByCoord.out.bam | grep -w "NH:i:1" | cut -f 1 | sort -u | wc -l
7522608

FeatureCounts doesn't have this issue because it's looking for the NH:i tag to determine unique/multi, rather than depending on the read name. I believe the rationale for TEtranscripts approach is that we initially developed the software to handle aligners that don't use the NH:i tag (e.g. bowtie2), but since STAR became the recommended option (and it does generate that tag), perhaps we could adapt to featureCounts' approach to avoid encountering this particular issue (though I still find it unusual to have the same read ID/name for multiple FASTQ sequences).

Thanks for identifying this quirky bug.

murphytho commented 3 months ago

Of course! I noticed this when I first got the files, and I did some looking around to see if it would be an issue, but couldn’t find many people who had dealt with files like this. When featurecounts worked fine, I sort of just forgot about it. I bet this is the source of my other issues too.

I pooled fastqs from 4 lanes to make these files. Would the easiest fix be to simply go back and add a unique identifier to each original fastq? For example, adding a, b, c, and d so that the concatenated fastq has no duplicates?

Thank you so much for working this out. I would have been banging my head on the wall for at least another couple weeks before thinking of this!

olivertam commented 3 months ago

Hi,

Your suggestion of adding something to the read name for each FASTQ file would probably do the trick. E.g.:

COUNTER=0
for FILE in *.fastq
do
    COUNTER=$((COUNTER+1))
    sed -i '/^>/s/>/>lib${COUNTER}_/' ${FILE}
done

cat *.fastq > combined.fastq

Good luck with the new run, and let me know if you encounter more issues.

Thanks

murphytho commented 3 months ago

I will try it out! Thank you so much again for your help. You’ve saved me hours!

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

mhammell-laboratory / TEtranscripts

Very low basemeans compared to another pipeline #183