mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Error thrown when using paired end BAM #159

Closed Bessie92 closed 6 months ago

Bessie92 commented 7 months ago

Hi there! I'm trying to run TEtranscripts on some mice data. I have BAM files generated from the STAR aligner, and the index file .bam.bai was generated using samtools (under the same directory). However, I run into the error:

Screen Shot 2023-11-06 at 1 35 34 PM

I do have --sortByPos specified in my command, and my bam files were sorted. I would greatly appreciate any insight and suggestion! Thank you.

olivertam commented 7 months ago

Hi,

Are you doing paired-end mapping, and are you ensuring that no unpaired reads are output into the BAM file (e.g. no singletons)? The error suggests that there are pairs where one of the read is unmapped/not present in the BAM. Otherwise, would you be mind posting an excerpt of your BAM file?

Thanks.

Bessie92 commented 7 months ago

Hi,

Thanks for the reply! Yes, I'm using paired-end mapping. I went through all my bam files using samtools flagstat and it seemed there were no singletons:

Screen Shot 2023-11-07 at 10 30 34 AM

Thank you!

olivertam commented 7 months ago

Hi,

Would you be willing to share the BAM file? I would like to see if I can reproduce the error.

Thanks.

olivertam commented 7 months ago

Actually, I just noticed a weird output. There are more read 1 than read 2, which would suggest that there are more alignments in one than the other. This would suggest that there are unpaired "alignments", rather than reads.

This would certainly caused issues. Would you mind sharing the BAM header of your file, or the STAR command line?

Thanks.

Bessie92 commented 7 months ago

Certainly. This is the header of one of my BAM file:

Screen Shot 2023-11-07 at 10 50 03 AM

And this is my STAR command:

my $STAR_cmd = "$STAR --runThreadN $num_thread --genomeDir $gIndices --readFilesIn $read1 $read2 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --
quantMode GeneCounts --outFilterMultimapNmax 1 --outSAMmultNmax 1 --outSJfilterReads Unique --outSAMstrandField intronMotif --outFilterMismatchNoverReadLmax 0.
01 --outSAMattributes All XS" ;

Thank you!

olivertam commented 7 months ago

Hi,

Thanks for your response. I have several comments: 1) I noticed that in your BAM file, A00977:127:HFTCLDSXY:2:1507:18819:26162 (line 3) is showing that it's a paired read with its pair mapped (column 8 indicates the genomic location of the pair), yet there is no entry for that line. If the BAM is sorted by coordinate, you should still see that pair's alignment on line 4 (instead of A00977:127:HFTCLDSXY:2:1303:27272:4977). This suggests that your BAM file is missing the pair's alignment, despite the flag (and column 8 of line 3) indicating otherwise. 2) You are using --outFilterMultimapNmax 1, which means that you're only taking unique mappers. While this is technically compatible with TEtranscripts, it does mean that all multi-mappers are removed, and thus removes most of the utility of using TEtranscripts (as opposed to other quantification methods). We typically recommend (for mouse) to allow --outFilterMultimapNmax 100. 3) You are using --outSAMmultNmax 1, which means that only one set of alignment will be produced per read. This is not problematic in your current run as you are only working with unique mappers, but it will be problematic if you want to include multimappers, as you will "randomly" select one alignment if more than one was identified. This would break the assumption of TEtranscripts, which expects to see all "applicable" alignments in order to quantify repetitive elements (e.g. TE).

To address your original question, it appears that there are missing alignments in your BAM file (despite the flags indicating otherwise), and thus TEtranscripts is throwing the error that you indicated. However, it also appears that your STAR parameters may not be optimal for quantifying TE using TEtranscripts, as you are discarding all multi-mappers.

Please let me know if you have further questions.

Thanks.

Bessie92 commented 7 months ago

Yes, seems like all my bam files contain different numbers of read 1 and 2. Don't exactly know why but I will use multimapping in STAR to regenerate the BAM files. Thank you so much for your suggestions!

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days