mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Tiny differences for two identical BAM files (except for read headers) #110

Closed: raquelgarza closed this issue 1 year ago

raquelgarza commented 2 years ago

Hi!

I have two BAM files where the contents are identical except for the headers of the reads. I ran TEcounts for both and I am getting tiny differences that I cannot explain... For example, gene_A using the first BAM file has 50 counts in the output cntTable, and the second BAM file has 51. This is the case for many genes and TE subfamilies (not all of them though).

The only difference between the two BAM files are the headers of the reads. I checked this as:

samtools view file1.sam | awk '{$1=""; print $0}' | sort | md5sum
70c5f50298fd9d9e85f5a302a0467902  

samtools view file2.sam | awk '{$1=""; print $0}' | sort | md5sum
70c5f50298fd9d9e85f5a302a0467902  

The headers in one of the files are standard (e.g. A00681:301:HWFL5DMXX:1:1214:5050:5854), and in the other one I used --outSAMreadID Number from STAR, which outputs the reads with numeric IDs.

I am running TEcounts as

TEcount -b file1.sortedByCoord.bam --GTF gencode.v38.annotation.gtf --TE hg38_rmsk_TEtranscripts.gtf --format BAM --stranded yes --mode multi --sortByPos --project file1_

Any idea why this could happen?

Thanks!

olivertam commented 2 years ago

Hi,

I have a few questions:

  1. What is the version of TEcount that you're using?
  2. Could you confirm that you're getting the same number of "distinct" read IDs in both files? I.e., if you run the following, are the values the same?

samtools view file[1/2].bam | cut -f 1 | sort -u | wc -l

  3. Could you confirm that you're using the exact same parameters for the maximum number of allowed alignments (outFilterMultimapNmax)?

If answering those doesn't resolve the issue, we could take a look at a sample of the BAM files and determine if there are differences in how alignments are assigned to the expected read IDs.

Thanks.

raquelgarza commented 2 years ago

Hi,

Thank you so much for the quick reply! :-)

  1. I'm using version 2.0.3.
  2. Yes, the numbers check out using the command you sent.
  3. Yes, I'm using the exact same parameters when mapping.

How do you prefer me to send you the BAM files?

Best,

olivertam commented 2 years ago

Hi,

Would you mind trying the latest version of TEcount (v2.2.1)? There might have been a minor fix that addresses this. How big are your BAM files? If they are >5 GB, let us first try to replicate your error with our test files; if we can't reproduce it (and you still can with the latest TEcount), we can arrange to transfer them over.

Thanks.

olivertam commented 2 years ago

Hi,

Just a quick update. We ran TEcount v2.2.1 on a BAM that used either the original read IDs or just numbers. We do notice slight differences in the counts, but mainly for the TEs, and usually by a small number. I suspect this might be due to slight variation in the "seeding" or running of the EM steps, such that some reads are redistributed slightly differently. In the off-by-one cases, I suspect rounding differences as well. If you are noticing issues with many genes, please let us know.
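
To illustrate what I mean, here is a toy sketch (this is not TEcount's actual code; the feature names and the tie-breaking rule are made up) of how a seed- or order-dependent tie-break for equally weighted multi-mapped reads can shift a handful of counts between features:

# Toy illustration only (not TEcount's actual code): when a multi-mapped read
# ends up with (near-)equal EM weights for two features, the final integer
# assignment has to break the tie somehow. If that tie-break depends on a seed
# or on the order in which reads are encountered, two runs over "identical"
# alignments can differ by a count or two per feature.

import random

def finalize_counts(reads, seed):
    """Collapse per-feature weights into integer counts, breaking exact ties
    with a seeded RNG."""
    rng = random.Random(seed)
    counts = {"geneA": 0, "geneB": 0}
    for read_id, weights in reads:
        # pick the feature with the highest weight; exact ties fall to the RNG
        best = max(weights, key=lambda feat: (weights[feat], rng.random()))
        counts[best] += 1
    return counts

# 100 reads that map equally well to geneA and geneB
reads = [(f"read{i}", {"geneA": 0.5, "geneB": 0.5}) for i in range(100)]

print(finalize_counts(reads, seed=1))  # roughly a 50/50 split
print(finalize_counts(reads, seed=2))  # a slightly different split with another seed

Running this with two different seeds gives slightly different splits even though the input reads are identical, which is the kind of off-by-a-few difference we saw.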

Thanks.

raquelgarza commented 2 years ago

Hi,

I updated my software to v2.2.1 and reran the files. Between these two files I get differences in ~140 genes and ~224 TE subfamilies. My gene GTF file has 60,660 genes (grep -w gene gencode.v36.annotation.gtf | wc -l), so this is a small difference indeed.

I did an extra test. For it, I made three files: (1) one with the original headers, (2) one with numeric headers, and (3) another with numeric headers but numbered in a different order than file (2).

So, for example, for a given read, file1.bam has the header A00681:301:HWFL5DMXX:1:1214:5050:5854, file2.bam has the header 10000, and file3.bam has the header 30000.

Interesting observations:

The numeric-header files (file2.bam and file3.bam) each differ from the standard-header file (file1.bam) in a similar number of features:

diff file1.cntTable file2.cntTable | awk '{print $2}' | sort | uniq | wc -l
364
diff file1.cntTable file3.cntTable | awk '{print $2}' | sort | uniq | wc -l
396

The difference between the two numeric-header files is much smaller:

diff file2.cntTable file3.cntTable | awk '{print $2}' | sort | uniq | wc -l
163

I also scrolled through the diffs for a while and found some genes that were off by ~10 counts (I took note of ENSG00000233024.7 and ENSG00000183889.12, but I can probably find more if you want more examples).

A seeding issue was my thought too, but I don't notice this issue if I run the same file twice. Is the seed in any way tied to the headers?

These are "small" BAM files of ~800 MB. Let me know if you want me to send them over.

olivertam commented 2 years ago

Hi,

Thanks for checking. We'll take a look and see if there is an obvious cause for this. I will provide instructions on uploading to your email.

Thanks.

olivertam commented 2 years ago

Oh, I just noticed that you're using GENCODE annotations. Please note that there are several ncRNAs that overlap significantly with TE annotations, and they could cause ambiguity in the quantification. We have been debating how best to integrate GENCODE and TE annotations given the overlap (i.e. some ncRNAs are just copies of TEs). I wonder if that's what is causing the differences, but we'll still take a look at it.
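
If you want a rough sense of how much the two annotations overlap, something along these lines would work (the file names are the ones from earlier in this thread; the all-pairs check below is naive and slow on full annotations, so a tool like bedtools intersect is more practical, but the idea is the same):

# Rough sketch: count how many GENCODE "gene" records overlap at least one
# record in the TE GTF. Assumes standard 9-column, tab-separated GTF files.

from collections import defaultdict

def read_gtf_intervals(path, feature=None):
    """Return {chrom: [(start, end), ...]} for records of the given feature type
    (or all records if feature is None)."""
    intervals = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or (feature and fields[2] != feature):
                continue
            intervals[fields[0]].append((int(fields[3]), int(fields[4])))
    return intervals

genes = read_gtf_intervals("gencode.v38.annotation.gtf", feature="gene")
tes = read_gtf_intervals("hg38_rmsk_TEtranscripts.gtf")

overlapping = 0
for chrom, gene_ivs in genes.items():
    te_ivs = tes.get(chrom, [])
    for g_start, g_end in gene_ivs:
        # GTF coordinates are 1-based and inclusive
        if any(g_start <= t_end and t_start <= g_end for t_start, t_end in te_ivs):
            overlapping += 1

print(f"{overlapping} gene records overlap at least one TE record")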

Thanks.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

annsophiegironne commented 1 year ago

Hi @olivertam! I hope you're doing well. I was going through the issues and stumbled upon this one. I ran TEcount on my samples using the following script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=40gb
#SBATCH --time=4:00:00

sample=$1
path=/u/gironnea/tnbc/scratch/asg/star

TEcount -b $path/${sample}_Aligned.sortedByCoord.out.bam \
        --stranded forward \
        --mode multi \
        --sortByPos \
        -L 85 \
        --project ${sample}_ \
        --outdir /u/gironnea/tnbc/scratch/asg/TE \
        --GTF /u/gironnea/tnbc/annotations/Homo_sapiens.GRCh38.Gencode40.gtf \
        --TE /u/gironnea/tnbc/annotations/GRCh38_GENCODE_rmsk_TE.gtf

and when running my differential expression analysis, I had no differentially expressed TE families at logFC > 1 at all... Could that possibly be linked to the GENCODE annotations? And if so, what annotations would you suggest using instead? I would just really be surprised if I had no significant differences at all. Thanks in advance!

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software.

We have looked more carefully into the GENCODE annotations, and while there is some overlap, I don't think it is having as big an impact as we thought (we also follow Cell Ranger's recommendations on GENCODE annotation filtering).

Regarding the fold changes, are you using DESeq2 for your differential analysis? We have noticed that, due to the "higher" read counts of TE subfamilies, we don't often see huge log2 fold changes (l2fc) even for statistically significant TE changes (which we define as FDR < 0.05). We have typically used a l2fc threshold of 0.3 or 0.5 for them instead, though mainly for the purpose of being able to validate via another approach (e.g. qPCR).
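
For example, assuming the DESeq2 results table has been exported to CSV with the standard padj and log2FoldChange columns (the file name below is just a placeholder), those thresholds would look like this:

# Sketch of the thresholds described above, applied to a DESeq2 results table
# exported to CSV. Rows with padj of NA are excluded by the comparison.

import pandas as pd

res = pd.read_csv("TE_results.csv", index_col=0)

# FDR < 0.05, with a relaxed |log2FC| cutoff of 0.5 for TE subfamilies
sig_te = res[(res["padj"] < 0.05) & (res["log2FoldChange"].abs() > 0.5)]

print(f"{len(sig_te)} TE subfamilies pass FDR < 0.05 and |log2FC| > 0.5")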

Also, I'm assuming that you made sure your strandedness is correct, i.e. that you are sequencing the 2nd cDNA strand, and not the 1st strand (which would be reverse stranded).

Hope this is helpful.

Thanks.

annsophiegironne commented 1 year ago

Hi,

Yes, I use DESeq2 for my differential analysis, and yes, I also made sure of my strandedness. I'll keep looking into it. Thanks for the quick response; it's much appreciated.