mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Failure to recognize sorted ATACseq BAM file #45

Closed dango147 closed 4 years ago

dango147 commented 4 years ago

I was trying to use TEtoolkit to quantify signal at transposable elements from a chromatin accessibility assay (ATAC-seq). At first, I tried aligning the ATAC-seq data with BWA aln/sampe, removing duplicates with PicardTools, and sorting with samtools. I used these BAM files with TEtoolkit as follows:

TEcount -b file.bam --GTF gene_annotation.gtf --TE TE_annotation.gtf --sortByPos --stranded no --mode multi --project outputTE

Unfortunately, I get the following error:

**NOT COMPLETE** If the BAM file is sorted by coordinates, please specify --sortByPos and re-run!

Since I saw others have had similar problems in one of the other threads, I also tried sorting by read name with samtools and excluding the sortByPos argument, and still got the same error.

Then, since I have used TEtoolkit with RNA-seq data (mapped with STAR) with no problems, I decided to use STAR for ATACseq alignment and duplicate removal. I sorted the resulting BAM file with samtools and used that as input for TEtoolkit, with the same arguments (including --sortByPos), and got the same error again. Now, I am wondering if this is a problem caused by samtools sort, or if it is happening because I am aligning ATACseq data instead of RNAseq? Any input would be much appreciated, thank you.

olivertam commented 4 years ago

Hi,

Thank you for your interest in the TEtoolkit. We have had queries about ATAC-seq libraries not working with TEtranscripts, and we think that there are many differences in the pipeline that might make them incompatible at this stage.

In our opinion, ATAC-seq (like ChIP-seq) is more focused on finding the specific position/locus of the transposable elements, whereas the TEtoolkit is designed to quantify TE through pooling of multiple loci of the same transposable element sub-family (e.g. L1HS). We are in the process of developing tools that are more suitable for these locus-centric applications, but we don't think that TEtoolkit can currently do this analysis properly.

Please let me know if you have other questions. Thanks

dango147 commented 4 years ago

Hello,

The pipeline is just STAR, nothing else. I would like to get this working as a test to validate other observations, and I am still having issues.

The initial output from STAR alignment (sorted by coordinate by STAR) works with TEtoolkit; I get counts. However, I would like to remove duplicates from my data, and when I use STAR to remove duplicates (again the output is sorted by coordinate with STAR), the resulting alignment file is incompatible with TEtoolkit, producing the initial error that I reported.

Could this problem be related to the duplicate flags?

Thank you for your input!

olivertam commented 4 years ago

Hi,

Based on where the error message appears to originate, it appears that there might be singletons in the BAM files (where only one end of the read is mapped). One way to check this is to get the read names of all the alignments, and check that they are all paired (i.e. all read names are present in multiples of 2). STAR might have removed/marked only one of the paired alignments when doing the duplicate removal, but we haven't tested this functionality much.

Thanks.

olivertam commented 4 years ago

Hi,

If you could send a small excerpt of your BAM file (say around 10,000 alignments), I can try to troubleshoot it further. You can email it to tam at cshl dot edu

Thanks