mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
229 stars 30 forks source link

Transposable elements have zero counts in .cntTable #136

Closed 14stutzmanav closed 1 year ago

14stutzmanav commented 1 year ago

Hi there,

I am trying to use a custom list of transposable elements that can be found here: https://github.com/bergmanlab/drosophila-transposons/tree/master/current. I used the makeTEgtf perl script that is recommended in other issues to convert transposon_sequence_set.gff to a gtf file. Then I aligned my data to a fasta file was made by combining dm6 and transposon_sequence_set.fa. I used those bam files as input to TEtranscripts with dm6 genes as the gtf for genes, and the output of makeTEgtf.pl as the TE gtf. Here are a couple lines from the resulting TEtranscripts_out.cntTable:

"zpg"   380     244     476     10      7     
"zuc"   254     220     231     261     149   
"zyd"   44      44      431     57      40    
"zye"   159     1675    1073    507     830   
1:1:1   264637  198826  269381  111467  75378 
5S_DM:RNA:RNA   0       0       0       0       0     
ACCORD2_I-int:Gypsy:LTR 0       0       0       0       0     
ACCORD2_LTR:Gypsy:LTR   0       0       0       0       0  

The protein coding genes are in quotes but the transposable element names are not. Is this expected behavior? Furthermore, I have no counts at any of my transposable elements. I am worried that I converted the TE gtf incorrectly.

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software.

I looked at the gff file that is provided by the Bergman lab, and it appears to be annotations based on the TE sequence (i.e. the "genome" is the TE sequences themselves, e.g. FBte0000104). Since TEtranscripts is taking a BAM file after aligning to the (in your case) Drosophila genome, it won't be able to annotate to the TE since there will no alignments for those TE sequences. This is why the gene annotations worked, but the TE did not. And yes, the quotes around the genes are expected.

In order to use the Bergman annotations, you will need to align the TE that they have identified (e.g. with their FASTA file) onto your Drosophila genome build, and then convert the GFF coordinates to the Drosophila genomic coordinates in order to use those annotations.

Hope this is informative. Please let us know if you have further questions.

Thanks

14stutzmanav commented 1 year ago

Thank you so much for your fast response!

When I generated these files, I used a bam file that was aligned to a "custom" fasta that I made by doing cat dm6.fa transposon_sequence_set.fa > custom.fa. I'm not sure that the TEs in transposon_sequence_set.fa are found in the Drosophila annotation. Perhaps I should try aligning to transposon_sequence_set.fa on its own?

olivertam commented 1 year ago

Hi,

I'm afraid that you might have to align to the FASTA separately to "annotate" these TE. I wonder if you could reach out to the group who generated the data to see if there are genomic coordinates available for Drosophila, or if these are in the "unmappable" portions of that genome.

Thanks.

14stutzmanav commented 1 year ago

Hi Oliver,

Just to make sure I understand, you suggest aligning my data to the transposon FASTA and then use that Bam as input for TEtranscripts? I know there are transposons represented in the Bergman Lab's list that fall in the unmappable portions of the dm6 genome.

Thank you again!

olivertam commented 1 year ago

Hi,

I think it might be best to align your data to the FASTA file, and then quantify the number of reads that align to each TE (bypassing the use of TEtranscripts). If the TE annotations contains introns, then you "could" try to use TEtranscripts to quantify the TE (using the TE-aligned BAM as input, the original gene GTF as genic GTF, and your custom TE GTF). You would not get any gene counts, but might get the TE counts.

You can then try to combine either approaches with the gene counts from TEtranscripts.

It is not the design of our tool, but it might be the most usable in your situation.

Thanks.

zjanna commented 1 year ago

Hi, I tried to use TEtranscripts to get some information about transposable elements. First I run the following code to make the star index:

STAR --runThreadN 10 --runMode genomeGenerate --genomeDir ~/index/ --genomeFastaFiles ~/index/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile ~/index/Homo_sapiens.GRCh38.98.gtf 

and after that I just aligned my data into this reference:

STAR --runThreadN 64 --genomeDir ${ref1}/index/ --sjdbGTFfile ${ref1}/Homo_sapiens.GRCh38.98.gtf \
 --sjdbOverhang 100 --readFilesIn ${PATH_in}_trimmed_1.fastq ${PATH_in}_trimmed_2.fastq \
 --outSAMtype BAM Unsorted --winAnchorMultimapNmax 200 --outFilterMultimapNmax 100 \
 --outFileNamePrefix ${PATH_out}

I used those bam files as input to TEtranscripts:

TEtranscripts --format BAM --stranded reverse -t /home/users/zjanna/DANE_IGcz/MGR/TEtranscripts/BAM_trimmed/case_RBMXL3IAligned.out.bam /home/users/zjanna/DANE_IGcz/MGR/TEtranscripts/BAM_trimmed/case_RBMXL3IIAligned.out.bam \
-c /home/users/zjanna/DANE_IGcz/MGR/TEtranscripts/BAM_trimmed/control_eGFP_IAligned.out.bam /home/users/zjanna/DANE_IGcz/MGR/TEtranscripts/BAM_trimmed/control_eGFP_IIAligned.out.bam \
--GTF ~/idex/Homo_sapiens.GRCh38.98.gtf \
--TE ~index/hg38_rmsk_TE.gtf \
--mode multi --project case_vs_control --minread 1 -i 10 --padj 0.05 

And from this I got following information:

[E::idx_find_and_load] Could not retrieve index file for '/home/users/zjanna/DANE_IGcz/MGR/TEtranscripts/BAM_trimmed/case_RBMXL3IAligned.out.bam'

Finally, I got the out.cntTable but I have no counts at any of my transposable elements. Do you know the reason for this? Why the index file for my bam's couldn't be retrieved?

olivertam commented 1 year ago

Hi,

Thank you for your interest in the software.

The index warning message has no impact on TEtranscripts (see issue #82).

The reason that you are getting no counts in the TE is because there is a mismatch in the chromosome names between the genome build (which is GRCh38 Ensembl) and your TE GTF (which is UCSC hg38). If you did obtain your FASTA and gene GTF from Ensembl, we would recommend using this TE GTF instead.

Please let us know if you encounter further issues.

Thanks

zjanna commented 1 year ago

Thank you very much for your quick reply. This will solve my problem! Best, Joanna

zjanna commented 1 year ago

Hi, unfortunately replacing the gif file didn't solve the problem as I thought. After the code:

TEtranscripts --format BAM --stranded reverse -t ~/case_RBMXL3IAligned.out.bam ~/case_RBMXL3IIAligned.out.bam \
-c ~/control_eGFP_IAligned.out.bam ~/control_eGFP_IIAligned.out.bam \
--GTF ~/index/Homo_sapiens.GRCh38.98.gtf --outdir ~/TEtranscripts/ \
--TE ~/index/GRCh38_Ensembl_rmsk_TE.gtf --mode multi --project case_vs_control --minread 1 -i 10 --padj 0.05

I'm getting the following message now:

INFO  @ Wed, 03 May 2023 17:41:01: Done building gene index ...... 

INFO  @ Wed, 03 May 2023 17:41:14: Building TE index ....... 

Error in building TE index 

I can't guess what could be the problem..

olivertam commented 1 year ago

Hi,

I'm assuming that you decompressed the GTF file after downloading (but if not, that could be an issue). The command line looks fine, and judging from the fact that it worked previously, your input files should be fine. I'm now testing the GTF on my end, but if you could confirm that you did decompressed the downloaded file, then that would eliminate one cause.

Thanks.

olivertam commented 1 year ago

Hi Joanna,

I'm afraid I can't recapitulate the error (I was able to build the TE index). If your issue isn't resolved, please feel free to send me one of your BAM files, and I can troubleshoot with that.

Thanks.

zjanna commented 1 year ago

Hi, yes I made:

wget https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/GRCh38_Ensembl_rmsk_TE.gtf.gz

gunzip GRCh38_Ensembl_rmsk_TE.gtf.gz

and now I have just gtf file..

zjanna commented 1 year ago

where can I send you BAM file?

olivertam commented 1 year ago

My email is tam at cshl dot edu. You can just take the first 10000 lines of your BAM file:

$ samtools view -h [BAM] | head -n 10000 | samtools view -b -o [new BAM] -

If it's small enough, it should make it through email. If not, I have Dropbox if you need me to share a folder for you to transfer the file into.

zjanna commented 1 year ago

Thank you for your help. I will try!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days