0 counts for TEs - Githubissues

savytskanatalia commented 4 years ago

Good afternoon! I am very interested in using TElocal and was trying it out with two "toy" simulated datasets when I ran into a problem: 0 counts are assigned to all TE loci. I simulated two mm9 RNA-seq datasets (~34 million and ~48 million reads respectively, with reads simulated for both genes and TEs), mapped them with STAR_2.5.3a (options --outFilterMultimapNmax 100 --winAnchorMultimapNmax 100), and quantified them with TElocal in both uniq and multi modes. For both datasets and both quantification modes, zero counts were assigned to TEs, while genes had non-zero counts. I also ran TElocal with RepMask.gtf and regular TEcounts. Both times, counts for TE subfamilies were non-zero (just as I would expect). So I was wondering what could cause no counts to be assigned at the locus level. Could there be a problem on my side, with my installation of TElocal or something else? I would really appreciate it if you could point me to the dataset you used for testing the tool, for which non-zero locus-level TE counts should be produced.

olivertam commented 4 years ago

Hi, Would you mind providing the command line that you used in your run? If you could also provide a few lines of your gene GTF file, RepMask.gtf file, and the header for your BAM file, that would also be very helpful. We are now assembling the test files, and will get back to you as soon as possible. Thanks.

olivertam commented 4 years ago

Hi, We have uploaded some test datasets for TElocal here. The details are described in this file. Please let me know if you have any issues. Thanks.

savytskanatalia commented 4 years ago

Hi! Thank you very much for the provided test files! I ran TElocal with them as described in the provided INFO.txt, and the results were identical to the test_data output count table (for SE), so I guess everything is OK with my TElocal installation...

The command line I used for unique mode was:

TElocal -b input/sample06.bam --GTF GTF/Mus_musculus.NCBIM37.67.gtf --TE GTF/mm9_rmsk_TElocus.ind --project TElocal_U_sample06 --mode uniq

The command line I used for multi mode was:

TElocal -b input/sample06.bam --GTF GTF/Mus_musculus.NCBIM37.67.gtf --TE GTF/mm9_rmsk_TElocus.ind --project TElocal_U_sample06

My BAM files are unsorted.

The first lines of my gene GTF file (that I also used for STAR genome index generation) are:

18  unprocessed_pseudogene  exon    3026901 3027882 .   -   .    gene_id "ENSMUSG00000093774"; transcript_id "ENSMUST00000176956"; exon_number "1"; gene_name "Vmn1r-ps151"; gene_biotype "pseudogene"; transcript_name "Vmn1r-ps151-001";
18  unprocessed_pseudogene  exon    3080778 3081476 .   -   .    gene_id "ENSMUSG00000093444"; transcript_id "ENSMUST00000176452"; exon_number "1"; gene_name "Vmn1r-ps152"; gene_biotype "pseudogene"; transcript_name "Vmn1r-ps152-001";
18  protein_coding  exon    3122455 3123465 .   -   .    gene_id "ENSMUSG00000091539"; transcript_id "ENSMUST00000165255"; exon_number "1"; gene_name "Vmn1r238"; gene_biotype "protein_coding"; transcript_name "Vmn1r238-001";

For TEtranscripts I use RepMask.gtf of the following format:

1   mm9_rmsk    exon    100000003   100000213   .   +   .   gene_id RMER19C; transcript_id RMER19C_dup153; family_id ERVK; class_id LTR;
1   mm9_rmsk    exon    100000302   100000475   .   +   .   gene_id MLT1E2; transcript_id MLT1E2_dup79; family_id MaLR; class_id LTR;
1   mm9_rmsk    exon    100000687   100000888   .   +   .   gene_id URR1A; transcript_id URR1A_dup630; family_id MER1_type; class_id DNA;

For locus-level quantification, I used the mm9_rmsk_TElocus.ind.gz that you provide. Now that I think about it, could the chromosome naming ("chr1" versus "1") be what led to 0 counts for TEs?
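A quick way to test the chromosome-naming suspicion is to compare the sequence names in the BAM header with the first column of the annotation. The following is only a minimal sketch, not part of TElocal: it assumes the header has already been dumped to text (for example with `samtools view -H`), that the annotation is available as GTF lines (the prebuilt .ind file itself is not parsed here), and all function names and the example records are hypothetical.

```python
def sam_header_contigs(header_text):
    """Collect the SN: values from the @SQ lines of a SAM/BAM text header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def annotation_contigs(gtf_lines):
    """Collect the first column (sequence name) of non-comment GTF lines."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split()[0])
    return names


def strip_chr(names):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return {n[3:] if n.startswith("chr") else n for n in names}


def compare_naming(bam_names, annot_names):
    """Report shared names and whether only a 'chr' prefix separates the sets."""
    shared = bam_names & annot_names
    prefix_mismatch = not shared and bool(strip_chr(bam_names) & strip_chr(annot_names))
    return {"shared": shared, "prefix_mismatch": prefix_mismatch}


# Illustrative inputs only: the LN values and TE records are placeholders.
header = "@HD\tVN:1.4\tSO:unsorted\n@SQ\tSN:1\tLN:1000\n@SQ\tSN:18\tLN:1000\n"
te_lines = [
    "chr1\tmm9_rmsk\texon\t100\t200\t.\t+\t.\tgene_id TE_A; transcript_id TE_A;",
    "chr18\tmm9_rmsk\texon\t300\t400\t.\t-\t.\tgene_id TE_B; transcript_id TE_B;",
]
result = compare_naming(sam_header_contigs(header), annotation_contigs(te_lines))
print(result)
```

If `shared` is empty and `prefix_mismatch` is true, the BAM and the annotation most likely differ only in the "chr" prefix, which would be consistent with reads never overlapping any annotated TE.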

Would it be possible for you to help me generate the .ind file based on the custom .gtf annotation file?

olivertam commented 4 years ago

Hi, It appears that you have identified the problem ("chr1" vs "1"). That is, unfortunately, a common cause of this type of error. I am looking at your RepMask.gtf file, and I don't think it can be used for locus-level quantification, as the gene_id (which is what we use for the annotation) is different from the transcript_id (which suggests that this file is more suited to TEcount/TEtranscripts). Currently, it takes a long time to generate the index (days), which is why we are trying to provide prebuilt ones where we can. You can either wait for us to build the index, or you can use the GTF file available here to run TElocal. This GTF uses the Ensembl chromosome nomenclature, and thus should be compatible with your BAM files. Please note that TElocal will still take time to build the index from it (and unfortunately will not save it), but at least you can get started while we build a loadable index. Please let me know if you have any questions. Thanks.
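The gene_id versus transcript_id point can be checked directly on a GTF before running anything. This is a minimal sketch under stated assumptions: it only tests whether the two attributes are equal on a line, which is a reading of the comment above and not a documented TElocal requirement; the function names are hypothetical, and the second example line is an invented locus-style record for contrast.

```python
import re

# Matches 'key value;' pairs, with or without double quotes around the value.
ATTR = re.compile(r'(\w+)\s+"?([^";]+)"?\s*;')


def gtf_attributes(line):
    """Return the attribute column (9th field) of a GTF line as a dict."""
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) < 9:
        raise ValueError("expected 9 whitespace-separated GTF fields")
    return dict(ATTR.findall(fields[8]))


def ids_differ(line):
    """True when gene_id and transcript_id carry different values."""
    attrs = gtf_attributes(line)
    return attrs["gene_id"] != attrs["transcript_id"]


# First line is taken from the RepMask.gtf excerpt in this thread.
subfamily_line = (
    "1   mm9_rmsk    exon    100000003   100000213   .   +   .   "
    "gene_id RMER19C; transcript_id RMER19C_dup153; family_id ERVK; class_id LTR;"
)
# Hypothetical record in which both IDs name the same individual copy.
locus_line = (
    "1   mm9_rmsk    exon    100000003   100000213   .   +   .   "
    'gene_id "RMER19C_dup153"; transcript_id "RMER19C_dup153";'
)

print(ids_differ(subfamily_line), ids_differ(locus_line))
```

A file where `ids_differ` is true on every line groups many copies under one gene_id, which matches the subfamily-level counting described for TEcount/TEtranscripts.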

savytskanatalia commented 4 years ago

Thank you very much for the reply! For now I'll use the compatible GTF you attached to run TElocal. Thank you for providing it.

I look forward to you releasing a loadable index in the future :)

mhammell-laboratory / TElocal

0 counts for TEs #4