mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

standalone command for TEindex ? #58

Closed SolKatzman closed 4 years ago

SolKatzman commented 4 years ago

I see the recent support to allow gtf.ind input to TEtranscripts.

It worked fine with the hg38_rmsk_TE.gtf.ind that I downloaded from your site.

I would like to do the same from my gene.gtf file.

Is there an easy standalone command to create the ".gtf.ind" from a ".gtf" file?

I want to run TEcount in parallel on multiple samples. Makes sense to only create the index once.

Thanks, Sol Katzman, UCSC Genomics Institute.

olivertam commented 4 years ago

Hi Sol,

We don't think the indexer is robust enough yet, so we haven't made it available to the public. However, we can certainly help with making your gene GTF file into an indexed version. Is it a publicly available GTF, or a custom one?

Thanks.

SolKatzman commented 4 years ago

Dear Oliver,

That would be excellent. Alternatively, if there is a tweak I can make to the python code to write out the index that it builds, that would be easy too. It looks like it took about 1/2 hour:

INFO @ Thu, 16 Jan 2020 16:56:07: Building gene index ....... ... 1400000 GTF lines processed. INFO @ Thu, 16 Jan 2020 17:27:04: Done building gene index ......

We use custom GTF files created using the "CAT" methodology developed (and published) by Ian Fiddes. We have models for several primates that use a common gene and transcript namespace for easy comparative genomics work. They combine Genode with FANTOM transcript definitions.

About 30MB per gzip compressed gtf.

I don't want to make them public at this point, so I will send you a private email with pointers.

I am also interested in your resolution of issue #49 about overlapping transcripts, since these very comprehensive gene models might have a fair number of them.

/Sol.

olivertam commented 4 years ago

Hi Sol,

Thank you for your response.

As you pointed out, the indexer is actually a tweak that writes out the index that it builds. However, we're still trying to see if there are improvements to the methodology (some of the GTF files that we're testing are taking days to build their index), which is why we're not making it public yet. You are welcome to try and tweak your code to print out your indices if you are already building them. We used the cPickle module, and the following code to write out the index

import cPickle as pickle
pickle.dump(geneIdx, filehandle, protocol=2)

All you need is to make sure that the suffix is .ind, and TEcount should treat them as pre-built indices. Otherwise, feel free to email me at tam at cshl dot edu, and I can run the GTF through our indexer.

Regarding issue #49, I don't think we have fully resolved it. One big issue that we have with Gencode annotations is that they are including many repetitive sequences (e.g. 7SK RNA) or non-coding RNA that (almost) completely overlap TE annotations. Our current workaround is to remove transcripts that have over 75% of their sequence (exonic only) overlapping the RepeatMasker annotations. This somewhat decreases the re-distribution issues, and prevents the repetitive "genic" annotation from gobbling up all the uniquely mapped reads. We are currently thinking about improving the EM portion to take into account both genic and TE annotations (currently, we're limiting to TE annotations since they are more problematic). That could be a possible fix, though we would have to test it extensively before we are certain.

Thanks.

SolKatzman commented 4 years ago

Dear Oliver,

Thank you for supplying (offline) the index files for our 3 primate gene GTF files.

Also thank you for posting (on your download site) the TE_rmsk GTF and index files for the recent assemblies:

chimp: panTro6 orang: ponAbe3

I verified the index version of our hg38 GENE GTF by running TEcount on one bam file with the plain GTF and with the index. There were a very small handful of differences in the cntTable output (see below). Do you think this is due to some stochastic difference in the algorithm?

Thank you again, Sol.

diff sol-70638-1579222525_test1/hipsctrld048.cntTable \ sol-70638-1579222525_test2/hipsctrld048.cntTable

43085c43085 < "ENSG00000189319.14" 324 --- > "ENSG00000189319.14" 319 47584,47585c47584,47585 < "ENSG00000210127.1" 7 < "ENSG00000210135.1" 4 --- > "ENSG00000210127.1" 0 > "ENSG00000210135.1" 3 47587c47587 < "ENSG00000210144.1" 18 --- > "ENSG00000210144.1" 21 56558c56558 < "ENSG00000231887.7" 24 --- > "ENSG00000231887.7" 25 65094c65094 < "ENSG00000250299.6" 33 --- > "ENSG00000250299.6" 32 70523c70523 < "ENSG00000258539.1" 131 --- > "ENSG00000258539.1" 135 78662c78662 < "ENSG00000273784.4" 30 --- > "ENSG00000273784.4" 31

olivertam commented 4 years ago

Hi Sol,

That is unusual. We haven't noticed any difference between indexed and raw GTF in our testing. Let me run your GTF (and index) with our validation sets, and see if they are showing differences.

Thanks.

olivertam commented 4 years ago

Hi Sol,

I just ran our datasets (single-end 52nt or paired-end 150nt) using the raw and indexed GTF files, and did not see a difference. I'll dig into it further.

Thanks.

SolKatzman commented 4 years ago

Dear Oliver,

Okay. I think we must have gotten wires slightly crossed between the gtf that I supplied and the ind that you returned for me. I generated my own ind per your instructions above:

geneIdx = GeneFeatures(args.gtffile, args.stranded, "exon", "gene_id") gidxout = open('%s' % args.gtffile + ".ind", 'w') pickle.dump(geneIdx, gidxout, protocol=2) gidxout.close()

Now I get identical cntTable results with either GTF or INDEX input.

Sol.

olivertam commented 4 years ago

Hi Sol,

Great to hear that.

All the best.