mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

running TEcount with external indexes #87

Closed retrogenomics closed 3 years ago

retrogenomics commented 3 years ago

Hi,

I was wondering if it is possible to run TEcount (or TEtranscripts) using external index files: instead of regenerating the index each time TEcount is run, save it once and reload it when needed. This step is quite long, and when running multiple TEcount instances it is clearly a bottleneck.

Thanks in advance for your help -Gael

olivertam commented 3 years ago

Hi Gael,

Yes, it is possible to pre-build the index for TEcount and TEtranscripts. You can find pre-built indices for "common" genome builds here. If you need one from another genome build, or a custom GTF, we can either pre-build it for you, or provide you with a script (still in beta) to generate them. Let me know if this is helpful.

Thanks.

retrogenomics commented 3 years ago

Hi Oliver, Thanks for your quick answer! This has raised a couple of new questions:

  1. How do you instruct TEcount to use the index (or where should it be located)? I have tried to simply add it in the same folder as the .gtf file but it didn't work (still generating the indexes)
  2. Is it possible to use a pre-indexed gene GTF file, too?

Thank you again for your help -Gael

olivertam commented 3 years ago

Hi Gael,

  1. Instead of providing the GTF file, you provide the index (xxx.gtf.ind) (Note: you need to unzip the index file after downloading). TEcount detects a prebuilt index from the file extension and simply loads it. Please note that this functionality is only available for TEtranscripts version 2.1.3 or later.
  2. It is possible to use pre-built gene index files too. We haven't provided prebuilt indices for genes (as there are too many sources with frequent updates to keep up-to-date), but they can be built with the script that I mentioned in my previous comment. As mentioned previously, the script is still in beta, and requires TEtranscripts to be installed for it to work. Please feel free to contact us if you want to use it.
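As a rough illustration of point 1, the sketch below passes an index file where the TE GTF would normally go. The file names are placeholders, and the option names (`-b`, `--GTF`, `--TE`, `--project`) are assumed from the usual TEcount usage, so check them against `TEcount --help` for your version. The command is only printed, not executed.

```shell
#!/usr/bin/env sh
# Sketch only: file names are placeholders, not files shipped with TEtranscripts.
te_index="hg38_rmsk_TE.gtf.ind"

# Downloaded indices are gzipped; unzip once before first use.
if [ -f "${te_index}.gz" ]; then
    gunzip "${te_index}.gz"
fi

# Give the .ind file where the TE GTF would normally go.
cmd="TEcount -b sample.bam --GTF genes.gtf --TE ${te_index} --project sample"

# Dry run: print the command. Replace "echo" with "eval" to actually run it.
echo "$cmd"
```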

Please let us know if you have other questions.

Thanks.

retrogenomics commented 3 years ago

Hi Oliver,

Thank you for clarifying this. I would love to try the gene indexing using the beta script. Could you please make it accessible? Access to the link you provided seems to be restricted.

Thanks a lot ! -Gael

olivertam commented 3 years ago

Hi Gael,

Thanks for letting us know. Here's the new link. Here is the usage info:

usage: TEtranscripts_indexer [-h] --afile annotation-file --itype index-type
                             [--verbose verbose] [--version]

Building an index for gene or transposable element annotations file for
TEtranscripts/TEcount.

optional arguments:
  -h, --help            show this help message and exit
  --afile annotation-file
                        file for indexing of annotations
  --itype index-type    index type to build for this gtf (gene or TE)
  --verbose verbose     Set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2
  --version             show program's version number and exit

Example: TEtranscripts_indexer --afile gene_annotation.gtf --itype gene

Please let us know if you encounter any issues.

Thanks.

olivertam commented 3 years ago

Please hold off downloading the script. There's an error that I'm trying to troubleshoot.

olivertam commented 3 years ago

Hi Gael,

The error should be fixed now. Feel free to test out the script.

Thanks.

retrogenomics commented 3 years ago

Hi Oliver, The link seems dead now. -Gael

olivertam commented 3 years ago

Hi Gael,

Sorry about that. I seem to be able to access it, but I've attached the script (gzipped) to this message. Hope it works.

Thanks. TEtranscripts_indexer.gz

retrogenomics commented 3 years ago

Both links are working now... I've tried to run it from my home directory, but it's not finding the TEToolkit libraries:

Traceback (most recent call last):
  File "./TEtranscripts_indexer", line 25, in <module>
    from TEToolkit.TEindex import *
ImportError: No module named TEToolkit.TEindex

Where should it be installed? TEtranscripts is in my path, and I have tried to add /TEToolkit/lib/python2.7/site-packages to my path, too, but without much success.

olivertam commented 3 years ago

Hi Gael,

You will need to add TEToolkit/lib/python2.7/site-packages to your PYTHONPATH variable rather than your PATH. Please also double check the folder path that you're using. From what you posted, it looks like the TEToolkit folder is in the root directory, and I just wanted to make sure that it's correct. Also, note that building the index can take up quite a bit of memory (almost as much as running TEcount if processing a TE GTF file), so be careful if you have limited memory. For most gene GTF files, I haven't needed more than 8 GB of memory, but it definitely depends on the size of the GTF file.
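To make the PATH versus PYTHONPATH distinction concrete, here is a minimal sketch. `TETOOLKIT_SITE` is a hypothetical variable pointing at wherever your install actually lives, and `can_import` is a helper made up for this example, not part of TEtranscripts.

```shell
#!/usr/bin/env sh
# PATH is searched for executables; PYTHONPATH is searched by Python for modules.
# TETOOLKIT_SITE is a hypothetical location; substitute your real install prefix.
TETOOLKIT_SITE="${TETOOLKIT_SITE:-$HOME/TEToolkit/lib/python2.7/site-packages}"

# Prepend to PYTHONPATH, keeping any existing value.
export PYTHONPATH="${TETOOLKIT_SITE}${PYTHONPATH:+:$PYTHONPATH}"

# Succeeds if interpreter $1 can import module $2.
can_import() {
    "$1" -c "import $2" >/dev/null 2>&1
}

if can_import python TEToolkit.TEindex; then
    echo "TEToolkit.TEindex is importable"
else
    echo "TEToolkit.TEindex not found; check PYTHONPATH=$PYTHONPATH"
fi
```

If the import still fails, the directory is probably wrong or was built for a different Python version than the `python` being run.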

Thanks

retrogenomics commented 3 years ago

It worked like a charm for my gene.gtf file! Thanks

My TEToolkit folder is actually not in the root, I just truncated the start of the path which is very long (on our institute server).

Curiously, I've tried to reindex the TE gtf file obtained from your server (hg38_rmsk_TE.gtf), to be able to use the index on TElocal, too, but I obtained an error:

./TEtranscripts_indexer --afile hg38_rmsk_TE.gtf --itype TE

INFO  @ Wed, 10 Mar 2021 21:04:24:
# ARGUMENTS LIST:
# file to index = hg38_rmsk_TE.gtf
# index type = TE

INFO  @ Wed, 10 Mar 2021 21:04:24: Processing TE annotation file ...

INFO  @ Wed, 10 Mar 2021 21:04:31:
Building TE index .......

Error in building TE index

olivertam commented 3 years ago

Hi Gael,

Great to hear that the gene GTF pre-build worked for you. A couple of things to note for the TE GTF index building:

  1. It requires >20 GB of memory to build the TE index, so it will crash unless your computer has sufficient memory. Unfortunately, the memory requirement for building a TE index is much larger than for a gene index.
  2. The TE indices built for TEtranscripts are not compatible with TElocal. If you require a pre-built index for TElocal, you can obtain one here. If the genome build that you're interested in is not available, we can try to build it for you. While we also have a similar script for TElocal index building (pre-alpha/alpha), it seems to take MUCH longer for many TE GTF files (>2 days).
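On the memory requirement mentioned above, a quick check before starting a TE index build might look like the following. It is a Linux-only sketch that reads `/proc/meminfo`; the 20 GB threshold is the figure quoted in this thread, and `kb_to_gb` is a helper invented for the example.

```shell
#!/usr/bin/env sh
# Linux-only sketch: compare total RAM with a required amount before indexing.
# The 20 GB figure comes from this thread; treat it as a rough floor, not a guarantee.
required_gb=20

# Convert a kB figure (as reported in /proc/meminfo) to whole GB.
kb_to_gb() {
    echo $(( $1 / 1024 / 1024 ))
}

if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    total_gb=$(kb_to_gb "$total_kb")
    if [ "$total_gb" -lt "$required_gb" ]; then
        echo "Only ${total_gb} GB RAM in total; TE indexing may fail" >&2
    else
        echo "${total_gb} GB RAM in total"
    fi
fi
```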

Let me know if you have other questions. Thanks.

retrogenomics commented 3 years ago

Memory shouldn't be an issue (we have 1 TB of RAM on our server, and the internal TEtranscripts or TEcount indexing works fine for both genes and TEs).

However, I spoke too soon. Indexing of genes seemed to work: no error, and an output file was generated. However, when using it with TEcount, I get this error:

ERROR:root:No such file: =/home/gcristofari/references/human/hg38/hg38_refGene.gtf.ind !

but the file is definitely there:

$ ls -l /home/gcristofari/references/human/hg38/hg38_refGene.gtf.ind
-rw-r--r-- 1 gcristofari grp_cristofari 38568793 Mar 10 21:55 /home/gcristofari/references/human/hg38/hg38_refGene.gtf.ind

I'm sorry to bother you so much with this. It is probably an obvious issue, but I can't figure out where the error comes from.

olivertam commented 3 years ago

Could you provide the command line that you use? I'm surprised to see the = in the error message.
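For reference, one common way a leading = ends up in a path is a doubled = in a shell variable assignment. The sketch below uses a made-up variable name and path, and `check_file` is a hypothetical pre-flight check that would catch the problem before TEcount starts.

```shell
#!/usr/bin/env sh
# A doubled "=" is legal shell: the second "=" simply becomes part of the value.
GENE_INDEX==/tmp/example/hg38_refGene.gtf.ind   # typo: one "=" too many
echo "$GENE_INDEX"   # prints "=/tmp/example/hg38_refGene.gtf.ind"

# Hypothetical pre-flight check: fail early if a path is not a readable file.
check_file() {
    if [ ! -r "$1" ]; then
        echo "No such file: $1" >&2
        return 1
    fi
}

check_file "$GENE_INDEX" || echo "fix the assignment before launching TEcount"
```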

retrogenomics commented 3 years ago

Hi Oliver,

You were right: I had a typo in my script with a double '=' when assigning a variable. This is where I am:

TEtranscripts:

Conclusion: I can work with what I have, but if you want to solve the indexing of TE issue, I'd be happy to provide any additional info.

TElocal:

Conclusion:

Thanks in advance for your great help -Gael

olivertam commented 3 years ago

For the gene index for TElocal: Could you confirm the version of TElocal that you're using? We were unable to replicate the error on our end (i.e. our TEtranscripts_indexer gene index was loaded by TElocal version 1.1.1 (running under python versions 2.7.15 & 3.7.4)).

Question about the TE index for GRCh38 GENCODE: are you looking for one compatible with TEtranscripts, or TElocal? The former is available here. If it's the latter, it'll take time, but we can try to build it.

We were unable to replicate your error when running TEtranscripts_indexer on the hg38_rmsk_TE.gtf from our website. Could you let us know which python version you are using?

Thanks.

retrogenomics commented 3 years ago

> For the gene index for TElocal: Could you confirm the version of TElocal that you're using? We were unable to replicate the error on our end (i.e. our TEtranscripts_indexer gene index was loaded by TElocal version 1.1.1 (running under python versions 2.7.15 & 3.7.4)).

TElocal 1.1.1, Python 2.7.17, Python 3.6.12

> Question about the TE index for GRCh38 GENCODE: are you looking for one compatible with TEtranscripts, or TElocal? The former is available here. If it's the latter, it'll take time, but we can try to build it.

Both. I've got the one for TEtranscripts from your site, but not the one for TElocal. If it is very time-consuming, please hold off, since I'm trying other alternatives.

> We were unable to replicate your error when running TEtranscripts_indexer on the hg38_rmsk_TE.gtf from our website. Could you let us know which python version you are using?

See above.

I don't know if this is related or not, but for some reason I need to add TEToolkit/lib/python2.7/site-packages to PYTHONPATH when running TEtranscripts_indexer, yet I need to remove it to run TEtranscripts or TElocal.

olivertam commented 3 years ago

Hi Gael,

That is very unusual. If you are able to run TEtranscripts without that addition to PYTHONPATH, then TEtranscripts_indexer should also run without issue. I wonder if there's a clash of python versions (2.7.17 and 3.6.12) that is causing some issue.

By default, do you know which version of python TEtranscripts is using? Since the shebang line (#!) in TEtranscripts_indexer points to /bin/env python, it might be pointing to a different version of python than the other two are using (and thus why it might be failing to find the TEToolkit libraries despite TEtranscripts being installed). Sorry if this is not too helpful.
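A quick way to compare the interpreters is to look at each script's first line. This is a sketch that assumes the tools are on your PATH (anything not found is skipped); `shebang_of` is a helper written for this example.

```shell
#!/usr/bin/env sh
# Print the interpreter named on a script's first line (empty if no shebang).
shebang_of() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*//p'
}

# Compare the installed tools; anything not found on PATH is skipped.
for tool in TEtranscripts TEcount TEtranscripts_indexer; do
    path=$(command -v "$tool") || continue
    echo "$tool -> $(shebang_of "$path")"
done

# Which python does "env python" resolve to in this shell?
command -v python || echo "no 'python' on PATH"
```

If the tools report different interpreters, running the indexer explicitly with the interpreter that TEtranscripts uses is one thing to try.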

We are building the TElocal index for GENCODE, but as we mentioned, it takes a while. I'll let you know once it's done.

Thanks.

olivertam commented 3 years ago

Hi Gael,

The GRCh38 GENCODE TElocal index has been built, and is available here. Let us know if it doesn't work.

Thanks.

retrogenomics commented 3 years ago

Hi Oliver, Thanks for your help. I'll try it and come back to you.

kashyapchhatbar commented 3 years ago

Hi Oliver and Gael,

I don't mean to hijack this thread, but can I request a GTF for GRCm39 M26?

Thanks for TEtranscripts and all your help.

olivertam commented 3 years ago

Hi,

Thank you for your interest in the software. The TE GTF for GENCODE GRCm39 is available here. Please let us know if you encounter any problems with the file.

Thanks.

filonico commented 3 months ago

Dear @olivertam,

I'm following up on this thread to ask whether there is a newer version of TEtranscripts_indexer.py, and whether I can use it to index my custom GTF files.

Thanks a lot for your work!

olivertam commented 3 months ago

Hi,

Thank you for your interest in the software. You should be able to use the current version to index custom GTF files, as long as they are compatible with TEtranscripts/TEcount.

Thanks.

filonico commented 3 months ago

Yes, sorry, my bad. I meant to ask whether I can use it to index large GTFs that need to be processed across several different experiments. Since indexing them takes a long time, I was thinking of indexing them before running TEtranscripts/TEcount, so as to save machine time.

olivertam commented 3 months ago

Hi,

Yes, you can definitely use it to pre-build the gene and TE index, especially if it takes a long time to index them each time. This is the usage:

usage: TEtranscripts_indexer [-h] --afile annotation-file --itype index-type
                             [--verbose verbose] [--version]

Building an index for gene or transposable element annotations file for
TEtranscripts/TEcount.

optional arguments:
  -h, --help            show this help message and exit
  --afile annotation-file
                        file for indexing of annotations
  --itype index-type    index type to build for this gtf (gene or TE)
  --verbose verbose     Set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2
  --version             show program's version number and exit

Example: TEtranscripts_indexer --afile gene_annotation.gtf --itype gene

If you have been able to run TEtranscripts with your custom index, this should be able to prebuild them, and you can then use the xxxx.gtf.ind file instead of the GTF file for your runs.
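A minimal sketch of that build-once, reuse-many workflow follows. It assumes the indexer writes `<annotation>.ind` next to its input (as with the hg38_refGene.gtf.ind file earlier in this thread) and that TEcount takes `-b`, `--GTF`, `--TE` and `--project`. `INDEXER`, `ensure_index`, `plan_runs` and all file names are placeholders for illustration.

```shell
#!/usr/bin/env sh
# INDEXER can be overridden, e.g. INDEXER=./TEtranscripts_indexer
INDEXER="${INDEXER:-TEtranscripts_indexer}"

# Build <annotation>.ind only if it is missing or empty, then print its path.
# $1 = annotation file, $2 = index type (gene or TE).
ensure_index() {
    if [ ! -s "$1.ind" ]; then
        "$INDEXER" --afile "$1" --itype "$2" >&2 || return 1
    fi
    echo "$1.ind"
}

# Print one TEcount command per BAM, all sharing the same two index files.
# $1 = gene index, $2 = TE index, remaining arguments = BAM files.
plan_runs() {
    gene_ind="$1"; te_ind="$2"; shift 2
    for bam in "$@"; do
        echo "TEcount -b $bam --GTF $gene_ind --TE $te_ind --project ${bam%.bam}"
    done
}

# Example with placeholder names: indices already built, three samples.
# To build first: gene_ind=$(ensure_index genes.gtf gene)
plan_runs genes.gtf.ind te.gtf.ind a.bam b.bam c.bam
```

The commands are printed rather than executed, so the plan can be reviewed or handed to a job scheduler.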

Let us know if you encounter any issues.

Thanks.