Closed retrogenomics closed 3 years ago
Hi Gael,
Yes, it is possible to pre-build the index for TEcount and TEtranscripts. You can find pre-built indices for "common" genome builds here. If you need one from another genome build, or a custom GTF, we can either pre-build it for you, or provide you with a script (still in beta) to generate them. Let me know if this is helpful.
Thanks.
Hi Oliver, thanks for your quick answer! This has raised a number of new questions:
Thank you again for your help -Gael
Hi Gael,
xxx.gtf.ind) (Note: you need to unzip the index file after downloading). TEcount will detect a prebuilt index based on the file extension and will simply load the file. Please note that this functionality is only available in TEtranscripts version 2.1.3 or later. Please let us know if you have other questions.
Thanks.
Hi Oliver,
Thank you for clarifying this. I would love to try the gene indexing using the beta script. Could you please make it accessible? Access to the link you provided seems to be restricted.
Thanks a lot ! -Gael
Hi Gael,
Thanks for letting us know. Here's the new link. Here is the usage info:
usage: TEtranscripts_indexer [-h] --afile annotation-file --itype index-type
                             [--verbose verbose] [--version]

Building an index for gene or transposable element annotations file for
TEtranscripts/TEcount.

optional arguments:
  -h, --help            show this help message and exit
  --afile annotation-file
                        file for indexing of annotations
  --itype index-type    index type to build for this gtf (gene or TE)
  --verbose verbose     Set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2
  --version             show program's version number and exit

Example: TEtranscripts_indexer --afile gene_annotation.gtf --itype gene
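For scripted use, the indexer call described in the usage text can be wrapped in a few lines of Python. This is only a sketch: `build_indexer_cmd` and `run_indexer` are hypothetical helper names, only the `--afile` and `--itype` flags shown in the usage text are used, and it assumes that `TEtranscripts_indexer` is executable and on your PATH.

```python
import subprocess

# The two index types listed in the usage text.
VALID_INDEX_TYPES = ("gene", "TE")


def build_indexer_cmd(annotation_file, index_type,
                      executable="TEtranscripts_indexer"):
    """Return the argument list for one indexer run (hypothetical helper)."""
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(
            "index type must be 'gene' or 'TE', got %r" % (index_type,))
    return [executable, "--afile", annotation_file, "--itype", index_type]


def run_indexer(annotation_file, index_type):
    """Run the indexer; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(build_indexer_cmd(annotation_file, index_type))
```

Calling `run_indexer("gene_annotation.gtf", "gene")` would then be equivalent to the example command line in the usage text.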
Please let us know if you encounter any issues.
Thanks.
Please hold off downloading the script. There's an error that I'm trying to troubleshoot.
Hi Gael,
The error should be fixed now. Feel free to test out the script.
Thanks.
Hi Oliver, The link seems dead now. -Gael
Hi Gael,
Sorry about that. I seem to be able to access it, but I've attached the script (gzipped) to this message. Hope it works.
Thanks. TEtranscripts_indexer.gz
Both links are working now... I've tried to run it from my home, but it's not finding the TEToolkit libraries:
Traceback (most recent call last):
  File "./TEtranscripts_indexer", line 25, in <module>
    from TEToolkit.TEindex import *
ImportError: No module named TEToolkit.TEindex
Where should it be installed? TEtranscripts is in my path, and I have tried to add /TEToolkit/lib/python2.7/site-packages to my path, too, but without much success.
Hi Gael,
You will need to add TEToolkit/lib/python2.7/site-packages to your PYTHONPATH variable rather than your PATH.
Please also double check the folder path that you're using. From what you posted, it looks like the TEToolkit folder is in the root directory, and I just wanted to make sure that it's correct.
Also, note that building the index can take up quite a bit of memory (almost as much as running TEcount if processing a TE GTF file), so be careful if you have limited memory. For most gene GTF files, I haven't needed more than 8 GB of memory, but it is definitely dependent on the size of the GTF file.
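To check whether a PYTHONPATH change has taken effect, it can help to ask the interpreter directly whether it can import the module named in the traceback. The following is a minimal sketch; `can_import` and `report` are hypothetical helper names, and the script must be run with the same `python` that the indexer uses.

```python
import importlib
import sys


def can_import(module_name):
    """Return True if the running interpreter can import module_name."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


def report(module_name="TEToolkit.TEindex"):
    """Summarise the interpreter in use and whether the module is importable."""
    return {
        "interpreter": sys.executable,
        "version": sys.version.split()[0],
        "importable": can_import(module_name),
        # Directories searched for modules; PYTHONPATH entries appear here.
        "search_path": list(sys.path),
    }
```

If `report()["importable"]` is False, comparing `report()["search_path"]` with the location of the `site-packages` folder shows whether the directory was actually picked up.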
Thanks
It worked like a charm for my gene.gtf file ! Thanks
My TEToolkit folder is actually not in the root, I just truncated the start of the path which is very long (on our institute server).
Curiously, I've tried to reindex the TE gtf file obtained from your server (hg38_rmsk_TE.gtf), to be able to use the index on TElocal, too, but I obtained an error:
./TEtranscripts_indexer --afile hg38_rmsk_TE.gtf --itype TE
INFO @ Wed, 10 Mar 2021 21:04:24:
# ARGUMENTS LIST:
# file to index = hg38_rmsk_TE.gtf
# index type = TE
INFO @ Wed, 10 Mar 2021 21:04:24: Processing TE annotation file ...
INFO @ Wed, 10 Mar 2021 21:04:31:
Building TE index .......
Error in building TE index
Hi Gael,
Great to hear that the gene GTF pre-build worked for you. A couple of things to note for the TE GTF index building:
1) Building the TE index requires >20 GB of memory, so it will crash unless your computer has sufficient memory. Unfortunately, the memory requirement for building the TE index is much larger than for the gene index.
2) The TE indices built for TEtranscripts are not compatible with TElocal. If you require a pre-built index for TElocal, you can obtain one here. If the genome build that you're interested in is not available, we can try to build it for you. While we also have a similar script for TElocal index building (pre-alpha/alpha), it seems to take MUCH longer for many TE GTF files (>2 days).
Let me know if you have other questions. Thanks.
Memory shouldn't be an issue (we have 1 TB of RAM on our server, and the internal TEtranscripts or TEcount indexing works fine for both genes and TEs).
However, I spoke too soon. Indexing of genes seemed to work: no error, and an output file was generated. However, when using it with TEcount, I get this error:
ERROR:root:No such file: =/home/gcristofari/references/human/hg38/hg38_refGene.gtf.ind !
but the file is definitely there:
$ ls -l /home/gcristofari/references/human/hg38/hg38_refGene.gtf.ind
-rw-r--r-- 1 gcristofari grp_cristofari 38568793 Mar 10 21:55 /home/gcristofari/references/human/hg38/hg38_refGene.gtf.ind
I'm sorry to bother you so much with this. It is probably an obvious issue, but I can't figure out where the error comes from.
Could you provide the command line that you use? I'm surprised to see the = in the error message.
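A stray character in front of a path, such as the `=` in the error message above, typically comes from how the path variable was assigned in the calling script. A small pre-flight check can flag this before a long run is launched. This is a sketch only; `check_annotation_path` is a hypothetical helper and is not part of TEtranscripts.

```python
import os


def check_annotation_path(path):
    """Return a list of human-readable problems with an annotation/index path."""
    problems = []
    stripped = path.strip()
    if stripped != path:
        problems.append("path has leading or trailing whitespace")
    if stripped.startswith("="):
        problems.append(
            "path starts with '=', possibly from a doubled '=' in an assignment")
    # Test the cleaned path so the report says whether the real file exists.
    cleaned = stripped.lstrip("=")
    if not os.path.isfile(cleaned):
        problems.append("no such file: %s" % cleaned)
    return problems
```

An empty list means the path looks usable; anything else is worth fixing before starting TEcount.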
Hi Oliver,
You were right: I had a typo in my script with a double '=' when assigning a variable. This is where I am:
Conclusion: I can work with what I have, but if you want to solve the indexing of TE issue, I'd be happy to provide any additional info.
Conclusion:
Thanks in advance for your great help -Gael
For the gene index for TElocal: Could you confirm the version of TElocal that you're using? We were unable to replicate the error on our end (i.e. our TEtranscripts_indexer gene index was loaded by TElocal version 1.1.1 (running under python versions 2.7.15 & 3.7.4)).
Question about the TE index for GRCh38 GENCODE: are you looking for one compatible with TEtranscripts, or TElocal? The former is available here. If it's the latter, it'll take time, but we can try to build it.
We were unable to replicate your error when running TEtranscripts_indexer on the hg38_rmsk_TE.gtf from our website. Could you let us know which python version you are using?
Thanks.
For the gene index for TElocal: Could you confirm the version of TElocal that you're using? We were unable to replicate the error on our end (i.e. our TEtranscripts_indexer gene index was loaded by TElocal version 1.1.1 (running under python versions 2.7.15 & 3.7.4)).
TElocal 1.1.1, Python 2.7.17, Python 3.6.12
Question about the TE index for GRCh38 GENCODE: are you looking for one compatible with TEtranscripts, or TElocal? The former is available here. If it's the latter, it'll take time, but we can try to build it.
Both. I've got the one for TEtranscripts on your site, but not that for TElocal. If it is very time-consuming, please wait since I'm trying other alternatives.
We were unable to replicate your error when running TEtranscripts_indexer on the hg38_rmsk_TE.gtf from our website. Could you let us know which python version you are using?
See above.
I don't know if this is related or not, but for some reason, I need to add TEToolkit/lib/python2.7/site-packages to PYTHONPATH when running TEtranscripts_indexer, but I need to remove it to run TEtranscripts or TElocal.
Hi Gael,
That is very unusual. If you are able to run TEtranscripts without that addition to PYTHONPATH, then TEtranscripts_indexer should also run without issue. I wonder if there's a clash of python versions (2.7.17 and 3.6.12) that is causing some issue.
By default, do you know which version of python TEtranscripts is using? Since the shebang line (#!) in TEtranscripts_indexer points to /bin/env python, it might be pointing to a different version of python than the other two are using (and thus why it might be failing to find the TEToolkit libraries despite TEtranscripts being installed).
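One way to test the shebang hypothesis is to read the first line of each script and see which interpreter it would launch. The sketch below assumes Python 3.3 or later (for `shutil.which`) and that the tools are installed as ordinary scripts with a shebang line; `shebang_command` and `resolve_interpreter` are hypothetical helper names, and `env` options such as `-S` are not handled.

```python
import shutil  # shutil.which requires Python 3.3+


def shebang_command(script_path):
    """Return the tokens of a script's shebang line, or [] if it has none."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", "replace").strip()
    if not first_line.startswith("#!"):
        return []
    return first_line[2:].split()


def resolve_interpreter(script_path):
    """Return the interpreter the shebang would launch, or None if unknown."""
    tokens = shebang_command(script_path)
    if not tokens:
        return None
    # For "#!/bin/env python" the real interpreter is looked up on PATH,
    # so the answer depends on the environment of the calling shell.
    if tokens[0].endswith("/env") and len(tokens) > 1:
        return shutil.which(tokens[1])
    return tokens[0]
```

If `resolve_interpreter` gives different results for `TEtranscripts_indexer` and for the installed `TEtranscripts` script, the two are running under different Python installations, which would explain why only one of them finds the TEToolkit libraries.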
Sorry if this is not too helpful.
We are building the TElocal index for GENCODE, but as we mentioned, it takes a while. I'll let you know once it's done.
Thanks.
Hi Gael,
The GRCh38 GENCODE TElocal index has been built, and is available here. Let us know if it doesn't work.
Thanks.
Hi Oliver, Thanks for your help. I'll try it and come back to you.
Hi Oliver and Gael,
I hope I'm not hijacking this thread, but could I request a GTF for GRCm39 M26?
Thanks for TEtranscripts and all your help.
Hi,
Thank you for your interest in the software. The TE GTF for GENCODE GRCm39 is available here. Please let us know if you encounter any problems with the file.
Thanks.
Dear @olivertam,
I'm following up on this thread to ask whether there is a newer version of TEtranscripts_indexer.py, or whether I can use the current one to index my custom GTF files.
Thanks a lot for your work!
Hi,
Thank you for your interest in the software. You should be able to use the current version to index custom GTF files, as long as they are compatible with TEtranscripts/TEcount.
Thanks.
Yes, sorry, my bad. I meant to ask whether I can use it to index large GTFs that need to be processed across several different experiments. Since indexing them takes a long time, I was thinking about indexing them before running TEtranscripts/TEcount, to save machine time.
Hi,
Yes, you can definitely use it to pre-build the gene and TE index, especially if it takes a long time to index them each time. This is the usage:
usage: TEtranscripts_indexer [-h] --afile annotation-file --itype index-type
                             [--verbose verbose] [--version]

Building an index for gene or transposable element annotations file for
TEtranscripts/TEcount.

optional arguments:
  -h, --help            show this help message and exit
  --afile annotation-file
                        file for indexing of annotations
  --itype index-type    index type to build for this gtf (gene or TE)
  --verbose verbose     Set verbose level. 0: only show critical message, 1:
                        show additional warning message, 2: show process
                        information, 3: show debug messages. DEFAULT:2
  --version             show program's version number and exit

Example: TEtranscripts_indexer --afile gene_annotation.gtf --itype gene
If you have been able to run TEtranscripts with your custom GTF files, this should be able to prebuild their indices, and you can then use the xxxx.gtf.ind file instead of the GTF file for your runs.
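When the same annotation is reused across many runs, a launcher script can pick the prebuilt index automatically and fall back to the plain GTF otherwise. This is a minimal sketch that assumes the index sits next to the GTF and follows the `<name>.gtf.ind` naming seen in this thread; `annotation_argument` is a hypothetical helper name. As noted earlier in the thread, loading a prebuilt index by file extension requires TEtranscripts 2.1.3 or later.

```python
import os


def annotation_argument(gtf_path):
    """Return gtf_path + '.ind' if that prebuilt index exists, else gtf_path."""
    index_path = gtf_path + ".ind"
    return index_path if os.path.isfile(index_path) else gtf_path
```

The returned path can then be passed wherever the GTF file would normally be given, so that runs without a prebuilt index still work unchanged.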
Let us know if you encounter any issues.
Thanks.
Hi,
I was wondering if it is possible to run TEcount (or TEtranscripts) using external index files: rather than regenerating the index each time TEcount is run, save it once and reload it when needed. The indexing step is quite long, and when running multiple TEcount instances it is clearly a bottleneck.
Thanks in advance for your help -Gael