Error in building gene/TE index

shenmeguimlf commented 3 years ago

Hi Oliver,

I would like to use TEtranscripts to quantify TE expression in a , since it has very high TE content. I modified my RepeatMasker.out file(rm.tab.out in molluscgtf.zip molluscgtf.zip ) following your instructions and use makeTEgtf.pl generated rm.gtf.

However, this reported the following error, and I can't figure out why.

INFO  @ Mon, 12 Jul 2021 16:39:23: 
# ARGUMENTS LIST:
# name = adductor_muscle
# treatment files = ['/mydir/adductor_muscle/alignments.sorted.bam']
# control files = ['/mydir/gill/alignments.sorted.bam']
# GTF file = mygene.gtf 
# TE file = rm.gtf 
# multi-mapper mode = multi 
# stranded = yes
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Mon, 12 Jul 2021 16:39:23: Processing GTF files ...

INFO  @ Mon, 12 Jul 2021 16:39:23: Building gene index .......

100000 GTF lines processed.
INFO  @ Mon, 12 Jul 2021 16:40:24: Done building gene index ......

INFO  @ Mon, 12 Jul 2021 16:40:26: 
Building TE index .......

Error in building gene/TE index

Does INFO @ Mon, 12 Jul 2021 16:40:24: Done building gene index ...... mean that gene index was successfully built? I didn't find anything generated. what's wrong with my input, gene gtf or TE gtf?

Thanks for your time.

Best LF

olivertam commented 3 years ago

Hi LF,

The message means that the gene index was built successfully, and the error occurred during the TE index step. You won't find any files generated by this step, as the program is building the index into memory. I tested your TE GTF out, and it appears to be built correctly (I can get it loaded with my version of TEtranscripts). There could be a few things that could contribute to it:

The version of TEtranscripts: I noticed that one of the parameters stranded had the parameter yes, which was replaced by forward in version 2.1.4. Note that the current version (2.2.1) is also Python 3 compatible (but should still run in Python 2), in case that makes a difference for you.
We have seen this error sometimes with either disk space or memory issue. When I ran this, it used about 400Mb of memory, and around 82Mb of disk space, so I can't imagine that being the main issue.
We have also seen this error "randomly" pop up in runs (typically when we launch multiple runs at the same time), but subsequent runs (with the same command line) works fine. It might be one of those random "one-off" issue as well.

Let us know if you are still having the same issue upon re-running (maybe after updating TEtranscripts, but up to you), and we can try to resolve it.

Thanks.

shenmeguimlf commented 3 years ago

Thanks for your help, It's the python version problem. I reinstalled it via pip and run it on python2.7 and this error is gone, it's still running without output, guess it has something to do with the compatibility issue. I still have some little problem about the version of TEtranscripts:

Today I reinstalled TEtranscripts via pip, the installation log showed

Successfully installed TEtranscripts-2.2.1 argparse-1.4.0 pysam-0.16.0.1

However, when I opened the script of this version of TEtranscripts, as well as yesterday's version, it shows

@version: 2.0.3

and the output is still "stranded = yes", so I'm a liitle confused about it's version....

Also, when running on python3 it has reported some small compatibility issues in some scripts in TEToolkit(indentation/except ValueError as e, etc.), so I guess it's better to run it on python2.

olivertam commented 3 years ago

Hi,

I suspect that an older version of TEtranscripts is still being called instead of the newer one. You might need to do which TEtranscripts to see where it is installed, and remove the older version. That version (2.0.3) would definitely have issues when running in Python 3 (not limited to the errors that you have reported thus far).

Thanks.

shenmeguimlf commented 3 years ago

I have specified the full path of TEtranscripts, it's the one I have installed via pip2

olivertam commented 3 years ago

Hi,

Unfortunately, it is possible that your PYTHONPATH variable is still pointing to the older version of the TEToolkit library that was previously installed, and thus, it's not calling the library from the newest version. You might need to check (in python) what TEToolkit.__file__ points to, and whether that is the location of your pip-installed TEtranscripts.

Thanks.

shenmeguimlf commented 3 years ago

I have removed the previously installed TEtranscripts/TEtoolkit and reinstalled, and unset PYTHONPATH then specified PYTHONPATH to the new installed one, but the strand is still yes.......

shenmeguimlf commented 3 years ago

Anyway, it is still running now, I would forget the version issue and wait for its output.

Thanks.

shenmeguimlf commented 3 years ago

Hi,

this time I installed TEtranscripts via git clone, installed on python 2.7, the script shows that version is 2.2.1, I deleted the TEtoolkit lib installed by pip, and used the one installed by git however, the log is different from last time, the strand is not yes, not forward, but no, given the condition that the input files are identical to the last time.


INFO  @ Tue, 13 Jul 2021 11:20:09: 
# ARGUMENTS LIST:
# name = adductor_muscle
# treatment files = ['/.../alignments.sorted.bam']
# control files = ['/.../alignments.sorted.bam']
# GTF file = mygene.gtf 
# TE file = rm.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Tue, 13 Jul 2021 11:20:09: Processing GTF files ...

INFO  @ Tue, 13 Jul 2021 11:20:09: Building gene index ....... 

100000 GTF lines processed.
INFO  @ Tue, 13 Jul 2021 11:20:46: Done building gene index ...... 

INFO  @ Tue, 13 Jul 2021 11:20:49: Building TE index ....... 

INFO  @ Tue, 13 Jul 2021 11:21:32: Done building TE index ...... 

INFO  @ Tue, 13 Jul 2021 11:21:32: 
Reading sample files ...

olivertam commented 3 years ago

Hi,

This is because the default value for the stranded parameter was set to no in the newer version (i.e. assume unstranded library). This is because the previous default (yes) was confusing and not appropriate for the majority of stranded libraries (e.g. Illumina, which are reverse stranded). Thus, if you have a stranded library, you would need to determine what parameter is appropriate (e.g. what library prep was used), and provide it appropriately.

Thanks.

shenmeguimlf commented 3 years ago

Hi,

Thanks to your generous help, I appreciate it very much:) the pipeline works out perfectly, I've got the results now, but I'm new to TE and have several other questions...

The second column of TEcount result, is this the number of reads aligned? The ratio of this number between genes is not coherent with the ratio of TPM of these genes calculated by Salmon.
TE elements from the subclass Helitron contribute most to our genome, >20%, DNA/DNA account for about 5%, but the expression shows the expression level of Helitron elements isn't very high, the highest expressed TE is from DNA subclass, approximately 6 fold higher than that of the highest expressed Helitron elements, is this normal? I test it on another animal, it appears that the highest expressed TE is from the largest subclass.
TEcount result is family specific, not giving the locus information, I counted percent of the number of those elements in genome, and the percent of the expression in RNA tissues, and find some elements have low content in genome, but have high expression, like Polinton-1_XT, I don't have a clue about which specific locus has high expression, I would like to discuss if the expression of TE contributes to the expression of their upstream or downstream genes, does that non-locus-specific expression level count for an evidence?

olivertam commented 3 years ago

Hi,

The output of TEcount is raw counts (in essence, number of reads attributable to the TE). This is not TPM, and is the preferred input for differential analysis algorithms (e.g. DESeq2 and edgeR). It is difficult to calculate TPM, as the "length" of the "expressed" TE (especially when aggregated on the sub-family level) cannot be accurately determined.
This is a difficult question to answer, as genome content can be indicative of past activity (i.e. how it managed to increase its copy number in the genome), but might no longer reflect current transcriptional activity. It is possible that some of the TE reads are derived from nearby genes, but that is very difficult to assess with short reads.
Expression of TE can contribute to nearby gene expression, though proving that requires long reads (as mentioned in the point above). The non-locus level assessment could provide hints, but definitely not sufficient on its own. If you want locus-level quantification, you can try TElocal. It does require its own pre-built TE index, which we can help you with. Note that TElocal is still in beta.

Hope this is helpful.

Thanks.

shenmeguimlf commented 3 years ago

Thanks! This genome is assembled using PacBio CLR reads, so we have long reads, but the RNA-seq data are short reads, I'd like to try TElocal, but does that need long RNA reads?

olivertam commented 3 years ago

Hi,

TElocal works with short reads from RNA-seq.

Thanks.

77CXiao commented 4 months ago

That version (2.0.3) would definitely have issues when running in Python 3 (not limited to the errors that you have reported thus far). This is a critical piece of information. I installed version 2.2.3, but the presence of version 2.0.3 caused an "Error in building gene/TE index". Therefore, please check your version first, and it is recommended to delete the old version before running.

mhammell-laboratory / TEtranscripts

Error in building gene/TE index #95