oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

Sequence ID length problem #16

Closed: cslamo closed this issue 6 years ago

cslamo commented 6 years ago

Hi. LTR_retriever stops with a RepeatMasker error resulting from a sequence ID longer than 50 characters in the file xxx.fa.mod.ltrTE.trunc. The ID in question is >LSRX01000097.1:1074794..1082843|LSRX01000097.1:1074380..1083242[IN]

Note that the original sequence IDs are not particularly long, but the large coordinate numbers make the derived IDs long. Is there a fix for this problem? Below is the full output, including the RepeatMasker test run.

Thanks. Claudio

##########################

LTR_retriever v1.8.0

##########################

Contributors: Shujun Ou, Ning Jiang

Please cite: Ou S, Jiang N: LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiology 2018, 176:1410-1422

Parameters: -genome /home/manager/BigShare/dinos/11-200.fa -infinder /home/manager/LTR_Finder/source/11-200.finder.scn

Thu May 31 19:51:58 CST 2018 Dependency checking: All passed!
Thu May 31 19:52:52 CST 2018 The longest sequence ID in the genome contains 109 characters, which is longer than the limit (15)
Trying to reformat seq IDs...
Attempt 1...
Thu May 31 19:53:12 CST 2018 Seq ID conversion successful!

Thu May 31 19:53:12 CST 2018 Start to convert inputs...
Total candidates: 173
Total uniq candidates: 173

Thu May 31 19:53:25 CST 2018 Module 1: Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
Sequences containing tandem repeats will be discarded.

Thu May 31 19:53:31 CST 2018 145 clean candidates remained

Thu May 31 19:53:31 CST 2018 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu May 31 19:54:13 CST 2018 Intact LTR-RT found: 118

Can't remove /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.pass.clust: Text file busy, skipping file.
Thu May 31 19:54:30 CST 2018 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
Use -notrunc if you don't want to keep them.

Thu May 31 19:54:30 CST 2018 4 truncated LTR-RTs found
Can't remove /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc: Text file busy, skipping file.
Warning: LOC list /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.veryfalse is empty.
ERROR: RepeatMasker is not running properly! Please check the file /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib and /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc and test run:
RepeatMasker -e ncbi -q -pa 4 -no_is -norna -nolow -div 40 -lib /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib -cutoff 225 /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc
Please report errors to https://github.com/oushujun/LTR_retriever/issues
Program halt!

manager@sb:~/RepeatMasker$ ./RepeatMasker -e ncbi -q -pa 4 -no_is -norna -nolow -div 40 -lib /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib -cutoff 225 /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc
RepeatMasker version open-4.0.7
Search Engine: NCBI/RMBLAST [ 2.2.27+ ]
Master RepeatMasker Database: /home/manager/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127-rb20170127 )
Custom Repeat Library: /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.mask.lib

analyzing file /home/manager/BigShare/dinos/11-200.fa.mod.ltrTE.trunc
FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 ) at ./RepeatMasker line 718.
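For anyone hitting the same RepeatMasker message, a quick way to see which IDs are over the limit is to scan the FASTA headers directly. The following is only a rough sketch and not part of LTR_retriever or RepeatMasker: the function name `find_long_ids` is made up for illustration, the limit of 50 is taken from the error message above, and it assumes the ID is the first whitespace-delimited token after `>`.

```python
def find_long_ids(lines, limit=50):
    """Return (sequence_id, length) for every FASTA ID longer than `limit`.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    too_long = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        seq_id = fields[0] if fields else ""
        if len(seq_id) > limit:
            too_long.append((seq_id, len(seq_id)))
    return too_long


if __name__ == "__main__":
    import sys

    # Usage: python find_long_ids.py some_file.fa
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for seq_id, length in find_long_ids(handle):
                print(f"{length}\t{seq_id}")
```

Run against the `.trunc` file named in the error, this should list the ID quoted at the top of this issue, since the coordinates appended to the contig name push it past 50 characters.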

oushujun commented 6 years ago

Hello Claudio,

Sorry for this error. I noticed RepeatMasker's naming limitation while developing LTR_retriever and implemented two strategies to shorten sequence names, but evidently they do not cover every case. In this case, please shorten the genome sequence names before running LTR_retriever (and make sure each sequence still has a unique name). I see you were using LTR_finder to predict LTR candidates, so if you shorten the names in the genome, you can either apply the same renaming to the sequence names in the LTR_finder output file or rerun LTR_finder.

You may use the script described in this thread (#14) to convert your genome.
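For illustration only, the renaming idea could look roughly like the sketch below. This is not the script from #14; the function name `shorten_fasta_ids` and the `seq` prefix are hypothetical, and it assumes the original ID is the first whitespace-delimited token of each header. It replaces each header with a short numbered ID and returns a mapping from old to new names, so the same renaming can be applied elsewhere or reversed later.

```python
def shorten_fasta_ids(lines, prefix="seq"):
    """Rename FASTA headers to short unique IDs (prefix + running number).

    Returns (new_lines, mapping), where mapping goes from the original ID
    (first whitespace-delimited token of the header) to the new short ID.
    Hypothetical helper for illustration; not part of LTR_retriever.
    """
    mapping = {}
    new_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            old_id = fields[0] if fields else ""
            if old_id in mapping:
                # Duplicate names would make the renaming ambiguous.
                raise ValueError(f"duplicate sequence ID: {old_id!r}")
            new_id = f"{prefix}{len(mapping) + 1}"
            mapping[old_id] = new_id
            new_lines.append(">" + new_id)
        else:
            new_lines.append(line)
    return new_lines, mapping


if __name__ == "__main__":
    genome = [">LSRX01000097.1 example description", "ACGT", ">LSRX01000098.1", "TTGA"]
    renamed, id_map = shorten_fasta_ids(genome)
    print("\n".join(renamed))
    for old, new in id_map.items():
        print(f"{old}\t{new}")
```

Keeping the mapping matters because the LTR_finder output still refers to the original names. Either that file has to be renamed consistently, or LTR_finder has to be rerun on the renamed genome, as suggested above. Short numbered IDs also stay well under the 15-character limit that LTR_retriever reports in the log.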

I also noticed two complaints from your system: "Text file busy, skipping file." Please make sure the working folder is not busy and has read/write permissions enabled.

Let me know if the issue is resolved. I will also look into this further.

Best, Shujun

cslamo commented 6 years ago

Thanks, it's running fine now.

Claudio