oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

ERROR: Fail to convert seq IDs to less than 15 character #14

Closed: intirules closed this issue 6 years ago

intirules commented 6 years ago

I want to run my LTRharvest results through LTR_retriever, and 1 of my 22 genomes gives me this error:

$$$ ERROR: Fail to convert seq IDs to less than 15 characters! Please provide a genome with shorter seq IDs.

In LTRharvest I used:

gt ltrharvest -index 1.fna -seqids yes -tabout no -seed 30 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3 -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 15000 -similar 80.0 -overlaps no -mintsd 4 -maxtsd 20 -motif TGCA -motifmis 1 -vic 60 > 1.harvest.scn

In LTR_retriever:

perl LTR_retriever -genome /medicina/wocana/Tesis/Secuencias/Retriever/1/1.fna -inharvest /medicina/wocana/Tesis/Secuencias/Harvest/1/1.harvest.scn

Any idea how I can fix it?

oushujun commented 6 years ago

Hello,

This message means that the sequence names in the genome are too long for RepeatMasker, which accepts names of at most 15 characters. LTR_retriever tries two approaches to shorten long sequence names to fit this requirement; they succeed most of the time but can occasionally fail (as with your 1 of 22 genomes). In such cases, you will need to convert the sequence names manually on the command line, then feed the converted genome to LTR_retriever. You may not need to rerun LTRharvest as long as the sequence order is unchanged.
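Before renaming anything, it can help to see which names actually exceed the limit. The following is a minimal Python sketch, not part of LTR_retriever: the function name `long_seq_ids` and the example headers are made up for illustration, and it assumes that the sequence ID is the first whitespace-delimited token after `>` and that the limit is 15 characters as described above.

```python
def long_seq_ids(fasta_lines, max_len=15):
    """Return the IDs (first token of each '>' header) longer than max_len.

    Assumption: max_len=15 follows the limit mentioned in this thread.
    """
    too_long = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if len(seq_id) > max_len:
                too_long.append(seq_id)
    return too_long


# Made-up headers for illustration only; for a real genome, pass an open
# file handle instead, e.g.  with open("1.fna") as fh: long_seq_ids(fh)
example = [
    ">scaffold_0001_assembly_v1 some description\n",
    "ACGT\n",
    ">chr1\n",
    "ACGT\n",
]
print(long_seq_ids(example))
```

If the returned list is empty, the header lengths are probably not the cause and the failure lies elsewhere.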

For sequence name conversion, I have had success with Perl one-liners such as:

perl -nle 's/PATTERN//g; print $_' genome.fa > genome.fa.modified

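Simply replace 'PATTERN' with the string shared among the long sequence names. Note that the substitution is applied to every line of the file, so choose a pattern that cannot match sequence data.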
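For those who prefer Python, the same idea can be sketched so that only header lines are touched and the result is checked. This is an illustrative sketch under stated assumptions, not the method LTR_retriever uses: `shorten_headers` is a hypothetical helper, the pattern is treated as a regular expression (as in the Perl substitution), the ID is taken to be the first token after `>`, and the 15-character limit comes from the explanation in this thread.

```python
import re


def shorten_headers(fasta_lines, pattern, max_len=15):
    """Yield FASTA lines with `pattern` removed from header lines only.

    Raises ValueError if a renamed ID is empty, longer than max_len,
    or collides with an earlier ID.
    """
    seen = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            # Sequence lines are passed through unchanged.
            yield line
            continue
        body = re.sub(pattern, "", line[1:].rstrip("\n"))
        fields = body.split()
        seq_id = fields[0] if fields else ""
        if not seq_id or len(seq_id) > max_len:
            raise ValueError(f"ID empty or still too long: {body!r}")
        if seq_id in seen:
            raise ValueError(f"duplicate ID after renaming: {seq_id!r}")
        seen.add(seq_id)
        yield ">" + body + "\n"


# Made-up headers and pattern, for illustration only.
example = [
    ">scaffold_0001_assembly_v1 some description\n",
    "ACGT\n",
    ">scaffold_0002_assembly_v1\n",
    "GGCC\n",
]
for out_line in shorten_headers(example, r"_assembly_v1"):
    print(out_line, end="")
```

Failing loudly on duplicates matters here, because removing a shared string can make two different sequences end up with the same name.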

Let me know if you have further questions.

Shujun


intirules commented 6 years ago

Thanks. In the end we solved it with:

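$ vi EditFasta

# Move the last whitespace-separated field of each FASTA header to the front,
# so that it becomes the sequence ID; sequence lines are copied unchanged.
in_file = '1.fna'
out_file = '1_fixed.fna'

with open(in_file, 'r') as f, open(out_file, 'w') as g:
    for line in f:
        if line.startswith('>'):
            fields = line[1:].split()
            if len(fields) > 1:
                # '>id description ... last' becomes '>last id description ...'
                rest = ' '.join(fields[:-1])
                g.write('>' + fields[-1] + ' ' + rest + '\n')
            else:
                # Single-field header: nothing to reorder
                g.write(line)
        else:
            g.write(line)

$ python EditFasta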