problem running findtirs

JennyHTLee commented 5 years ago

Hello,

I ran 'tephra all' and got an error at the findtirs step:

INFO - Command - 'tephra findtirs' started at: 10-08-2019 04:52:59.
ERROR - 'gt tirvish' failed exit value: 1. Here is the output: warning: terminal_inverted_repeat_element (generated, line 0) is too short to be translated (0 nt), skipped domain search

When running findtirs alone:

'gt tirvish' failed exit value: 1. Here is the output: warning: terminal_inverted_repeat_element (generated, line 0) is too short to be translated (0 nt), skipped domain search
/root/.tephra/gt/bin/gt tirvish: error: query seqid 'chrlg1' could match more than one sequence description

The other steps look fine based on the output files. This is the log:

root@bp:~/tephra/db# grep "Output\|Command" tephra_full.log
Command - 'tephra findltrs' started at: 06-08-2019 12:38:21.
Command - 'tephra findltrs' started at: 06-08-2019 12:38:39.
Command - 'tephra findltrs' completed at: 08-08-2019 14:42:41.
Output files - /db/genome_v2_tephra_ltrs.gff3
Output files - /db/genome_v2_tephra_ltrs.fasta
Command - 'tephra maskref' for LTRs started at: 08-08-2019 14:42:41.
Command - 'tephra maskref' completed at: 08-08-2019 18:27:24. Final output file:
Output files - /db/genome_v2_tephra_masked.fasta
Command - 'tephra findtrims' started at: 08-08-2019 18:27:24.
Command - 'tephra findtrims' completed at: 09-08-2019 16:39:19.
Command - 'tephra classifyltrs' started at: 09-08-2019 16:39:24.
Command - 'tephra classifyltrs' completed at: 09-08-2019 18:19:09.
Output files - /db/genome_v2_tephra_ltrs_trims_classified.gff3
Output files - /db/genome_v2_tephra_ltrs_trims_classified.fasta
Command - 'tephra age' started at: 09-08-2019 18:19:09.
Command - 'tephra age' completed at: 09-08-2019 18:20:09.
Output files - /db/genome_v2_tephra_ltrages.tsv
Command - 'tephra maskref' for classified LTRs/TRIMs started at: 09-08-2019 18:20:09.
Command - 'tephra maskref' completed at: 09-08-2019 20:22:53. Final output file:
Output files - /db/genome_v2_tephra_masked2.fasta
Command - 'tephra sololtr' started at: 09-08-2019 20:22:53.
Command - 'tephra sololtr' completed at: 09-08-2019 21:38:58.
Output files - /db/genome_v2_tephra_sololtrs.gff3
Output files - /db/genome_v2_tephra_sololtrs_rep.tsv
Output files - /db/genome_v2_tephra_sololtrs_seqs.fasta
Command - 'tephra illrecomb' started at: 09-08-2019 21:38:58.
Command - 'tephra illrecomb' completed at: 10-08-2019 00:05:24.
Output files - /db/genome_v2_tephra_illrecomb.fasta
Output files - /db/genome_v2_tephra_illrecomb_rep.tsv
Output files - /db/genome_v2_tephra_illrecomb_stats.tsv
Command - 'tephra findhelitrons' started at: 10-08-2019 00:05:24.
Command - 'tephra findhelitrons' completed at: 10-08-2019 04:30:59.
Output files - /db/genome_v2_tephra_helitrons.gff3
Output files - /db/genome_v2_tephra_helitrons.fasta
Command - 'tephra maskref' for Helitrons started at: 10-08-2019 04:31:01.
Command - 'tephra maskref' completed at: 10-08-2019 04:52:59. Final output file:
Output files - /db/genome_v2_tephra_masked3.fasta
Command - 'tephra findtirs' started at: 10-08-2019 04:52:59.
Command - 'tephra findtirs' completed at: 10-08-2019 05:32:57.
Command - 'tephra classifytirs' started at: 10-08-2019 05:32:57.
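
To pick out steps that were started but never logged as completed, the log above can be summarized with a small filter. This is only a sketch: `unfinished_steps` is a hypothetical helper name, and it assumes every relevant line has the form "Command - '<name>' ... started at:" or "... completed at:" as shown in this log.

```shell
#!/bin/sh
# Hypothetical helper: print each command that has more "started" lines
# than "completed" lines in a Tephra log (format assumed from this thread).
unfinished_steps() {
    awk -F"'" '
        # With a single quote as field separator, $2 is the quoted command name.
        /Command - / && / started at: /   { started[$2]++ }
        /Command - / && / completed at: / { completed[$2]++ }
        END {
            for (cmd in started)
                if (started[cmd] > completed[cmd] + 0)
                    print cmd
        }' "$1" | sort
}

# Example (file name taken from the grep command above):
#   unfinished_steps tephra_full.log
```

A command that was restarted also shows up here, because it then has two "started" lines against one "completed" line.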

I used the Tephra Docker version:

tephra (Tephra) version 0.12.4 (/usr/local/bin/tephra)

Thanks for your help

Regards, Jenny

sestaton commented 5 years ago

Hi,

It looks like you may have some duplicate IDs in your genome file. We can check that with the commands below:

grep ">" genome.fas | sed 's/>//' | sort -u | wc -l

and

grep -c ">" genome.fas

If the outputs of those two commands differ, there may be duplicates or some other issue. Seeing the ID format would be helpful too:

grep ">" genome.fas | head
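
The two counts only show whether duplicates exist, not which IDs are duplicated. As a rough sketch, the duplicated IDs can be listed directly. `dup_ids` is a hypothetical helper name, and it assumes the ID is the first whitespace-delimited word of each header line.

```shell
#!/bin/sh
# Hypothetical helper: print every FASTA ID that occurs more than once.
# Assumes the ID is the first word after ">" on each header line.
dup_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}

# Example (file name taken from the commands above):
#   dup_ids genome.fas
```

Empty output means no ID is repeated under that assumption.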

Thanks.

JennyHTLee commented 5 years ago

Thanks for your reply,

There seem to be no duplicated IDs. I am not sure the IDs are the real issue, because shortening/editing the IDs did not solve the problem. It is also probably not one particular sequence causing this, as the same error occurred with both the full set and a subset.

What other possibilities are there? The run halted while the IDs were being written to the GFF: there are 809 IDs in total and it stopped at 631.

Best regards, Jenny

sestaton commented 5 years ago

Hi Jenny,

It appears there is something odd with the IDs or sequences, and based on the message I thought it might be caused by duplicate IDs. That is not the case, though, so it must be something else. It could also be something in the code, but it is hard to say.
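
One further check that may be worth trying, offered purely as an assumption and not confirmed anywhere in this thread: the message "could match more than one sequence description" would also fit a case where an ID such as 'chrlg1' is the beginning of another ID (for example a hypothetical 'chrlg10'), if IDs are matched against descriptions by partial match. The sketch below lists such pairs. `prefix_ids` is a hypothetical helper name, and the ID is again taken to be the first word of each header.

```shell
#!/bin/sh
# Hypothetical helper: report FASTA IDs that are a prefix of another ID.
# Assumes the ID is the first word after ">" on each header line.
prefix_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort -u |
    awk '{ ids[NR] = $0 }
         END {
             # Quadratic pairwise comparison; fine for a few hundred IDs.
             for (i = 1; i <= NR; i++)
                 for (j = 1; j <= NR; j++)
                     if (i != j && index(ids[j], ids[i]) == 1)
                         print ids[i] " is a prefix of " ids[j]
         }'
}

# Example:
#   prefix_ids genome.fas
```

If this prints nothing, prefix overlap between IDs can be ruled out as a cause.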

Can you share the file with me? I'd like to test it myself because that may be faster than trying to propose solutions from a distance.

Thanks, Evan

JennyHTLee commented 5 years ago

Hi Evan,

Sure, I've shared the file "genome.fasta.gz" through fex via your email at evanstaton.com

Thanks for your help!

Best regards, Jenny

sestaton commented 5 years ago

Just FYI, I can recreate the error. This is something in the GenomeTools library and not Tephra, so it's not immediately clear how to resolve it. I will likely have to reduce the error to the problematic sequence and raise the issue to that group, but I will keep this issue updated as I find out more.

Thanks, Evan

sestaton / tephra

problem running findtirs #42