simroux / VirSorter

Source code of the VirSorter tool, also available as an App on CyVerse/iVirus (https://de.iplantcollaborative.org/de/)
GNU General Public License v2.0

Backtracking to original contig name? #51

Closed jdwinkler-lanzatech closed 4 years ago

jdwinkler-lanzatech commented 4 years ago

Hi,

Thanks for your hard work on VirSorter. I'm reconciling the GenBank file produced by VirSorter with those generated by other tools for a large genome annotation project. To do that, I need to convert the renamed contigs VirSorter uses back to their original names. Is there any way to force VirSorter to output the original names instead of the altered ones?

For example, NC_014328.1 becomes "NC_014328_1_Clos" (truncated to avoid exceeding the NCBI locus tag name length). I also have assemblies with contig names like contig_1, so regenerating the original name is unfortunately not simple.

Let me know if this isn't clear!

simroux commented 4 years ago

Hi,

Good point. Unfortunately, there is no easy way to backtrack to the original contig name (it's a flaw in the original design of VirSorter :-/).

Basically, VirSorter transforms all "special" characters (/ . , | ? ! * % and spaces) into underscores (_). If you know the format of your sequence names, you can guess the original name (e.g. the dot between 014328 and 1 in the ID you provided).
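To illustrate why this renaming cannot be reversed in general, here is a minimal Python sketch of the substitution described above. It is not VirSorter's actual code: the character set is taken from the list in this comment, the function names `sanitize_id` and `find_collisions` are hypothetical, and the truncation mentioned earlier in the thread is not modeled.

```python
import re

# Characters listed in this thread as being replaced by underscores.
# Assumption: VirSorter's internal list may differ from this one.
SPECIAL_CHARS = re.compile(r"[/.,|?!*% ]")


def sanitize_id(original_id):
    """Replace each listed special character with an underscore."""
    return SPECIAL_CHARS.sub("_", original_id)


def find_collisions(original_ids):
    """Group original ids that end up with the same sanitized id."""
    groups = {}
    for original_id in original_ids:
        groups.setdefault(sanitize_id(original_id), []).append(original_id)
    # Keep only sanitized ids shared by more than one original id.
    return {new: olds for new, olds in groups.items() if len(olds) > 1}
```

Because several different characters all map to the same underscore, two distinct input names (say `contig.1` and `contig_1`) can collapse to one name, which is why guessing only works when you already know the naming format.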

Alternatively, I just committed a new release in which the fasta file directory in VirSorter now includes a file named "input_sequences_id_translation.tsv", with two tab-separated columns: original ID and VirSorter ID.
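As a rough sketch of how that translation file could be used, the Python below loads it into a lookup table keyed by VirSorter ID. The column order follows the description above; the absence of a header row is an assumption, and the helper names `load_id_translation` and `restore_id` are hypothetical.

```python
import csv


def load_id_translation(tsv_path):
    """Return {virsorter_id: original_id} from the two-column TSV file.

    Assumptions: column 1 is the original id, column 2 is the VirSorter id,
    and the file has no header row.
    """
    mapping = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            original_id, virsorter_id = row[0], row[1]
            mapping[virsorter_id] = original_id
    return mapping


def restore_id(virsorter_id, mapping):
    """Look up the original id; return the input unchanged if it is unknown."""
    return mapping.get(virsorter_id, virsorter_id)
```

Typical use would be `mapping = load_id_translation("input_sequences_id_translation.tsv")` followed by `restore_id(name, mapping)` for each contig name. If the names in the GenBank output are truncated or otherwise decorated relative to the IDs in the file, an exact-match lookup like this one would need adjusting.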

Let me know if that helps!

Best, Simon

jdwinkler-lanzatech commented 4 years ago

Yes, that helps quite a bit! Thank you. One enhancement that came to mind while I was fixing this was direct GFF output. That format is much more tolerant of changes than GBK, and the files output by VirSorter are technically invalid because of the locus field name length. It's a common issue if you supply assembler-generated contig names to downstream tools, so it's nothing specific to VirSorter.

simroux commented 4 years ago

Right, GenBank was another early choice that, in hindsight, I would change, but doing so would require much bigger changes. I'll definitely put this on the list, but I can't commit to any timeline :-)