hoelzer-lab / hypro

Extend hypothetical prokka protein annotations using additional homology searches against larger databases
GNU General Public License v3.0

Nextflow: prokka fails for files with too long contig names #30

Closed hoelzer closed 3 years ago

hoelzer commented 3 years ago

example FASTA: SRR10176980_polished.fasta.gz

Command:

nextflow run main.nf --fasta SRR10176980_polished.fasta 

Error:

  [16:27:28] Loading and checking input file: SRR10176980_polished.fasta
  [16:27:28] Contig ID must <= 37 chars long: NODE_1_length_237067_cov_73.923043_pilon_pilon
  [16:27:28] Please rename your contigs OR try '--centre X --compliant' to generate clean contig names.
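A quick way to check a FASTA for this problem up front could look like the following. This is only a minimal sketch: the function name `overlong_contig_ids` is made up for illustration, the limit of 37 is taken from the error message above, and the contig ID is assumed to be the first whitespace-separated word of the header.

```python
MAX_CONTIG_ID_LEN = 37  # limit quoted in the Prokka error message above


def overlong_contig_ids(fasta_lines, max_len=MAX_CONTIG_ID_LEN):
    """Return contig IDs (first word of each header line) longer than max_len."""
    too_long = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            contig_id = fields[0] if fields else ""
            if len(contig_id) > max_len:
                too_long.append(contig_id)
    return too_long


# Hypothetical in-memory example; a real file would be read line by line.
example = [
    ">NODE_1_length_237067_cov_73.923043_pilon_pilon",
    "ACGT",
    ">short_id",
    "ACGT",
]
print(overlong_contig_ids(example))
```

If the returned list is non-empty, the contigs need to be renamed (or `--centre X --compliant` used) before Prokka will accept the file.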

@EvaFriederike I suggest you try something like:

prokka --cpus ${task.cpus} --centre X --compliant --outdir prokka --prefix ${name} ${params.prokka} ${fasta}

But then we should also think of a way to restore the original contig names after the annotation, because users might want to see their original contig IDs in the output.

As an alternative, we could always rename the FASTA IDs in the first step, e.g. as done here:

https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/nextflow/modules/rename.nf

https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/nextflow/modules/restore.nf

There, a Python script does the renaming and also stores a map, so the original IDs in the FASTA can be restored later.

The python script lives in a bin folder:

https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/bin/rename_fasta.py

from where Nextflow can automatically access it.
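To illustrate the idea of renaming plus storing a map, here is a minimal sketch. It is not the `rename_fasta.py` from emg-viral-pipeline; the function names, the `contig_N` naming scheme and the two-column TSV layout are assumptions chosen for illustration.

```python
def rename_contigs(fasta_lines, prefix="contig"):
    """Replace every FASTA header with a short ID.

    Returns (renamed_lines, mapping), where mapping maps the new short ID
    to the full original header (without the leading '>').
    """
    renamed = []
    mapping = {}
    counter = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counter += 1
            new_id = f"{prefix}_{counter}"
            mapping[new_id] = line[1:]  # keep the original header for later
            renamed.append(f">{new_id}")
        else:
            renamed.append(line)  # sequence lines are passed through unchanged
    return renamed, mapping


def mapping_to_tsv(mapping):
    """Serialise the mapping as 'new_id<TAB>original_header' lines."""
    return "\n".join(f"{new}\t{old}" for new, old in mapping.items())


# Hypothetical in-memory example.
fasta = [
    ">NODE_1_length_237067_cov_73.923043_pilon_pilon",
    "ACGT",
    ">NODE_2 some description",
    "GGCC",
]
renamed, id_map = rename_contigs(fasta)
print("\n".join(renamed))
print(mapping_to_tsv(id_map))
```

The short IDs stay well below the 37-character limit, and the TSV map is what a later step would read to put the original names back.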

In addition, we can add a parameter to the nextflow.config that passes extra options through to prokka:

    prokka = ''

By default the param is empty, but it's important e.g. when someone wants to annotate a bacterial genome that does not use the standard genetic code (e.g. Mycoplasma bovis). Then they can use:

nextflow run hoelzer-lab/hypro --fasta foo.fasta --prokka '--gcode 4'

or so

EvaFriederike commented 3 years ago

This issue should be fixed now. I used the script from emg-viral-pipeline for renaming the contig IDs before the prokka annotation step and wrote a bash script for mapping the original contig IDs back right after. The processes are called rename.nf and restore.nf, the corresponding scripts lie in the bin/ dir.

The prokka parameter is now also part of the nextflow configuration.

hoelzer commented 3 years ago

Great! And you are renaming the IDs in all the Prokka output files (fasta, gff, ...), so that the final Prokka output matches the original input contig IDs?

EvaFriederike commented 3 years ago

Yes, all prokka output files are scanned for the renamed contig IDs and restored via the mapping file.
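For readers who want to see what such a restore step can look like: the actual implementation here is a bash script in bin/, so the following Python is only a rough sketch of the same idea. The function name `restore_ids` and the `contig_N` short IDs are assumptions for illustration.

```python
import re


def restore_ids(text, mapping):
    """Replace every short contig ID in text with its original ID.

    mapping maps short ID -> original ID. Matching is done on whole words
    so that 'contig_1' does not match inside 'contig_10'.
    """
    if not mapping:
        return text
    # Longest IDs first, so the alternation prefers the most specific match.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    # A function is used as the replacement so original IDs are inserted literally.
    return pattern.sub(lambda m: mapping[m.group(1)], text)


# Hypothetical GFF-like snippet with renamed contig IDs.
id_map = {
    "contig_1": "NODE_1_length_237067_cov_73.923043_pilon_pilon",
    "contig_10": "NODE_10_length_1000_cov_5.0_pilon_pilon",
}
gff = "contig_1\tProdigal\tCDS\t1\t90\ncontig_10\tProdigal\tCDS\t5\t50"
print(restore_ids(gff, id_map))
```

The same function can be applied to the content of each text-based output file (fasta, gff, ...) in turn, which is why the mapping file has to be kept around until after the annotation step.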

hoelzer commented 3 years ago

ok great!