padlocbio / padloc

Locate antiviral defence systems in prokaryotic genomes
MIT License

ERROR >> errexit on line 406 #23

Closed (htomelka closed 1 year ago)

htomelka commented 2 years ago

Hi!

I've got the "same" issue as reported in #16. I ran PADLOC on this data (6858.zip) and got this:

[11:47:23] DEBUG >> Reading 6858.gff
Error in spread():
! Each row of output must be identified by a unique combination of keys.
Keys are shared for 60 rows:

I've checked my FAA and GFF files, but I haven't seen any duplicates.

R version is 4.1.0, tidyverse is 1.3.1, yaml is 2.2.1 and getopt is 1.20.3.

I'm probably missing something obvious but can't see what...

If you have a solution, Thanks!

leightonpayne commented 1 year ago

Hi,

The error appears to arise because some features in your GFF have multiple entries. For example, the CDS entries at lines 4484 and 4486 of your GFF (pasted below) describe the same CDS (1049533-1050180) with two different annotations, whereas PADLOC expects only one GFF entry per feature.

NC_012483.1 feature CDS 1049533 1050180 .   -   0   EC_number=3.5.4.16;ID=29966363;db_xref=MaGe:29966363;gene=folE;inference=ab initio prediction:AMIGene:2.0;locus_tag=29966363;note=identified by match to protein family HMM PF01227%3B match to protein family HMM TIGR00063;product=GTP cyclohydrolase I;transl_table=11;translation=MKRGPMATISLQDKLSRNSGSVPPALEKYSTQEIYAELLRRYDEDPTRDGLLRTPERVEKAMKYLTQGYHQEPAGILQGALFDVDYDEMVLVKDIEMFSLCEHHMLPFFGRVHVAYIPNGKVVGLSKIPRLVEVFARRLQVQERMTRQIAEAIQDAINPQGVGVVIEARHLCMMMRGVEKQNSSTVTSAMLGVFQQQNTRGEFLSLVRDRSYQQL
NC_012483.1 feature CDS 1049533 1050180 .   -   0   ID=29968974;db_xref=MaGe:29968974;inference=ab initio prediction:AMIGene:2.0;locus_tag=29968974;note=putative 6-pyruvoyl tetrahydropterin synthase%2C authentic frameshift%3B this gene contains a frame shift which is not the result of sequencing error%3B identified by match to protein family HMM PF01242;transl_table=11;translation=MKRGPMATISLQDKLSRNSGSVPPALEKYSTQEIYAELLRRYDEDPTRDGLLRTPERVEKAMKYLTQGYHQEPAGILQGALFDVDYDEMVLVKDIEMFSLCEHHMLPFFGRVHVAYIPNGKVVGLSKIPRLVEVFARRLQVQERMTRQIAEAIQDAINPQGVGVVIEARHLCMMMRGVEKQNSSTVTSAMLGVFQQQNTRGEFLSLVRDRSYQQL

Here are the details of all 4 CDS that have duplicate entries.
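
For anyone wanting to check their own files, here is a minimal Python sketch of how entries like these could be flagged. It is not part of PADLOC; the function name `find_duplicate_features` is hypothetical, and it assumes a tab-separated GFF where two entries count as "the same feature" if they share sequence ID, start, end and strand.

```python
from collections import defaultdict


def find_duplicate_features(gff_lines, feature_type="CDS"):
    """Map (seqid, start, end, strand) to the 1-based line numbers that share it.

    Only coordinates that occur more than once are returned.
    Hypothetical helper for illustration; not a PADLOC function.
    """
    seen = defaultdict(list)
    for line_no, line in enumerate(gff_lines, start=1):
        # Skip blank lines, comments and directives such as ##gff-version.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GFF feature line has 9 columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        key = (fields[0], int(fields[3]), int(fields[4]), fields[6])
        seen[key].append(line_no)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}
```

Calling it on an open file handle, e.g. `find_duplicate_features(open("6858.gff"))`, returns a dictionary keyed by coordinates, so an empty result means no CDS shares coordinates with another.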

I've not seen this in a GFF file before. I'm assuming it's an artifact of the gene caller or annotation software you used to generate these files. Could you tell me how you generated them so I can investigate further?

I'm hesitant to implement anything that would filter these conflicts automatically as it could lead to unexpected results.

For now, I removed the second occurrence of each duplicate from the GFF and FAA files, and this resolved the issue. Here's the input and the results generated.
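
The manual clean-up described above could be sketched in Python roughly as follows. This is a pre-processing idea only, not something PADLOC does. Both function names are hypothetical, keeping the first entry is an arbitrary choice, and the sketch assumes that each FAA header starts with the same ID as the GFF `ID` attribute, which may not hold for every annotation pipeline.

```python
def drop_repeated_features(gff_lines, feature_type="CDS"):
    """Keep the first entry per (seqid, start, end, strand).

    Returns (kept_lines, dropped_ids), where dropped_ids holds the ID
    attribute of every removed entry. Hypothetical helper, not PADLOC code.
    """
    seen = set()
    kept, dropped_ids = [], []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        # Pass through comments, directives and other feature types unchanged.
        if line.startswith("#") or len(fields) < 9 or fields[2] != feature_type:
            kept.append(line)
            continue
        key = (fields[0], fields[3], fields[4], fields[6])
        if key in seen:
            # Parse the attribute column (key=value pairs separated by ';').
            attrs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
            dropped_ids.append(attrs.get("ID"))
            continue
        seen.add(key)
        kept.append(line)
    return kept, dropped_ids


def drop_fasta_records(faa_lines, ids_to_drop):
    """Remove FASTA records whose header's first token is in ids_to_drop.

    Assumes the first token of each header matches the GFF ID attribute.
    """
    ids_to_drop = set(ids_to_drop)
    kept, skipping = [], False
    for line in faa_lines:
        if line.startswith(">"):
            header = line[1:].split()
            skipping = bool(header) and header[0] in ids_to_drop
        if not skipping:
            kept.append(line)
    return kept
```

It is worth reviewing the dropped IDs by hand before running PADLOC on the filtered files, since the two annotations of a shared CDS can differ and the sketch cannot tell which one is correct.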

Cheers.