Closed htomelka closed 1 year ago
Hi,
The error appears to arise because some features in your GFF have multiple entries. For example, the CDS entries at lines 4484 and 4486 of your GFF (pasted below) describe the same CDS (1049533-1050180) with two different annotations, whereas PADLOC expects only one GFF entry per feature.
NC_012483.1 feature CDS 1049533 1050180 . - 0 EC_number=3.5.4.16;ID=29966363;db_xref=MaGe:29966363;gene=folE;inference=ab initio prediction:AMIGene:2.0;locus_tag=29966363;note=identified by match to protein family HMM PF01227%3B match to protein family HMM TIGR00063;product=GTP cyclohydrolase I;transl_table=11;translation=MKRGPMATISLQDKLSRNSGSVPPALEKYSTQEIYAELLRRYDEDPTRDGLLRTPERVEKAMKYLTQGYHQEPAGILQGALFDVDYDEMVLVKDIEMFSLCEHHMLPFFGRVHVAYIPNGKVVGLSKIPRLVEVFARRLQVQERMTRQIAEAIQDAINPQGVGVVIEARHLCMMMRGVEKQNSSTVTSAMLGVFQQQNTRGEFLSLVRDRSYQQL
NC_012483.1 feature CDS 1049533 1050180 . - 0 ID=29968974;db_xref=MaGe:29968974;inference=ab initio prediction:AMIGene:2.0;locus_tag=29968974;note=putative 6-pyruvoyl tetrahydropterin synthase%2C authentic frameshift%3B this gene contains a frame shift which is not the result of sequencing error%3B identified by match to protein family HMM PF01242;transl_table=11;translation=MKRGPMATISLQDKLSRNSGSVPPALEKYSTQEIYAELLRRYDEDPTRDGLLRTPERVEKAMKYLTQGYHQEPAGILQGALFDVDYDEMVLVKDIEMFSLCEHHMLPFFGRVHVAYIPNGKVVGLSKIPRLVEVFARRLQVQERMTRQIAEAIQDAINPQGVGVVIEARHLCMMMRGVEKQNSSTVTSAMLGVFQQQNTRGEFLSLVRDRSYQQL
Here are the details of all 4 CDS that have duplicate entries.
I've not seen this in a GFF file before. I'm assuming it's an artifact of the gene caller or annotation software used to generate these files. Could you tell me how you generated them, so I can investigate further?
I'm hesitant to implement anything that would filter these conflicts automatically, as it could lead to unexpected results.
For now, I removed the second occurrence of each duplicate from the GFF and FAA files, and this resolved the issue. Here's the input and the results generated.
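For anyone who wants to check their own GFF for this situation before running PADLOC, here is a minimal Python sketch. It is not part of PADLOC. It assumes a standard tab-separated, 9-column GFF and treats two entries as the same feature when they share sequence ID, feature type, start, end, and strand. The function names (`feature_key`, `find_duplicates`, `drop_later_duplicates`) are made up for this illustration.

```python
from collections import defaultdict


def feature_key(line):
    """Return (seqid, type, start, end, strand) for a GFF feature line.

    Returns None for blank lines, comment/directive lines, and lines
    that do not have the 9 tab-separated GFF columns.
    """
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return None
    return (cols[0], cols[2], cols[3], cols[4], cols[6])


def find_duplicates(lines):
    """Map each repeated feature key to the 1-based line numbers where it occurs."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        key = feature_key(line)
        if key is not None:
            seen[key].append(number)
    return {key: numbers for key, numbers in seen.items() if len(numbers) > 1}


def drop_later_duplicates(lines):
    """Keep the first entry for each feature key and drop later repeats.

    Non-feature lines (comments, directives, malformed lines) are kept as-is.
    """
    kept = []
    seen = set()
    for line in lines:
        key = feature_key(line)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        kept.append(line)
    return kept
```

Reading the file with `open(path).readlines()` and passing the result to `find_duplicates` lists the conflicting line numbers so they can be reviewed by hand. `drop_later_duplicates` mirrors the manual workaround above (keep the first occurrence), but it only handles the GFF; the matching FAA records would still need to be removed separately, and which annotation to keep remains a judgment call.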
Cheers.
Hi!
I've got the "same" issue as reported in #16. I ran PADLOC on this data (6858.zip) and got this:
[11:47:23] DEBUG >> Reading 6858.gff
Error in spread(): ! Each row of output must be identified by a unique combination of keys. Keys are shared for 60 rows:
I've checked my FAA and my GFF, but I haven't seen any duplicates.
R version is 4.1.0, tidyverse is 1.3.1, yaml is 2.2.1 and getopt is 1.20.3.
I'm probably missing something obvious but can't see what...
If you have a solution, thanks!