ERROR >> errexit on line 406

Hi!

I've got the "same" issue as reported in #16. I ran PADLOC on this data (6858.zip) and got this:

[11:47:23] DEBUG >> Reading 6858.gff
Error in spread():
! Each row of output must be identified by a unique combination of keys.
Keys are shared for 60 rows:

8071, 8080
9456, 9465
29314, 29322
31962, 31970
8072, 8081
9457, 9466
29315, 29323
31963, 31971
9458, 9467
8074, 8082
9459, 9468
29316, 29324
31964, 31972
8075, 8083
9460, 9469
29317, 29325
31965, 31973
8076, 8084
9461, 9470
29318, 29326
31966, 31974
9462, 9471
8078, 8085
9463, 9472
29320, 29327
31968, 31975
8079, 8086
9464, 9473
29321, 29328
31969, 31976
Backtrace:
    x
1. +-gff %>% filter(type == "CDS") %>% separate_attributes()
2. +-global separate_attributes(.)
3. | -... %>% spread(key = key, value = value, fill = NA)
4. +-tidyr::spread(., key = key, value = value, fill = NA)
5. -tidyr:::spread.data.frame(., key = key, value = value, fill = NA)
6. -rlang::abort(...)
Execution halted
[11:47:24] ERROR >> errexit on line 406

I've checked my FAA and GFF files, but I haven't seen any duplicates.

R version is 4.1.0, tidyverse is 1.3.1, yaml is 2.2.1 and getopt is 1.20.3.

I'm probably missing something obvious but can't see what...

If you have a solution, thanks!

Hi,

The error appears to arise because some features in your GFF have multiple entries. For example, the CDS entries at lines 4484 and 4486 of your GFF (pasted below) describe the same CDS (1049533-1050180) with two different annotations, whereas PADLOC expects only one GFF entry per feature.

NC_012483.1 feature CDS 1049533 1050180 .   -   0   EC_number=3.5.4.16;ID=29966363;db_xref=MaGe:29966363;gene=folE;inference=ab initio prediction:AMIGene:2.0;locus_tag=29966363;note=identified by match to protein family HMM PF01227%3B match to protein family HMM TIGR00063;product=GTP cyclohydrolase I;transl_table=11;translation=MKRGPMATISLQDKLSRNSGSVPPALEKYSTQEIYAELLRRYDEDPTRDGLLRTPERVEKAMKYLTQGYHQEPAGILQGALFDVDYDEMVLVKDIEMFSLCEHHMLPFFGRVHVAYIPNGKVVGLSKIPRLVEVFARRLQVQERMTRQIAEAIQDAINPQGVGVVIEARHLCMMMRGVEKQNSSTVTSAMLGVFQQQNTRGEFLSLVRDRSYQQL
NC_012483.1 feature CDS 1049533 1050180 .   -   0   ID=29968974;db_xref=MaGe:29968974;inference=ab initio prediction:AMIGene:2.0;locus_tag=29968974;note=putative 6-pyruvoyl tetrahydropterin synthase%2C authentic frameshift%3B this gene contains a frame shift which is not the result of sequencing error%3B identified by match to protein family HMM PF01242;transl_table=11;translation=MKRGPMATISLQDKLSRNSGSVPPALEKYSTQEIYAELLRRYDEDPTRDGLLRTPERVEKAMKYLTQGYHQEPAGILQGALFDVDYDEMVLVKDIEMFSLCEHHMLPFFGRVHVAYIPNGKVVGLSKIPRLVEVFARRLQVQERMTRQIAEAIQDAINPQGVGVVIEARHLCMMMRGVEKQNSSTVTSAMLGVFQQQNTRGEFLSLVRDRSYQQL

Here are the details of all 4 CDS that have duplicate entries.
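For anyone wanting to check their own GFF for this, here is a minimal sketch in Python (not part of PADLOC). It assumes a tab-separated GFF and treats two CDS rows as duplicates when they share the same sequence ID, start, end and strand. The function name `find_duplicate_cds` is made up for illustration, the attribute columns of the two real rows are shortened, and the third sample row is hypothetical.

```python
from collections import defaultdict


def find_duplicate_cds(gff_lines):
    """Return {(seqid, start, end, strand): [line numbers]} for CDS rows
    that share the same coordinates (assumption: this defines a duplicate)."""
    groups = defaultdict(list)
    for line_no, line in enumerate(gff_lines, start=1):
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # A GFF feature row has 9 columns; column 3 is the feature type.
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        key = (cols[0], int(cols[3]), int(cols[4]), cols[6])
        groups[key].append(line_no)
    # Keep only coordinates that occur more than once.
    return {key: nums for key, nums in groups.items() if len(nums) > 1}


# Two rows based on the entries quoted above (attributes shortened),
# plus one hypothetical, non-duplicated row.
sample = [
    "NC_012483.1\tfeature\tCDS\t1049533\t1050180\t.\t-\t0\tID=29966363;gene=folE",
    "NC_012483.1\tfeature\tCDS\t1049533\t1050180\t.\t-\t0\tID=29968974",
    "NC_012483.1\tfeature\tCDS\t1050200\t1050900\t.\t+\t0\tID=hypothetical_1",
]

duplicates = find_duplicate_cds(sample)
for key, line_numbers in duplicates.items():
    print(key, line_numbers)
```

On a real file, pass `open("6858.gff")` instead of `sample`; the reported line numbers then point at the rows to inspect by hand.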

I've not seen this in a GFF file before. I'm assuming it's an artifact of the gene-caller or annotation software you used to generate these files? Are you able to tell me how you generated them so I can investigate further?

I'm hesitant to implement anything that would filter these conflicts automatically as it could lead to unexpected results.

For now, I removed the second occurrence of each duplicate from the GFF and FAA files, and this resolved the issue. Here's the input and the results generated.
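The manual clean-up described above could be scripted roughly as follows. This is only a sketch under assumptions, not something PADLOC does: it keeps the first CDS row for each set of coordinates, and it assumes that each FAA header starts with the same value as the GFF `ID` attribute. The function names and the shortened sequences are illustrative. Since always keeping the first annotation may not be the right choice, check the dropped IDs before using the output.

```python
def dedupe_gff(gff_lines):
    """Keep the first CDS row per (seqid, start, end, strand).
    Returns (kept_lines, ids_of_dropped_rows)."""
    seen = set()
    kept, dropped_ids = [], []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        is_cds = not line.startswith("#") and len(cols) >= 9 and cols[2] == "CDS"
        if is_cds:
            key = (cols[0], cols[3], cols[4], cols[6])
            if key in seen:
                # Parse "key=value;key=value" attributes to recover the ID.
                attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
                dropped_ids.append(attrs.get("ID"))
                continue
            seen.add(key)
        kept.append(line)
    return kept, dropped_ids


def drop_fasta_records(faa_lines, ids_to_drop):
    """Remove FASTA records whose header ID (first token after '>')
    is in ids_to_drop. Assumes header IDs match the GFF ID attribute."""
    ids_to_drop = set(ids_to_drop)
    kept, skipping = [], False
    for line in faa_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            skipping = bool(tokens) and tokens[0] in ids_to_drop
        if not skipping:
            kept.append(line)
    return kept


# Shortened versions of the two entries quoted earlier in the thread.
gff = [
    "NC_012483.1\tfeature\tCDS\t1049533\t1050180\t.\t-\t0\tID=29966363;gene=folE",
    "NC_012483.1\tfeature\tCDS\t1049533\t1050180\t.\t-\t0\tID=29968974",
]
faa = [">29966363", "MKRGPMATISLQDK", ">29968974", "MKRGPMATISLQDK"]

clean_gff, dropped = dedupe_gff(gff)
clean_faa = drop_fasta_records(faa, dropped)
print(dropped)
print(clean_faa)
```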

Cheers.
