padlocbio / padloc

Locate antiviral defence systems in prokaryotic genomes
MIT License

"ERROR >> errexit on line 386" when running on virus data #16

Closed: boweny920 closed this issue 2 years ago

boweny920 commented 2 years ago

Hi, the error below popped up for some of my viral fna data. The tool seems to work for some fna files but not for others. Is this caused by the nature of the data or by the tool itself? Thanks!!

Backtrace: █

  1. ├─gff %>% filter(type == "CDS") %>% separate_attributes()
  2. ├─global::separate_attributes(.)
  3. │ └─%>%(...)
  4. ├─tidyr::spread(., key = key, value = value, fill = NA)
  5. └─tidyr:::spread.data.frame(., key = key, value = value, fill = NA)
Execution halted
(13:15:04) ERROR >> errexit on line 386

JacksonLab commented 2 years ago

Hello, do you have a failing example file that you wouldn't mind sharing? Cheers, Simon

boweny920 commented 2 years ago

Of course. I have attached an example file below for your reference, and thanks very much for the prompt reply!

example_file.fna.zip

Bowen

JacksonLab commented 2 years ago

Hi Bowen,

Apologies for the delay, this slipped off the radar. I ran a test, and the debug log leading up to the error reads:

[03:13:22 PM] DEBUG >> Reading hmm_meta.txt
[03:13:22 PM] DEBUG >> Reading sys_meta.txt
[03:13:22 PM] DEBUG >> Reading c7b4b974-b55a-41ff-8996-4cc499720395.domtblout
[03:13:26 PM] DEBUG >> Reading c7b4b974-b55a-41ff-8996-4cc499720395_prodigal.gff
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 47138 rows:

I checked the example input FASTA you gave, and it contains multiple sequences (likely duplicates) with the same FASTA ID. For example, ">UGV-GENOME-0115401" appears three times. I'll add another input parsing step that enforces unique contig/sequence names and warns when this condition is not met.

Cheers, Simon
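
Until such a check is available in the tool, duplicated FASTA IDs can be spotted before running it. The following is a minimal Python sketch of that kind of check, not padloc's actual implementation. The function names `fasta_ids` and `duplicate_ids` are hypothetical, the ID is assumed to be the first whitespace-delimited token after ">", and the sequences in the example are made-up placeholders (only the ID "UGV-GENOME-0115401" comes from this thread).

```python
import io
from collections import Counter


def fasta_ids(handle):
    """Yield the ID of each FASTA record in an open text handle.

    Assumption: the ID is the first whitespace-delimited token
    on the header line, after the leading '>'.
    """
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicate_ids(handle):
    """Return {sequence_id: count} for every ID seen more than once."""
    counts = Counter(fasta_ids(handle))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Placeholder records for illustration; the sequences are invented.
    example = io.StringIO(
        ">UGV-GENOME-0115401\nACGT\n"
        ">UGV-GENOME-0115401\nACGT\n"
        ">contig_2 some description\nGGCC\n"
        ">UGV-GENOME-0115401\nACGT\n"
    )
    for seq_id, n in duplicate_ids(example).items():
        print(f"WARNING: sequence ID {seq_id!r} appears {n} times")
```

To check a real file, pass an open handle instead of the in-memory example, e.g. `with open("input.fna") as fh: print(duplicate_ids(fh))`, where the file name is whatever your input is called. An empty result means every sequence ID is unique.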