padlocbio / padloc

Locate antiviral defence systems in prokaryotic genomes
MIT License

Protein missing #44

Open oclaisse opened 1 month ago

oclaisse commented 1 month ago

Hello,

I want to use padloc-2.0.0. It works well with the --fna option, but when I use it with annotated proteins via the --faa and --gff options, I get this error:

ERROR >> 3 protein sequence IDs are missing from GFF file
Exécution arrêtée
[16:15:36] ERROR >> errexit on line 425

I have tried the Prodigal outputs from the --fna run and also files from a Bakta annotation without the sequence in the GFF file, but the result is the same.

Could you please help me solve this?

Best regards,
Olivier

leightonpayne commented 1 month ago

It sounds like there are proteins in your FAA file that do not have matching records in your GFF file. If you can attach those files in a comment here or email them to me, I can take a look at which entries are causing the issue (this is usually an easy fix).
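For a quick check before sharing files, the raw IDs in the two files can be compared directly. The following is only a rough sketch and not how padloc itself performs the check: it assumes each protein ID is the first token of the FASTA header and appears verbatim as an ID= attribute in column 9 of the GFF. The function names and the example file names are hypothetical.

```shell
# Sketch: list protein IDs found in a FASTA (.faa) file but not among the
# ID= attributes of a GFF file. Assumes exact string matches between the two.

faa_ids() {
    # First whitespace-delimited token of each header line, without the ">"
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

gff_ids() {
    # Value of the ID= attribute in column 9 of each feature line
    awk -F'\t' '!/^#/ && NF >= 9 {
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++)
            if (attrs[i] ~ /^ID=/) print substr(attrs[i], 4)
    }' "$1" | sort -u
}

missing_from_gff() {
    # comm -23 prints lines that occur only in the first sorted file
    tmp_faa=$(mktemp) && tmp_gff=$(mktemp) || return 1
    faa_ids "$1" > "$tmp_faa"
    gff_ids "$2" > "$tmp_gff"
    comm -23 "$tmp_faa" "$tmp_gff"
    rm -f "$tmp_faa" "$tmp_gff"
}

# Example (hypothetical file names):
# missing_from_gff genome.faa genome.gff
```

If this prints nothing, the two files agree on their raw IDs, which would suggest the mismatch arises later, when the GFF is parsed.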

leightonpayne commented 1 month ago

This is related to #8 (see the relevant comment in that thread):

"...any genes with the pseudo=True attribute get their IDs derived from the Name attribute—which overwrites your otherwise correct IDs here with whatever was in Name."

In your case, the entries causing the issue are those with the ID attributes AMLJAP_02445, AMLJAP_02450, and AMLJAP_08650.
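To see which records in a GFF file could be affected by this behaviour, one option is to list every entry carrying pseudo=True together with its ID and Name. This is a minimal sketch assuming standard GFF3 attribute syntax (semicolon-separated key=value pairs in column 9); pseudo_id_name is a hypothetical helper, not part of padloc.

```shell
# Sketch: print "ID<TAB>Name" for every GFF feature flagged pseudo=True.
pseudo_id_name() {
    awk -F'\t' '!/^#/ && NF >= 9 {
        id = ""; name = ""; pseudo = 0
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            if (attrs[i] ~ /^ID=/)              id = substr(attrs[i], 4)
            else if (attrs[i] ~ /^Name=/)       name = substr(attrs[i], 6)
            else if (attrs[i] == "pseudo=True") pseudo = 1
        }
        if (pseudo) print id "\t" name
    }' "$1"
}

# Example (hypothetical file name):
# pseudo_id_name genome.gff
```

Rows where the second column differs from the first are the ones whose IDs would be replaced by the Name value.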

You can use the same fix I posted in the thread above, by patching your version of padloc to remove the problematic code:

# Download patch
wget -O padloc.patch "https://github.com/padlocbio/padloc/files/13629886/padloc.patch"
# Find padloc script
padloc_src=$(which padloc.R)
# Apply patch
patch -u -b "${padloc_src}" padloc.patch

This saves a backup of the original code to ${padloc_src}.orig, so if you want to restore the original code later, overwrite the patched file with the backup:

mv "${padloc_src}.orig" "${padloc_src}"