padlocbio / padloc

Locate antiviral defence systems in prokaryotic genomes
MIT License
43 stars 9 forks source link

Multiple matches warning, matching GFF and FAA from prodigal predictions #26

Closed scubalaina closed 8 months ago

scubalaina commented 1 year ago

Hi there,

I've tried running padloc two ways on a genome called AG-538-A01_contigs that has 397 genes. 1) with my own prodigal protein predictions 2) using padloc's prodigal predictions

When I use my own prodigal predictions by running the following below, I get the error that 88 of the proteins are missing from the GFF file

padloc --faa AG-538-A01_contigs.faa --gff AG-538-A01_contigs.gff --outdir padloc_test
[10:12:09] >> Scanning AG-538-A01_contigs for defence system proteins
[10:12:55] >> Searching AG-538-A01_contigs for defence systems
Warning message:
In left_join(., hmm_meta, by = "hmm.name") :
  Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 1 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this warning.
[10:13:00 AM] ERROR >> 88 protein sequence IDs are missing from GFF file
Execution halted
[10:13:00] ERROR >> errexit on line 397

I thought that maybe there was a prodigal version issue (I'm using V2.6.3) that led to the GFF not matching the FAA in the way padloc needed, so I ran it with padloc's prodigal predictions:

padloc --fna AG-538-A01_contigs.fna --outdir full_padloc/
[10:20:46] >> Predicting protein-coding genes with prodigal
[10:20:46] >> Scanning AG-538-A01_contigs for defence system proteins
[10:21:32] >> Searching AG-538-A01_contigs for defence systems
Warning message:
In left_join(., hmm_meta, by = "hmm.name") :
  Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 1 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this warning.
[10:22:13 AM] >> Nothing found for AG-538-A01_contigs

There are hits to HMM profiles in the .domtblout (only handful though with bitscores over 100).

Given the warning from PADLOC, is the absence of detectable defenses due to poor hits or because of the parsing interruption? It shouldn't be unexpected that proteins will have multiple hits to a database, but is that how PADLOC is written? I'm happy to edit the parsing script to select a proteins best hit based on bit score myself, but if there are plans to update PADLOC for this anyway, I won't bother.

Also, why might running prodigal myself like this lead to GFF and FAA matching issues? (Trying to reduce computation time and space on my computer by not generating new ORF files every time I run a tool):

prodigal -i final_contigs_AG-538-A01_contigs.fasta -a AG-538-A01_contigs.faa -d AG-538-A01_contigs_orfs.fna -f gff -o AG-538-A01_contigs.gff

Thanks for your time and help!

scubalaina commented 1 year ago

Also, here are the input and output files when I ran it with my own prodigal predictions. (Had to use .txt ending to get github to allow the file upload) AG-538-A01_contigs.faa.txt

AG-538-A01_contigs.gff.txt

AG-538-A01_contigs.domtblout.txt

And here are the files when I ran it with PADLOC's predictions: AG-538-A01_contigs_padloc_prodigal.domtblout.txt AG-538-A01_contigs_padloc_prodigal.faa.txt AG-538-A01_contigs_padloc_prodigal.gff.txt

And here is the input genome: AG-538-A01_contigs.fna.txt

JacksonLab commented 1 year ago

Hello,

The files seem to run OK here. Are you using the --fix-prodigal option? The prodigal GFF/FAA ID formatting is different to some other GFF formats.

scubalaina commented 1 year ago

Hi there,

I ran it now with the --fix-prodigal option, but I still get this parsing error:

$ padloc --faa ../AG-538-A01_contigs.faa --gff ../AG-538-A01_contigs.gff --outdir padloc_test_again --fix-prodigal
[09:35:10] >> Scanning AG-538-A01_contigs for defence system proteins
[09:35:56] >> Searching AG-538-A01_contigs for defence systems
Warning message:
In left_join(., hmm_meta, by = "hmm.name") :
  Each row in `x` is expected to match at most 1 row in `y`.
  Row 1 of `x` matches multiple rows.
  If multiple matches are expected, set `multiple = "all"` to silence this warning.
[09:36:37 AM] >> Nothing found for AG-538-A01_contigs

It looks like this error happened because a single protein is hitting to multiple HMM families in the padloc database (which can happen), but the parser is not written to handle selecting best matches? I still get an HMM output. Is this the case or is there a different problem?

Thank you for your time and help!!

leightonpayne commented 1 year ago

Hi Alaina, thanks for the detailed info!

I'm pretty sure this is a warning introduced by a very recent update to the R package dplyr, as it's just a warning and padloc is otherwise proceeding, it should not affect your results.

You can double check everything is working on your end by running padloc on the genomes provided in padloc/test/. You will probably get a similar warning, but there should be systems identified and output generated.

I just ran:

padloc --faa test/GCF_001688665.2.faa --gff test/GCF_001688665.2.gff

And got the following warning message:

Warning message:
In left_join(., hmm_meta, by = "hmm.name") :
  Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 12 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this warning.

But the output GCF_001688665.2_padloc.csv was produced as expected.

I just updated all my conda packages, and these are the most recent versions of the R packages that padloc loads. I suspect you also are running dplyr_1.1.0:

other attached packages:
 [1] yaml_2.3.7      getopt_1.20.3   forcats_1.0.0   stringr_1.5.0
 [5] dplyr_1.1.0     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0
 [9] tibble_3.1.8    ggplot2_3.4.1   tidyverse_1.3.2 devtools_2.4.5
[13] usethis_2.1.6

loaded via a namespace (and not attached):
 [1] httr_1.4.4          pkgload_1.3.2       jsonlite_1.8.4
 [4] modelr_0.1.10       shiny_1.7.4         assertthat_0.2.1
 [7] googlesheets4_1.0.1 cellranger_1.1.0    remotes_2.4.2
[10] sessioninfo_1.2.2   pillar_1.8.1        backports_1.4.1
[13] glue_1.6.2          digest_0.6.31       promises_1.2.0.1
[16] rvest_1.0.3         colorspace_2.1-0    htmltools_0.5.4
[19] httpuv_1.6.8        pkgconfig_2.0.3     broom_1.0.3
[22] haven_2.5.1         xtable_1.8-4        scales_1.2.1
[25] processx_3.8.0      later_1.3.0         tzdb_0.3.0
[28] timechange_0.2.0    googledrive_2.0.0   generics_0.1.3
[31] ellipsis_0.3.2      cachem_1.0.6        withr_2.5.0
[34] cli_3.6.0           magrittr_2.0.3      crayon_1.5.2
[37] readxl_1.4.2        mime_0.12           memoise_2.0.1
[40] ps_1.7.2            fs_1.6.1            fansi_1.0.4
[43] xml2_1.3.3          pkgbuild_1.4.0      profvis_0.3.7
[46] tools_4.2.1         prettyunits_1.1.1   hms_1.1.2
[49] gargle_1.3.0        lifecycle_1.0.3     munsell_0.5.0
[52] reprex_2.0.2        callr_3.7.3         compiler_4.2.1
[55] rlang_1.0.6         grid_4.2.1          htmlwidgets_1.6.1
[58] miniUI_0.1.1.1      gtable_0.3.1        DBI_1.1.3
[61] R6_2.5.1            lubridate_1.9.1     fastmap_1.1.0
[64] utf8_1.2.3          stringi_1.7.12      Rcpp_1.0.10
[67] vctrs_0.5.2         dbplyr_2.3.0        tidyselect_1.2.0
[70] urlchecker_1.0.1

@JacksonLab it could be beneficial for us to enforce dependency versions at some point to avoid issues like this.

leightonpayne commented 8 months ago

Dependency versions now pinned in conda recipe for v2.0.0