padlocbio / padloc

Locate antiviral defence systems in prokaryotic genomes
MIT License

Error in read_tsv(., col_names = c("temp", "target.description"), comment = "#" #19

Closed htomelka closed 2 years ago

htomelka commented 2 years ago

Hi! I tried to run padloc with .faa and .gff files and got an error. Thinking the issue was with my files, I tried the test data and got the same error:

padloc --faa GCF_001688665.2.faa --gff GCF_001688665.2.gff --cpu 4

[09:35:59] >> Scanning GCF_001688665.2 for defence system proteins
[09:37:40] >> Searching GCF_001688665.2 for defence systems
Error in read_tsv(., col_names = c("temp", "target.description"), comment = "#", :
  unused argument (show_col_types = FALSE)
Calls: read_domtbl ... type_convert -> stopifnot -> is.data.frame -> separate
Execution halted
[09:37:47] ERROR >> errexit on line 397

It seems to be the same issue as #16.

If you have a solution, thanks!

JacksonLab commented 2 years ago

Hi,

The error says that "show_col_types = FALSE" is an unused argument, which means the installed readr does not recognise it. Please check and update your R packages. readr should be v2.0.0 or later.
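A minimal sketch of how the readr version could be checked from the shell before running padloc. This is not the dependency checking script mentioned in this thread. The helper name `version_ge` is hypothetical, and the sketch assumes a `sort` that supports `-V` (as in GNU coreutils) and that `Rscript` points at the same R installation padloc uses.

```shell
# Hypothetical helper (not part of padloc): succeed if version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Only query R if Rscript is on PATH; an empty result means readr is not installed.
if command -v Rscript >/dev/null 2>&1; then
  readr_version=$(Rscript -e 'cat(as.character(packageVersion("readr")))' 2>/dev/null || true)
  if [ -n "$readr_version" ] && version_ge "$readr_version" "2.0.0"; then
    echo "readr $readr_version is recent enough"
  else
    echo "readr is missing or older than 2.0.0; consider updating it in R" >&2
  fi
fi
```

If several R installations are present, it may be worth checking with `command -v Rscript` which one is picked up first, since the version reported here is only for that installation.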

Cheers,

Simon

@leightonpayne should we add the dependency checking script as an optional argument for padloc?

htomelka commented 2 years ago

Hi!

For other purposes, I was loading a version of R that was interfering with padloc. Problem solved!

Thank you!

htomelka commented 2 years ago

One issue solved, another appears...

I ran padloc on several files with this command: padloc --faa XXXX --gff XXXX --outdir XXXX --debug --cpu 8

All the files were generated the same way, but for some of them I get this error:

[10:19:13] DEBUG >> Reading 1397.gff
Error: Problem with `mutate()` column `ID`.
i `ID = ifelse(is.na(pseudo), ID, Name)`.
x object 'Name' not found
Backtrace:
    x
 1. +-gff %>% mutate(ID = ifelse(is.na(pseudo), ID, Name))
 2. +-dplyr::mutate(., ID = ifelse(is.na(pseudo), ID, Name))
 3. +-dplyr:::mutate.data.frame(., ID = ifelse(is.na(pseudo), ID, Name))
 4. | \-dplyr:::mutate_cols(.data, ..., caller_env = caller_env())
 5. |   +-base::withCallingHandlers(...)
 6. |   \-mask$eval_all_mutate(quo)
 7. +-base::ifelse(is.na(pseudo), ID, Name)
 8. \-base::.handleSimpleError(...)
 9.   \-dplyr:::h(simpleError(msg, call))
Execution halted
[10:19:14] ERROR >> errexit on line 397

The dplyr version is 1.0.7, and since the issue does not appear for all my files, I can't work out where the problem is...

I've attached files that trigger the issue. If you see the solution, thanks!

example.zip

JacksonLab commented 2 years ago

This seems to be an edge case related to how we deal with pseudogenes in PGAP-formatted RefSeq files. If the [pseudo] field is present in the gff attributes (e.g. "pseudo="), we take [Name] as the [ID]. In your case, [pseudo] is present without [Name] (hence the somewhat obscure error "object 'Name' not found"). I'm not familiar with the "MicroScope annotation platform" used to generate this gff, so the easiest solution I found was to remove every ";pseudo=None" from the supplied .gff, which solved the issue. I've made a note to add more informative error reporting for cases like this.
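The workaround described above could be sketched in the shell roughly as follows. The function name `strip_pseudo_none` is hypothetical, and the sketch assumes the attribute always appears exactly as ";pseudo=None" (with the leading semicolon). It writes a cleaned copy and leaves the original file untouched.

```shell
# Hypothetical helper (not part of padloc): copy a GFF file to a new path
# with every literal ";pseudo=None" removed from its lines.
# $1 = input .gff, $2 = output .gff
strip_pseudo_none() {
  sed 's/;pseudo=None//g' "$1" > "$2"
}

# Example call (file names are placeholders):
#   strip_pseudo_none 1397.gff 1397.clean.gff
```

The cleaned file can then be passed to padloc through `--gff` in place of the original.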

Cheers,

Simon

htomelka commented 2 years ago

Thank you for your help, everything works fine now!