gmbecker / genbankr

http://bioconductor.org/packages/devel/bioc/html/genbankr.html

Error importing GenBank file with readGenBank() #6

Closed lucaz88 closed 4 years ago

lucaz88 commented 4 years ago

Hi, I get this error when I try to import a Genbank file generated in RAST: Error in base::strsplit(x, ...) : non-character argument

I saw that it could be due to the presence of the Feature field 'exon', but I don't have any in my files. I am using genbankr version 1.12.0 on R 3.6.3.

osilander commented 4 years ago

Hi, I'm having the same issue: Error in base::strsplit(x, ...) : non-character argument

Versions: genbankr 1.12.0, R 3.6.0. File: PROKKA_03182020.gbk.gz

gmbecker commented 4 years ago

So in devel (R-devel 4.0.0, Bioc devel branch) I'm not getting that error with the linked file, but I'm seeing a different error for this GenBank file. It seems to happen because some characters in the Origin field are not recognized by DNAStringSet:

> res = readGenBank("~/Downloads/PROKKA_03182020.gbk")
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start,  : 
  key 79 (char 'O') not in lookup table

More investigation tells me O is not the only invalid character for DNAStrings:

Browse[2]> table(stuff[[1]])

      a       c       g       G       I       N       O       R       t 
1318835 1350348 1342632       7      14       7       7       7 1321240 

I am not a biologist. Can you tell me why these are in there? Is this RNA rather than DNA? Is it an error in the genbank file itself?

gmbecker commented 4 years ago

I just realized it's hitting the word ORIGIN for some reason. Continuing to investigate.
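The counts in the table above fit that reading: G, N, O and R appear 7 times each and I appears 14 times, which is what seven copies of the word ORIGIN would contribute. As a rough way to check this on a file, here is a minimal base R sketch (not genbankr's actual parser). The helper name `origin_char_table` and its `include_keyword` argument are made up for illustration; setting `include_keyword = TRUE` deliberately counts the ORIGIN line as if it were sequence.

```r
# Tabulate the characters found in the sequence block(s) of GenBank-format text.
# `lines` is a character vector such as the result of readLines().
# Hypothetical helper, for illustration only; this is not genbankr code.
origin_char_table <- function(lines, include_keyword = FALSE) {
  starts <- grep("^ORIGIN", lines)   # start of each sequence block
  ends   <- grep("^//", lines)       # record terminators
  chars  <- character(0)
  for (s in starts) {
    e <- ends[ends > s][1]                  # first terminator after this ORIGIN
    if (is.na(e)) e <- length(lines) + 1    # tolerate a missing terminator
    first <- if (include_keyword) s else s + 1
    if (first > e - 1) next                 # empty sequence block
    block <- lines[first:(e - 1)]
    block <- gsub("[0-9[:space:]]", "", block)  # drop position numbers and spaces
    chars <- c(chars, unlist(strsplit(block, "")))
  }
  table(chars)
}

toy <- c("LOCUS       1", "ORIGIN", "        1 acgtacgtac gtac", "//")
origin_char_table(toy)                          # only a, c, g, t
origin_char_table(toy, include_keyword = TRUE)  # also G, I, N, O, R
```

On a real file one would pass `readLines("PROKKA_03182020.gbk")`; any letters beyond the expected nucleotide codes in the first call would point at the file rather than at the parser.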

osilander commented 4 years ago

Shoot, sorry. Yes, I also ran into that error, but then used the solution from here: https://github.com/gmbecker/genbankr/issues/3 and then ran into the above issue.

gmbecker commented 4 years ago

Ok, so I have a fix locally that incorporates the multi-record fix and a fix for this, but the issue is, in a sense, with the file. The first few lines of the file are:

[1] "LOCUS       1                    4932770 bp    DNA     linear       18-MAR-2020"
[2] "DEFINITION  Genus species strain strain."                                       
[3] "ACCESSION   "                                                                   
[4] "VERSION"                                                                        
[5] "KEYWORDS    ."                                                                  
[6] "SOURCE      Genus species"       

So it has no accession or version information, which genbankr uses to identify/specify the genome. Is this intentional? The Definition, at least, seems obviously wrong (though present...)
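A quick way to spot this condition before calling readGenBank() is to scan the header for keywords that are present but carry no value. The following is a minimal base R sketch; the helper name `empty_header_fields` is hypothetical and not part of genbankr.

```r
# Report which header keywords are present but have nothing after them.
# Hypothetical helper, for illustration only; this is not genbankr code.
empty_header_fields <- function(lines, fields = c("ACCESSION", "VERSION")) {
  is_empty <- vapply(fields, function(f) {
    hit <- grep(paste0("^", f, "\\b"), lines, value = TRUE)
    # TRUE if at least one matching line has only whitespace after the keyword
    length(hit) > 0 && any(trimws(sub(paste0("^", f), "", hit)) == "")
  }, logical(1))
  fields[is_empty]
}

# The header lines quoted above
header <- c(
  "LOCUS       1                    4932770 bp    DNA     linear       18-MAR-2020",
  "DEFINITION  Genus species strain strain.",
  "ACCESSION   ",
  "VERSION",
  "KEYWORDS    .",
  "SOURCE      Genus species"
)
empty_header_fields(header)
```

For the header shown in this comment, both ACCESSION and VERSION are reported as empty.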

gmbecker commented 4 years ago

This is fixed in the commit I just made (package version 1.15.1), which should propagate through the devel build system in the next day or so. Please try with that (remember you must use the devel Bioc) and re-open if the problem is not fixed.

osilander commented 4 years ago

Thanks for this quick work. I'm having trouble figuring this out. Does this mean I need to use R 4.0?

gmbecker commented 4 years ago

Are you generating these files yourself? Is there a way to not have empty VERSION/ACCESSION fields? I would expect the package to work as is if those fields aren't empty.

Currently the fix is only in devel, which means yes, it requires R 4.0. I can think about patching it in the release branch, but I'd prefer some confirmation that the devel version is actually getting the right result before doing that, if possible.
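To see whether an installation already includes the fix, the installed version can be compared against 1.15.1, the version named earlier in this thread. A minimal sketch follows; the helper `has_fix` is hypothetical, and the commented-out lines are only a reminder of the usual BiocManager route to the devel branch, which is not run here.

```r
# Hypothetical helper: TRUE if a version string is at least the fixed version.
has_fix <- function(installed, fixed = "1.15.1") {
  package_version(installed) >= package_version(fixed)
}

has_fix("1.12.0")  # the release version reported in this thread
has_fix("1.15.1")  # the devel version containing the fix

# In a live session (not run here):
#   has_fix(as.character(packageVersion("genbankr")))
# Switching to Bioconductor devel is normally done with:
#   BiocManager::install(version = "devel")
#   BiocManager::install("genbankr")
```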

osilander commented 4 years ago

Generated by Prokka (https://github.com/tseemann/prokka). I can fix the genus and species and maybe keywords. I could also put in a fake accession. However, they're not on GenBank yet. Give me a few minutes, I'll have to reannotate.

osilander commented 4 years ago

I tried putting in fake accessions and versions, but that gave a similar error. I then removed the nucleotide seqs (which I don't need) and put in fake accessions and versions, and it worked. Apologies, I'm reluctant to install R 4.0 and am not confident in my ability to put the install into its own environment.
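For anyone wanting to script the same workaround, here is a minimal base R sketch of the two edits described: filling empty ACCESSION/VERSION lines with placeholders and dropping the sequence lines. It is an assumption about how the edit might be done, not the exact procedure used above, and it has not been checked against genbankr. The helper name `patch_genbank_lines` and the placeholder values are invented for illustration, and every record in a multi-record file receives the same placeholder.

```r
# Fill empty ACCESSION/VERSION lines and drop sequence lines between
# ORIGIN and the "//" terminator (both marker lines are kept).
# Hypothetical helper with made-up placeholder values; not genbankr code.
patch_genbank_lines <- function(lines,
                                accession = "FAKE000001",
                                version   = "FAKE000001.1") {
  lines <- sub("^ACCESSION[[:space:]]*$", sprintf("ACCESSION   %s", accession), lines)
  lines <- sub("^VERSION[[:space:]]*$",   sprintf("VERSION     %s", version), lines)
  in_seq <- logical(length(lines))
  inside <- FALSE
  for (i in seq_along(lines)) {
    if (grepl("^//", lines[i])) inside <- FALSE     # terminator ends the block
    in_seq[i] <- inside
    if (grepl("^ORIGIN", lines[i])) inside <- TRUE  # following lines are sequence
  }
  lines[!in_seq]
}

toy <- c("LOCUS       1", "ACCESSION   ", "VERSION",
         "ORIGIN", "        1 acgtacgtac", "//")
patched <- patch_genbank_lines(toy)
patched
# For a real file: writeLines(patch_genbank_lines(readLines(infile)), outfile)
```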

There are still some errors, but several are expected as it's bacteria:

No exons read from genbank file. Assuming sections of CDS are full exons
No transcript features (mRNA) found, using spans of CDSs
Warning message:
In fill_stack_df(rawcdss, sqinfo = sqinfo) :
  Got unexpected multi-value field(s) [ inference ]. The resulting column(s) will be of class CharacterList, rather than vector(s). Please contact the maintainer if multi-valuedness is expected/meaningful for the listed field(s).