gmbecker / genbankr

http://bioconductor.org/packages/devel/bioc/html/genbankr.html

Errors parsing plastid genbank records #8

Open nilsj9 opened 4 years ago

nilsj9 commented 4 years ago

Hi @gmbecker, I am currently trying to parse a set of plastid genome records using genbankr. I keep running into recurring error messages and wonder whether they are caused by a bug in genbankr or by incorrectly formatted GenBank flat files. Below are three frequent error messages:

genbankr::readGenBank(genbankr::GBAccession("NC_033333"))
Error in `[[<-`(`*tmp*`, name, value = c("BWX36_gp082.1", "BWX36_gp082.1",  : 
  28 elements in value to replace 44 elements

genbankr::readGenBank(genbankr::GBAccession("NC_029719"))
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") : 
  In range 13: at least two out of 'start', 'end', and 'width', must
  be supplied.
In addition: Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion

genbankr::readGenBank(genbankr::GBAccession("NC_017894"))
Error : subscript contains NAs
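
When many accessions are parsed in one run, it can help to record the error for each record instead of stopping at the first failure. The following is a minimal sketch of that idea, not part of genbankr: `collect_parse_results` and `failure_table` are hypothetical helper names, and the parser is passed in as a function so the sketch itself does not depend on genbankr or on network access. Warnings (such as the "NAs introduced by coercion" ones above) are not captured by this sketch.

```r
# Hypothetical helper (not part of genbankr): apply `parser` to each accession
# and capture the error message instead of aborting the whole batch.
collect_parse_results <- function(accessions, parser) {
  out <- lapply(accessions, function(acc) {
    tryCatch(
      list(accession = acc, ok = TRUE, value = parser(acc),
           error = NA_character_),
      error = function(e) {
        list(accession = acc, ok = FALSE, value = NULL,
             error = conditionMessage(e))
      }
    )
  })
  names(out) <- accessions
  out
}

# Hypothetical helper: one row per failed accession with its error message.
failure_table <- function(results) {
  failed <- Filter(function(r) !r$ok, results)
  data.frame(
    accession = vapply(failed, function(r) r$accession, character(1)),
    error = vapply(failed, function(r) r$error, character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Intended use with the calls from this report (not run here; it needs
# genbankr and network access):
# results <- collect_parse_results(
#   c("NC_033333", "NC_029719", "NC_017894"),
#   function(acc) genbankr::readGenBank(genbankr::GBAccession(acc))
# )
# failure_table(results)
```

Grouping the resulting table by error message would show how many records fall into each of the three failure types.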

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                  lattice_0.20-41             prettyunits_1.1.1           Rsamtools_2.4.0             Biostrings_2.56.0          
 [6] assertthat_0.2.1            digest_0.6.25               BiocFileCache_1.12.0        R6_2.4.1                    GenomeInfoDb_1.24.2        
[11] stats4_4.0.2                RSQLite_2.2.0               httr_1.4.2                  pillar_1.4.6                zlibbioc_1.34.0            
[16] rlang_0.4.7                 GenomicFeatures_1.40.1      progress_1.2.2              curl_4.3                    rentrez_1.2.2              
[21] blob_1.2.1                  S4Vectors_0.26.1            Matrix_1.2-18               BiocParallel_1.22.0         stringr_1.4.0              
[26] RCurl_1.98-1.2              bit_1.1-15.2                biomaRt_2.44.1              DelayedArray_0.14.1         compiler_4.0.2             
[31] rtracklayer_1.48.0          pkgconfig_2.0.3             askpass_1.1                 BiocGenerics_0.34.0         openssl_1.4.2              
[36] tidyselect_1.1.0            SummarizedExperiment_1.18.2 tibble_3.0.3                GenomeInfoDbData_1.2.3      IRanges_2.22.2             
[41] matrixStats_0.56.0          XML_3.99-0.5                crayon_1.3.4                dplyr_1.0.0                 dbplyr_1.4.4               
[46] GenomicAlignments_1.24.0    bitops_1.0-6                rappdirs_0.3.1              grid_4.0.2                  jsonlite_1.7.0             
[51] lifecycle_0.2.0             DBI_1.1.0                   magrittr_1.5                stringi_1.4.6               XVector_0.28.0             
[56] ellipsis_0.3.1              generics_0.0.2              vctrs_0.3.2                 tools_4.0.2                 bit64_0.9-7                
[61] BSgenome_1.56.0             Biobase_2.48.0              glue_1.4.1                  purrr_0.3.4                 hms_0.5.3                  
[66] parallel_4.0.2              AnnotationDbi_1.50.1        GenomicRanges_1.40.0        memoise_1.1.0               genbankr_1.16.0            
[71] VariantAnnotation_1.34.0 

I would be very grateful if you could help me fix these problems. Thank you in advance and best wishes.

kathooks commented 2 years ago

Hi @gmbecker , hi @nilsj9

I run into the first of these errors with a number of human RefSeq identifiers, e.g.:

genbankr::readGenBank(genbankr::GBAccession("NM_000494"))
Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
Annotations don't have 'locus_tag' label, using 'gene' as gene_id column
 Error in `[[<-`(`*tmp*`, name, value = c("COL17A1.1", "COL17A1.1", "COL17A1.1",  : 
  53 elements in value to replace 56 elements

The error originates from line 873 of genbankReader.R. Parsing works when that line is replaced with:

exns$transcript_id = cdss$transcript_id
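
The message "53 elements in value to replace 56 elements" reports a value vector that is shorter than the column it is being assigned to. A minimal base-R sketch of that general pattern follows. It uses a plain `data.frame` with made-up data rather than genbankr's internal objects, so the wording of the error differs from the one above, and `assign_column` is a hypothetical guard, not a genbankr function.

```r
# Hypothetical stand-ins for an exon table and the ids to attach;
# these are plain base-R objects, not genbankr's internal ones.
exons <- data.frame(start = c(1L, 50L, 120L, 300L))
ids <- c("tx1", "tx1", "tx1")  # one element short of nrow(exons)

# Assigning a column whose length does not fit the row count fails;
# the error message is captured here instead of stopping the script.
msg <- tryCatch({
  exons$transcript_id <- ids
  NA_character_
}, error = function(e) conditionMessage(e))

# Hypothetical guard: fail early with both lengths spelled out.
assign_column <- function(df, name, value) {
  if (length(value) != nrow(df)) {
    stop(sprintf("cannot assign '%s': %d values for %d rows",
                 name, length(value), nrow(df)))
  }
  df[[name]] <- value
  df
}
```

A check like this would not fix the parser, but it makes the mismatch easier to locate, because base R silently recycles a shorter vector whenever the row count happens to be a multiple of its length.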