ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

Incorrect Data Frame Column Names for Molgula manhattensis #104

Closed Jotanator closed 2 weeks ago

Jotanator commented 3 months ago

While using the BOLD API through this package (latest stable version) to search for a list of species, we noticed that one of them produced errors. At first it looked like columns were missing from the data frame returned by the BOLD API. On closer inspection, however, no columns or data are missing; the problem lies in the naming of the data frame's columns.

Normally, when requesting a species such as Gallus gallus with the bold_seqspec function, we get the following information:

records_bold <- bold_seqspec(taxon = "Gallus gallus")

Screenshot 2024-03-22 at 4 36 26 PM

However, when searching Molgula manhattensis we get the following:

records_bold_error <- bold_seqspec(taxon = "Molgula manhattensis")

Screenshot 2024-03-22 at 4 39 33 PM

Notice that all the columns are named incorrectly. For some reason, the column names appear to have been taken from the values of the first Molgula manhattensis entry in BOLD.
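A quick way to spot this symptom when looping over a long species list is to check that the expected header fields are present in the result. The sketch below is only an illustration: `has_bold_header` is a hypothetical helper, not a function from the bold package, and the two toy data frames stand in for real `bold_seqspec()` results so the example runs offline.

```r
# Hypothetical helper (not part of the bold package): returns TRUE when the
# data frame carries the header fields expected from a BOLD record table.
has_bold_header <- function(df,
                            required = c("processid", "species_name", "nucleotides")) {
  all(required %in% names(df))
}

# Toy stand-in for a well-formed result.
good <- data.frame(
  processid = "ABC001-20",
  species_name = "Gallus gallus",
  nucleotides = "ACGT"
)

# Toy stand-in for the broken case: record values ended up as column names.
bad <- data.frame(
  "ABC002-20" = "ABC003-20",
  "Molgula manhattensis" = "Molgula manhattensis",
  check.names = FALSE
)

has_bold_header(good)
has_bold_header(bad)
```

Any species for which the check fails can then be set aside and reported.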

Session Info

```r
R version 4.2.3 (2023-03-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 14.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] zip_2.3.1 treemapify_2.5.5 plotly_4.10.2 bold_1.3.0 mpoly_1.1.1 ipc_0.1.4 promises_1.2.1
 [8] future_1.33.0 rlist_0.4.6.2 RSQLite_2.3.1 taxize_0.9.100 rentrez_1.2.3 shinyBS_0.61.1 modules_0.12.0
[15] shinyalert_3.0.0 shinydashboard_0.7.2 vembedr_0.1.5 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
[22] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0 shinyWidgets_0.8.0
[29] shinycssloaders_1.0.0 shinyjs_2.1.0 shiny_1.7.5

loaded via a namespace (and not attached):
 [1] colorspace_2.1-0 ellipsis_0.3.2 httpcode_0.3.0 rstudioapi_0.15.0 listenv_0.9.0 urltools_1.7.3 ggfittext_0.10.1
 [8] DT_0.29 bit64_4.0.5 fansi_1.0.4 mathjaxr_1.6-0 xml2_1.3.5 codetools_0.2-19 partitions_1.10-7
[15] cachem_1.0.8 polynom_1.4-1 jsonlite_1.8.7 compiler_4.2.3 httr_1.4.7 backports_1.4.1 fastmap_1.1.1
[22] lazyeval_0.2.2 cli_3.6.1 later_1.3.1 htmltools_0.5.6 tools_4.2.3 gmp_0.7-2 gtable_0.3.4
[29] glue_1.6.2 Rcpp_1.0.11 jquerylib_0.1.4 vctrs_0.6.3 crul_1.4.0 ape_5.7-1 nlme_3.1-162
[36] conditionz_0.1.0 iterators_1.0.14 crosstalk_1.2.0 globals_0.16.2 rbibutils_2.2.15 timechange_0.2.0 mime_0.12
[43] lifecycle_1.0.3 XML_3.99-0.14 zoo_1.8-12 scales_1.2.1 hms_1.1.3 parallel_4.2.3 yaml_2.3.7
[50] curl_5.0.2 memoise_2.0.1 sass_0.4.7 triebeard_0.4.1 stringi_1.7.12 foreach_1.5.2 orthopolynom_1.0-6.1
[57] filelock_1.0.2 Rdpack_2.5 rlang_1.1.1 pkgconfig_2.0.3 lattice_0.20-45 fontawesome_0.5.2 htmlwidgets_1.6.2
[64] bit_4.0.5 tidyselect_1.2.0 parallelly_1.36.0 plyr_1.8.8 magrittr_2.0.3 R6_2.5.1 generics_0.1.3
[71] base64url_1.4 txtq_0.2.4 DBI_1.1.3 pillar_1.9.0 withr_2.5.0 crayon_1.5.2 uuid_1.1-1
[78] utf8_1.2.3 tzdb_0.4.0 grid_4.2.3 data.table_1.14.8 blob_1.2.4 digest_0.6.33 xtable_1.8-4
[85] httpuv_1.6.11 munsell_0.5.0 viridisLite_0.4.2 bslib_0.5.1
```

salix-d commented 3 months ago

That is what it does. Well, actually, it's the information from the second entry, on line 5 of the TSV returned by the BOLD API. The first entry has return characters in the 'copyright_licenses' field that mess up the format.

> records_bold_error <- bold_seqspec(taxon = "Molgula manhattensis", response = TRUE)
> tmp <- records_bold_error$content |> rawToChar() |> stringi::stri_split_lines1()
> stringi::stri_count_regex(tmp, "\t")
 [1] 79 65  0 14 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79
[32] 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79 79
> tmp[1:4]
[1] "processid\tsampleid\trecordID\tcatalognum\tfieldnum\tinstitution_storing\tcollection_code\tbin_uri\tphylum_taxID\tphylum_name\tclass_taxID\tclass_name\torder_taxID\torder_name\tfamily_taxID\tfamily_name\tsubfamily_taxID\tsubfamily_name\tgenus_taxID\tgenus_name\tspecies_taxID\tspecies_name\tsubspecies_taxID\tsubspecies_name\tidentification_provided_by\tidentification_method\tidentification_reference\ttax_note\tvoucher_status\ttissue_type\tcollection_event_id\tcollectors\tcollectiondate_start\tcollectiondate_end\tcollectiontime\tcollection_note\tsite_code\tsampling_protocol\tlifestage\tsex\treproduction\thabitat\tassociated_specimens\tassociated_taxa\textrainfo\tnotes\tlat\tlon\tcoord_source\tcoord_accuracy\telev\tdepth\telev_accuracy\tdepth_accuracy\tcountry\tprovince_state\tregion\tsector\texactsite\timage_ids\timage_urls\tmedia_descriptors\tcaptions\tcopyright_holders\tcopyright_years\tcopyright_licenses\tcopyright_institutions\tphotographers\tsequenceID\tmarkercode\tgenbank_accession\tnucleotides\ttrace_ids\ttrace_names\ttrace_links\trun_dates\tsequencing_centers\tdirections\tseq_primers\tmarker_codes"
[2] "BNSB097-21\tBNSB0097\t14077558\t\tW121_CU\tDeutsches Zentrum fuer Marine Biodiversitaetsforschung\t\tBOLD:ACB4470\t18\tChordata\t61\tAscidiacea\t232\tStolidobranchia\t101156\tMolgulidae\t\t\t210801\tMolgula\t505893\tMolgula manhattensis\t\t\tWiebke Stamerjohanns\tMorphology, Barcoding\tDeKay, 1843\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t54.125\t8.855\t\t\t\t\t\t\tGermany\t\tBuesum\t\tTaoro boat\t7333897|7333898\thttp://www.boldsystems.org/pics/BNSB/W121-12.01.21+1610474376.jpg|http://www.boldsystems.org/pics/BNSB/W121-12.01.21-1+1610474270.jpg\tOverview|Overview\t|\tWiebke Stamerjohanns|Wiebke Stamerjohanns\t2022|2022\tCreativeCommons \x96 Attribution"                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
[3] "Non-Commercial Share-Alike|CreativeCommons \x96 Attribution"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
[4] "Non-Commercial Share-Alike\tGerman Centre for Marine Biodiversity Research, Senckenberg am Meer|German Centre for Marine Biodiversity Research, Senckenberg am Meer\tWiebke Stamerjohanns|Wiebke Stamerjohanns\t15078645\tCOI-5P\t\tTACTTTATATTTTATTTTTGGTACATTCGCTGCATTAATTGGTTCCGCTTTGAGTGGAGTTTTGCGGTTAGAATTATCCCAAACAGGAGTTGTTATAATAAATAGCAATATGTATAATATAGTTATTACCTCTCATGCTTTAGTTATAATTTTTTTTTTTGTAATACCTATTACAATAAGGAGATTTGGGAATTGGCTAATTCCTCTTTTTATGAGATGTCCTGATATGGCTTTTCCTCGTATAAATAATTTTTCTTTTTGGTTACTTCCTTTTTCTTTTAGTTTATTATTACTTAGTGGTTTTATGAATATGAGAGTTGGGGCAGGGTGGACCATTTACCCTCCTCTATCTTCTATTTTGAGACATCCTAGAATTCAGATGGATTTTGCTATTTTTAGTCTACATTTGGCTAGAATTAGTAGTATTCTTTCTTCTATTAATTTTATAGTAACCATTTTAAATATATCTCCTAAAGGAATAAAAATTTTTCATTTATCTTTAATAATGTGAAGTATTTTTATTACAGCTGTTTTACTTTTATTATCATTACCAGTATTGGCTGGGGCCATTACTATGTTATTATTTGATCGTAATATTAATACTATGTTTTTTGATCCTGCAGGAGGGGGAGATCCAATCTTATTCCAACATCTCTTT\t\t\t\t\t\t\t\t" 

I'll notify BOLD of this error. I know they are working on a new API, so I don't know if they'll fix it on this one.

I might be able to code a check to detect and fix those though.
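As a rough illustration of what such a check could look like (a minimal sketch only, not the fix implemented in the package): the header line fixes the expected number of tabs per record, so consecutive physical lines can be glued back together until that count is reached. `repair_split_records` is a hypothetical name, the toy input is made up, and the sketch assumes the stray line breaks never occur in the header itself.

```r
# Hypothetical helper (not part of the bold package): re-join TSV records
# that were split by stray line breaks inside a field.
repair_split_records <- function(lines) {
  # Count the tabs in each physical line (base R only).
  n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))
  expected <- n_tabs[1]  # assumption: the header line is intact
  out <- character(0)
  buffer <- NULL
  count <- 0L
  for (i in seq_along(lines)) {
    if (is.null(buffer)) {
      buffer <- lines[i]
      count <- n_tabs[i]
    } else {
      # Glue the fragment onto the pending record, using a space
      # in place of the stray line break.
      buffer <- paste(buffer, lines[i])
      count <- count + n_tabs[i]
    }
    if (count >= expected) {
      out <- c(out, buffer)
      buffer <- NULL
    }
  }
  # Keep a trailing incomplete record rather than silently dropping it.
  if (!is.null(buffer)) out <- c(out, buffer)
  out
}

# Toy input: the first record is split across two physical lines.
raw_lines <- c(
  "processid\tlicense\tmarker",
  "A1\tCC Attribution",
  "Non-Commercial\tCOI-5P",
  "A2\tCC0\tCOI-5P"
)
fixed <- repair_split_records(raw_lines)
df <- read.delim(text = fixed, sep = "\t", quote = "", stringsAsFactors = FALSE)
```

With the tab counts shown above (79, 65, 0, 14, 79, ...), lines 2 to 4 add up to 79 tabs, so this approach would merge them into a single record and leave the remaining lines untouched.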

salix-d commented 3 months ago

Do you have other species names that return this error?

Jotanator commented 3 months ago

I don't have any others yet but I will let you know if I find any. I have a huge list of species at hand and that was one of them.

paulapappalardo commented 2 weeks ago

Hi! I just ran into this same error with the names Jassa slatteryi and Molgula manhattensis, and came here to see if someone else had seen this issue.

salix-d commented 2 weeks ago

Hi @paulapappalardo and @Jotanator

Could one of you (or both) install the '104-incorrect-data-frame-column-names-for-molgula-manhattensis' branch to test if the fix I tried works for you?

remotes::install_github("ropensci/bold@104-incorrect-data-frame-column-names-for-molgula-manhattensis")

If so, I'll push the change to master! BOLD didn't get back to me, so for now I'll have to work around their API issues.

paulapappalardo commented 2 weeks ago

Done, and it works! I tested it on the two species that tripped it for me, Jassa slatteryi and Molgula manhattensis (the one you wrote the fix for). Thank you for the quick reply and great job with the fix 🙂

salix-d commented 2 weeks ago

Thanks for testing 🙂