ropensci / bold

Interface to the Bold Systems barcode webservice
https://docs.ropensci.org/bold

bold_seq() does not return all COI-5P sequences for genus Homo and species Homo sapiens #90

Closed jphill01 closed 1 year ago

jphill01 commented 1 year ago

I want to download all Homo sapiens COI-5P data from BOLD.

bold_stats("Homo sapiens") indicates there are 48417 such records in BOLD. However, only 48411 are returned by bold_seq("Homo sapiens", "COI-5P").

Is this an issue with the large data request note indicated in the function documentation?

I also tried the genus Homo, which comprises 48449 records, but only 48443 are retrieved using bold_seq().

Session Info

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bold_1.2.0

loaded via a namespace (and not attached):
 [1] compiler_4.2.1  magrittr_2.0.3  plyr_1.8.7      R6_2.5.1
 [5] tools_4.2.1     httpcode_0.3.0  curl_4.3.2      urltools_1.7.3
 [9] Rcpp_1.0.9      triebeard_0.3.0 xml2_1.3.3      stringi_1.7.8
[13] reshape_0.8.9   crul_1.3        stringr_1.4.0   jsonlite_1.8.0

salix-d commented 1 year ago

Hi,

The reason is that BOLD has both private and public records. Their API returns all records with the stats function, but you can check specifically for public COI-5P records with bold_tax_id("12439", dataTypes = "stats") (you can get the tax id with bold_tax_name("Homo sapiens")), and it will show that there are 48411 public records.

> bold::bold_tax_name("Homo sapiens")
  taxid        taxon tax_rank tax_division parentid parentname     taxonrep specimenrecords representitive_image.image
1 12439 Homo sapiens  species     Animalia     4523       Homo Homo sapiens           48704 BIOP/hebert+1354300515.jpg
  representitive_image.apectratio        input
1                           1.502 Homo sapiens
> bold::bold_tax_id("12439", dataTypes = "stats")
  input publicmarkersequences.COI.5P publicmarkersequences.atp6 publicmarkersequences.COI.3P publicmarkersequences.COII
1 12439                        48411                       2095                            4                       2095
  publicmarkersequences.COI.PSEUDO publicmarkersequences.COXIII publicmarkersequences.CYTB publicmarkersequences.D.loop publicmarkersequences.ND1
1                                1                         2096                       2096                         2095                         1
  publicmarkersequences.ND2 publicmarkersequences.ND3 publicmarkersequences.ND4 publicmarkersequences.ND4L publicmarkersequences.ND5.0
1                         1                         1                         1                          1                           1
  publicmarkersequences.ND6 publicrecords publicspecies publicsubspecies publicbins specimenrecords sequencedspecimens barcodespecimens species
1                         1         48417             1                1          1           48704              59069            47876       1
  barcodespecies
1              1
salix-d commented 1 year ago

I could make a note of that in the docs, that the bold_stats function includes private records.

jphill01 commented 1 year ago

I think that would help dispel any confusion.

salix-d commented 1 year ago

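Revisiting this issue, I just realised my explanation wasn't correct. As you can see in the code block of my previous reply (copied below), there are indeed 48417 public records (publicrecords); it's just that not all of them are COI-5P. Only 48411 have a public COI-5P sequence (publicmarkersequences.COI.5P), which matches what bold_seq() returns.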

> bold::bold_tax_name("Homo sapiens")
  taxid        taxon tax_rank tax_division parentid parentname     taxonrep specimenrecords representitive_image.image
1 12439 Homo sapiens  species     Animalia     4523       Homo Homo sapiens           48704 BIOP/hebert+1354300515.jpg
  representitive_image.apectratio        input
1                           1.502 Homo sapiens
> bold::bold_tax_id("12439", dataTypes = "stats")
  input publicmarkersequences.COI.5P publicmarkersequences.atp6 publicmarkersequences.COI.3P publicmarkersequences.COII
1 12439                        48411                       2095                            4                       2095
  publicmarkersequences.COI.PSEUDO publicmarkersequences.COXIII publicmarkersequences.CYTB publicmarkersequences.D.loop publicmarkersequences.ND1
1                                1                         2096                       2096                         2095                         1
  publicmarkersequences.ND2 publicmarkersequences.ND3 publicmarkersequences.ND4 publicmarkersequences.ND4L publicmarkersequences.ND5.0
1                         1                         1                         1                          1                           1
  publicmarkersequences.ND6 publicrecords publicspecies publicsubspecies publicbins specimenrecords sequencedspecimens barcodespecimens species
1                         1         48417             1                1          1           48704              59069            47876       1
  barcodespecies
1              1

The same holds for the taxon "Homo":

> bold_tax_id2("4523", dataTypes = "stats")
  input publicrecords publicspecies publicsubspecies publicbins specimenrecords sequencedspecimens
1  4523         48455             4                1          1           48743              59141
  barcodespecimens species barcodespecies publicmarkersequences.COI.5P publicmarkersequences.atp6
1            47912       4              4                        48449                       2096
  publicmarkersequences.COI.3P publicmarkersequences.COII publicmarkersequences.COI.PSEUDO
1                            4                       2099                                1
  publicmarkersequences.COXIII publicmarkersequences.CYTB publicmarkersequences.D.loop
1                         2100                       2100                         2095
  publicmarkersequences.ND1 publicmarkersequences.ND2 publicmarkersequences.ND3
1                         4                         4                         4
  publicmarkersequences.ND4 publicmarkersequences.ND4L publicmarkersequences.ND5.0
1                         4                          4                           4
  publicmarkersequences.ND6
1                         4

That gives 48455 public records, of which 48449 are COI-5P records.
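
To make the gap explicit, here is a minimal sketch that subtracts the public COI-5P count from the public record count for both taxa. The numbers are copied from the stats output in this thread rather than fetched live, and non_coi5p() is a hypothetical helper, not part of the bold package. It also assumes each record is counted at most once under COI-5P.

```r
# Values copied from the stats output shown in this thread (not fetched live).
stats <- data.frame(
  input = c("12439", "4523"),
  taxon = c("Homo sapiens", "Homo"),
  publicrecords = c(48417, 48455),
  publicmarkersequences.COI.5P = c(48411, 48449)
)

# Hypothetical helper: public records that have no public COI-5P sequence.
# Assumes each record contributes at most one COI-5P sequence to the count.
non_coi5p <- function(stats) {
  stats$publicrecords - stats$publicmarkersequences.COI.5P
}

stats$without_COI5P <- non_coi5p(stats)
print(stats[, c("taxon", "publicrecords",
                "publicmarkersequences.COI.5P", "without_COI5P")])

# With network access, the same columns could be read from a live call, e.g.:
# live <- bold::bold_tax_id("12439", dataTypes = "stats")
# non_coi5p(live)
```

For both taxa the difference is 6, the same size as the shortfall reported at the top of the issue.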