ropensci / rfishbase

R interface to the fishbase.org database
https://docs.ropensci.org/rfishbase
111 stars 40 forks source link

Missing diet values - Data base update? #281

Open smorsbach opened 1 month ago

smorsbach commented 1 month ago
Session Info ```r > sessionInfo() R version 4.4.1 (2024-06-14) Platform: aarch64-apple-darwin20 Running under: macOS Sonoma 14.5 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: Europe/Berlin tzcode source: internal attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rglobi_0.3.4 rfishbase_4.1.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 loaded via a namespace (and not attached): [1] utf8_1.2.4 generics_0.1.3 bitops_1.0-8 stringi_1.8.4 hms_1.1.3 digest_0.6.36 magrittr_2.0.3 evaluate_0.24.0 grid_4.4.1 timechange_0.3.0 contentid_0.0.18 fastmap_1.2.0 jsonlite_1.8.8 progress_1.2.3 [15] DBI_1.2.3 httr_1.4.7 fansi_1.0.6 scales_1.3.0 duckdb_1.0.0-2 cli_3.6.3 rlang_1.1.4 crayon_1.5.3 dbplyr_2.5.0 bit64_4.0.5 munsell_0.5.1 cachem_1.1.0 remotes_2.5.0 withr_3.0.1 [29] yaml_2.3.8 parallel_4.4.1 tools_4.4.1 tzdb_0.4.0 memoise_2.0.1 colorspace_2.1-0 curl_5.2.1 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 bit_4.0.5 fs_1.6.4 vroom_1.6.5 pkgconfig_2.0.3 [43] pillar_1.9.0 gtable_0.3.5 glue_1.7.0 xfun_0.45 tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.47 htmltools_0.5.8.1 rmarkdown_2.27 compiler_4.4.1 prettyunits_1.2.0 askpass_1.2.0 openssl_2.2.0 RCurl_1.98-1.16 ```

Hello!

I am right now trying to retrieve diet information tables for dozens of species from SeaLifeBase. However, many of these don't contain data that can be found online. I am particularly interested in the PercentEmpty column, which is returned when running the diet() command. Please find an example below; this is one of the cases where the mentioned column is empty, despite containing data in the online version.

Code Example ```r > diet(species_list = "Larus argentatus", server = "sealifebase") %>% select(Species, DietRefNo, PercentEmpty) Joining with `by = join_by(SpecCode)` # A tibble: 7 × 3 Species DietRefNo PercentEmpty 1 Larus argentatus 4276 NA 2 Larus argentatus 4276 NA 3 Larus argentatus 4276 NA 4 Larus argentatus 6390 NA 5 Larus argentatus 6515 NA 6 Larus argentatus 6570 NA 7 Larus argentatus 6587 NA ```

The latest release is from May 2023, could it have something to do with that? And if yes, are there any updates planned in the near future? Or is this command simply faulty when running it on SeaLifeBase? However, when running > fb_tbl("diet", server="sealifebase"), there are entries with values in that column.

Any help would be much appreciated. Thanks in advance!

cboettig commented 1 month ago

yup, working on preparing the new release. should be up soon

cboettig commented 1 month ago

Package plumbing still in the works, but can you try:

diet = arrow::read_parquet("https://fishbase.ropensci.org/sealifebase/diet.parquet")

This directly accesses our API and should provide the current data (July 2024 release) for the sealifebase diet table.

smorsbach commented 1 month ago

Thanks for the quick answer!

I noticed something else though and I think this issue has been raised in the past (#87), namely the combination of information from the diet() and the diet_items(). Is there a way to automatically include the more detailed diet composition data from diet_items() in the diet table? It took me a long time to figure out how to get from one to the other before realizing that you can simply join them via the DietCode column and I think it would be really helpful to just combine the two into one. No information is lost and it makes the structure of the dozens of tables a bit more tidy. For anyone encountering the same issue, I added my code below.

Table Join ```r # pull all diet items from SeaLifeBase diet_items <- fb_tbl(tbl = "diet_items", server = "sealifebase") # get diet for single or list of species; here harbour porpoise diet_species <- diet(species_list = "Phocoena phocoena", server = "sealifebase") # join tables and select columns diet_species_detail <- diet_species %>% left_join(diet_items, by = "DietCode") %>% select(Species, SampleStage, DietCode, C_Code, Locality, January, February, March, April, May, June, July, August, September, October, November, December, E_Code, FoodI, FoodII, FoodIII, ItemName, Stage, DietPercent) ```

I think in issue #87 an attempt was made to do that, but right now that leads to the diet table displaying wrong or incomplete values in the PercentEmpty and OtherItems columns. I tried to include an example for Atlantic cod below and hope that this clarifies what I mean. Maybe I also misunderstood what these two columns should contain. Also, the diet() table per se does not contain any taxonomic column on the actual food item. In the online version, there is a column called "Main Food" which is missing in the R version. That emphasizes the benefit of merging diet() and diet_items() into one.

Example ```r # pull all diet items from FishBase diet_items <- diet_items() # get diet for cod diet_species <- diet(species_list = "Gadus morhua", server = "fishbase") # join tables diet_species_detail <- diet_species %>% left_join(diet_items, by = "DietCode") %>% select(Species, SampleStage, DietCode, FoodI, FoodII, FoodIII, ItemName, Stage, DietPercent, OtherItems, PercentEmpty) %>% filter(DietCode == 14) # PercentEmpty and OtherItems contain low values which can neither be found # in the detailed diet information nor in the current table version online ```

The code chunk above brings me to my last point, namely the diet_items() command. As for now, there is no option to specify whether the data should be downloaded from FishBase or SeaLifeBase, it automatically just uses FishBase. It is possible to get the diet_items() from SeaLifeBase by using fb_tbl("diet_items", "sealifebase"), but maybe add the server option to diet_items() as well?

Really appreciate you work and thanks in advance!

cboettig commented 1 month ago

Thanks for sharing your experience and examples, I am sure this will be helpful to other users as well.

At the moment I recommend using the fb_tbl function for everything, e.g. fb_tbl(tbl = "diet_items", server = "sealifebase"). This function covers all ~200+ tables from either fishbase or sealifebase explicitly, and avoids the potential confusion of setting the 'server' (which is a deprecated syntax that dates back to an entirely different function architecture under the hood over a decade ago, when we could access far fewer tables and needed a very different mechanism to do so).

I agree that it takes a while to get used to the idea of joining separate tables. But this approach that Hadley Wickham calls "Tidy Data", also known as "relational database design", where everything is just a table and all operations can be expressed by functions like joins, selects, and filters, is in fact incredibly powerful and incredibly general once you get used to it. Knowing that everything is a just a table and can be joined to other tables with key columns is in the end much more useful than relying on helper utilities like our old diet() function that could do some joins and filters internally. As you see, the dplyr package has made these operations relatively easy and intuitive, and the internet (and now the chatbots) are all well-acquainted with advice on how to use dplyr. It didn't exist when we started rfishbase, but now I highly recommend using precisely the patterns you show above, and just using fb_tbl() as the only rfishbase function to access any table. HTH.

(rfishbase still needs to be pointed to the 2024 data, new release still coming soon!)