ropensci / rfishbase

R interface to the fishbase.org database
https://docs.ropensci.org/rfishbase

SeaLifeBase species not validating #220

Closed reedmiller17 closed 1 year ago

reedmiller17 commented 3 years ago

Hi folks! Thanks for this package. I've pulled data from FishStatJ for marine catches at the species level, and am aiming to associate data from FishBase / SeaLifeBase with it. See attached list of species.

I've been able to validate_names for many of the marine fishes using the FishBase server, but am finding that when I try to set the server to SeaLifeBase to search for crustaceans etc, it seems to return the same results.

marinespec <- read.delim("FAO FishStat Marine specieslist.csv")
Sys.setenv(FISHBASE_API="sealifebase") # as suggested in package documentation
marinespec$valid <- validate_names(marinespec$SCIENTIFIC_NAME)
marinespec$valid2 <- validate_names(marinespec$SCIENTIFIC_NAME, server = getOption("FISHBASE_API", "sealifebase")) # tried this way as well

Here are some examples of species that return NA, but are found on the SeaLifeBase:

Alaria esculenta, https://www.sealifebase.ca/summary/Alaria-esculenta.html#
Erimacrus isenbeckii, https://www.sealifebase.ca/summary/Erimacrus-isenbeckii.html

I've noticed the search function on SeaLifeBase website is quite slow, so maybe that's something to do with it?

In general, it would be really handy if the FAO FishStatJ folks published a concordance list between their species and those in FishBase / SeaLifeBase; I think it would save a lot of headaches.

--

FAO FishStat Marine specieslist.txt

cboettig commented 3 years ago

Thanks @reedmiller17. We get periodic MySQL dumps from the FishBase / SeaLifeBase team; the most recent one from SLB is from April 2021. (If you haven't updated recently you may still have the version from April 2019; there was no 2020 release due to other stuff happening in the past year.) That latency is usually the culprit for differences, but things may have gone wrong in the import. I think you can check the date an entry was entered/modified, but I have forgotten where that interface is on the website. It is definitely worth reaching out to the official FishBase / SeaLifeBase team (though they occasionally answer here as well).

As you know, validating names in general is a tricky problem, with different data providers having different references for name validation, which makes data synthesis particularly difficult. Even when concordance lists are published, it can be difficult to keep them current when providers update their names. It may be entirely unhelpful, but recently I have tried to focus on name validation against other naming authorities, such as Catalogue of Life (which I believe draws from both of these), as well as other providers such as ITIS and NCBI. We have separate packages focused on working with that data; see taxalight and taxadb. For example:

library(tidyverse)
library(taxalight)
sp <- read_tsv("https://github.com/ropensci/rfishbase/files/6963958/FAO.FishStat.Marine.specieslist.txt")

col <- tl(sp$SCIENTIFIC_NAME, "col") # Catalogue of Life names
itis <- tl(sp$SCIENTIFIC_NAME, "itis") # itis

# We can see that 411 names did not resolve in ITIS
left_join(sp, itis, by=c("SCIENTIFIC_NAME" = "scientificName")) %>% filter(is.na(taxonID)) 

# Note that ITIS considers some of the provided names to be recognized synonyms, and provides
# us with the acceptedNameUsageID for these names
itis %>% count(taxonomicStatus)

# Resolve acceptedNameUsageID to actual names:
tl(itis$acceptedNameUsageID, "itis")

reedmiller17 commented 3 years ago

Hi Carl, Thanks so much for your reply! That's useful to know there are other approaches for validating names against other databases.

A couple follow up thoughts:

Do you have suggestions for next steps with accessing SeaLifeBase data? I'm not sure if the problem is particular to me, or if others are also not able to access the non-fish data there.

Thanks again for your help!

mmc1.txt

cboettig commented 3 years ago

Thanks @reedmiller17 for the reply, this is all very helpful.

The most recent version of rfishbase is 3.1.9. Make sure rfishbase::available_releases() shows "21.04" as the first option and not "19.04".

Thanks for trying out taxalight and reporting the error. Is this on a Windows platform? We may have some more debugging to do there. Meanwhile you could try skipping col and going with ITIS. Or, here's the same code using taxadb, which can be a little bit slower and more RAM-hungry:

library(tidyverse)
library(taxadb)
sp <- read_tsv("https://github.com/ropensci/rfishbase/files/6963958/FAO.FishStat.Marine.specieslist.txt")

itis <- filter_name(sp$SCIENTIFIC_NAME, "itis") # itis

# We can see that 411 names did not resolve in ITIS
left_join(sp, itis, by=c("SCIENTIFIC_NAME" = "scientificName")) %>% filter(is.na(taxonID)) 

# Note that ITIS considers some of the provided names to be recognized synonyms, and provides
# us with the acceptedNameUsageID for these names
itis %>% count(taxonomicStatus)

# Resolve acceptedNameUsageID to actual names:
filter_id(itis$acceptedNameUsageID, "itis")

Of course, resolving to accepted ITIS names doesn't guarantee they will match the names FishBase uses, so it might be worth trying all names that come back in the itis table (including synonyms). But it does usually help.
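
As a rough sketch of what "trying all names" could look like, the base R snippet below builds one candidate list per input name from both the matched name and the accepted name. The data frame itis_toy and its column names (input, scientificName, acceptedName) are made up for illustration; they are assumptions, not the actual columns returned by taxadb or taxalight.

```r
# Toy table with made-up names; column names are hypothetical.
itis_toy <- data.frame(
  input          = c("Name x", "Name y"),
  scientificName = c("Name x", "Name y"),  # the name as matched
  acceptedName   = c("Name x", "Name z"),  # the accepted name it resolves to
  stringsAsFactors = FALSE
)

# Stack matched and accepted names, then drop exact duplicate pairs.
candidates <- unique(rbind(
  data.frame(input = itis_toy$input, candidate = itis_toy$scientificName,
             stringsAsFactors = FALSE),
  data.frame(input = itis_toy$input, candidate = itis_toy$acceptedName,
             stringsAsFactors = FALSE)
))

# Each candidate can then be tried against the target database in turn.
print(candidates)
```

Keeping the original input name next to each candidate makes it possible to trace any later match back to the row in the species list it came from.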

reedmiller17 commented 3 years ago

Hi again,

Thanks for the heads up, I updated to 3.1.9. With this version, I did not get the error with col (but it only returned 2 results!).

Still not having luck with validating any non-fish with the SeaLifeBase server, including scientific names accepted by ITIS.

Thanks!

cboettig commented 3 years ago

Thanks @reedmiller17. Can you provide us with a reprex and the output of sessionInfo()?

reedmiller17 commented 3 years ago

Hi there! Here is sessionInfo, and below is reprex. Thanks again!

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reprex_2.0.1    taxadb_0.1.3    forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_2.0.0    
 [8] tidyr_1.1.3     tibble_3.1.3    ggplot2_3.3.5   tidyverse_1.3.1 taxalight_0.1.4 rfishbase_3.1.9

loaded via a namespace (and not attached):
 [1] httr_1.4.2        contentid_0.0.12  bit64_4.0.5       vroom_1.5.3       jsonlite_1.7.2    modelr_0.1.8     
 [7] assertthat_0.2.1  askpass_1.1       highr_0.9         blob_1.2.2        cellranger_1.1.0  yaml_2.2.1       
[13] progress_1.2.2    pillar_1.6.2      RSQLite_2.2.7     backports_1.2.1   glue_1.4.2        digest_0.6.27    
[19] rvest_1.0.1       colorspace_2.0-2  htmltools_0.5.1.1 arkdb_0.0.12      clipr_0.7.1       pkgconfig_2.0.3  
[25] broom_0.7.9       haven_2.4.3       scales_1.1.1      processx_3.5.2    tzdb_0.1.2        openssl_1.4.4    
[31] generics_0.1.0    ellipsis_0.3.2    cachem_1.0.5      withr_2.4.2       cli_3.0.1         magrittr_2.0.1   
[37] crayon_1.4.1      readxl_1.3.1      ps_1.6.0          memoise_2.0.0     evaluate_0.14     storr_1.2.5      
[43] fs_1.5.0          fansi_0.5.0       xml2_1.3.2        tools_4.1.0       gh_1.3.0          prettyunits_1.1.1
[49] hms_1.1.0         lifecycle_1.0.0   munsell_0.5.0     callr_3.7.0       compiler_4.1.0    duckdb_0.2.8     
[55] rlang_0.4.11      grid_4.1.0        rstudioapi_0.13   rmarkdown_2.10    gtable_0.3.0      DBI_1.1.1        
[61] curl_4.3.2        R6_2.5.0          lubridate_1.7.10  knitr_1.33        fastmap_1.1.0     thor_1.1.2       
[67] bit_4.0.4         utf8_1.2.2        stringi_1.7.3     Rcpp_1.0.7        vctrs_0.3.8       dbplyr_2.1.1     
[73] tidyselect_1.1.1  xfun_0.24 

Here is a reprex:

library(rfishbase)

library(tidyverse)
library(taxadb)
library(taxalight)
#> 
#> Attaching package: 'taxalight'
#> The following objects are masked from 'package:taxadb':
#> 
#>     get_ids, get_names
sp <- read_tsv("https://github.com/ropensci/rfishbase/files/6963958/FAO.FishStat.Marine.specieslist.txt")
#> Rows: 2197 Columns: 9
#> -- Column specification --------------------------------------------------------
#> Delimiter: "\t"
#> chr (8): ISSCAAP_DIVISION, ISSCAAP_GROUP, FAOSTAT_GROUP_OF_SPECIES, CPC_CLAS...
#> dbl (1): COUNT
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

#initial attempts
Sys.setenv(FISHBASE_API="fishbase")
sp$valid = validate_names(sp$SCIENTIFIC_NAME) #2197 rows
sum(is.na(sp$valid)) #929 NA; mostly non-fish and "___ spp" for NEI
#> [1] 929

Sys.setenv(FISHBASE_API="sealifebase")
sp$valid_se = validate_names(sp$SCIENTIFIC_NAME) #2197 rows
sum(is.na(sp$valid_se)) #929 NA; same set as fishbase
#> [1] 929

#match with ITIS & COL
col <- tl(sp$SCIENTIFIC_NAME, "col") # Catalogue of Life names #2 obs
itis <- tl(sp$SCIENTIFIC_NAME, "itis") # itis #3799 obs

# We can see that 411 names did not resolve in ITIS
itis_check <- left_join(sp, itis, by=c("SCIENTIFIC_NAME" = "scientificName")) %>% filter(is.na(taxonID)) 

# Note that ITIS considers some of the provided names to be recognized synonyms, and provides us with the acceptedNameUsageID for these names
itis %>% count(taxonomicStatus)
#>   taxonomicStatus    n
#> 1        accepted 1685
#> 2         synonym 2114

# Resolve acceptedNameUsageID to actual names: 
itis_act <- tl(itis$acceptedNameUsageID, "itis") #1772 obs

#attempt to validate ITIS scientific names
itis_sciname <- itis$scientificName #3799 obs

Sys.setenv(FISHBASE_API="fishbase")
itis_valid <- validate_names(itis_sciname) #3801 obs (unsure why > 3799)
itis_valid <- as.data.frame(cbind(itis_sciname, itis_valid[1:3799]))
sum(is.na(itis_valid[,2])) #1428 of 3799
#> [1] 1428

Sys.setenv(FISHBASE_API="sealifebase")
itis_valid_se <- validate_names(itis_sciname) #3801 obs
# itis_valid_se <- as.data.frame(cbind(itis_sciname, itis_valid_se[1:3799]))
itis_valid$valid_se <- itis_valid_se[1:3799]
sum(is.na(itis_valid$valid_se)) #1428 of 3799
#> [1] 1428

#merge with sp
colnames(itis_valid) <- c("SCIENTIFIC_NAME", "valid_fb", "valid_slb")
itis_valid <- full_join(sp, itis_valid)
#> Joining, by = "SCIENTIFIC_NAME"

cboettig commented 3 years ago

Ok, consider this:

(note we can use server="sealifebase" in functions below instead of the painful nonsense with Sys.setenv)

library(rfishbase)
library(tidyverse)
library(taxalight)

sp <- read_tsv("https://github.com/ropensci/rfishbase/files/6963958/FAO.FishStat.Marine.specieslist.txt")

nrow(sp) # 2197 names

## Let's use taxalight to expand this list to include synonyms recognized by ITIS
itis <- tl(sp$SCIENTIFIC_NAME, "itis") %>% as_tibble()

dim(itis)[1]
## But how many unique taxa do these names correspond to?
itis %>% select(acceptedNameUsageID) %>% distinct() %>% nrow()
# 1772

## Let's resolve all matched names in ITIS against SLB's synonym table directly instead:
slb_taxa <- synonyms(server="sealifebase")
slb_resolved <- itis %>% left_join(slb_taxa, by = c("scientificName"="synonym"))

## Extract the SpecCode of all recognized names, and resolve it against sealifebase taxa table to get the SLB accepted name
slb_id <- slb_resolved %>% select(SpecCode) %>% distinct()
## Resolve those ids to SLB accepted species names
slb_accepted <- species(server = "sealifebase", fields=c("SpecCode", "Species"))
slb_name <- slb_id %>% left_join(slb_accepted)

nrow(slb_name)
# 385 unique names are accepted in SLB

385 names from sealifebase isn't too bad. We could work a bit harder on the unresolved names, of course. This approach is still pretty primitive: it relies on exact string matching of the names, which breaks when species names are given in a different case or character encoding, or when they include the authority/publication reference in the name, etc. Standardizing names before attempting any of the table joins above can help with this (e.g. see taxadb::clean_names()).
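
To make the standardization step concrete, here is a minimal base R sketch. The helper normalize_name() is hypothetical (it is not part of rfishbase or taxadb, and is not a substitute for taxadb::clean_names()); it only collapses whitespace, fixes case, and keeps the first two words. The authority string in the example is a placeholder, not a real citation.

```r
# Hypothetical helper: normalise a scientific name before exact-match joins.
# Assumption: names are binomials; anything after the second word
# (authority, year, subspecies) is dropped, which is lossy.
normalize_name <- function(x) {
  x <- enc2utf8(x)                    # use one encoding throughout
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  x <- trimws(x)
  parts <- strsplit(x, " ", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) == 0 || is.na(p[1])) return(NA_character_)
    p <- p[seq_len(min(2, length(p)))]
    genus <- paste0(toupper(substring(p[1], 1, 1)), tolower(substring(p[1], 2)))
    if (length(p) == 1) genus else paste(genus, tolower(p[2]))
  }, character(1))
}

cleaned <- normalize_name(c(
  "  erimacrus   ISENBECKII ",
  "Alaria esculenta (Author, 1900)",  # placeholder authority
  NA
))
print(cleaned)
```

Because the helper truncates to two words, entries such as subspecies or "spp" groupings would need separate handling before relying on it.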

The above strategy with taxalight also only attempts to identify names on your list which ITIS considers synonyms and return the accepted name. In particular, if the name is already an accepted name, it is not giving you all known synonyms for that name. You may want to try this as well (see taxadb, which is more flexible in this regard), but be warned, crosswalking synonyms like that across naming providers can lead to taxonomic nonsense... (e.g. it is common to discover that provider I considers A, B, & C to be synonyms, where provider II considers names A and B to be accepted names of distinct species; this can 'create' or 'destroy' species from your species list).
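
The A/B/C situation can be reproduced with a toy example in base R. All names and tables below are made up; they only illustrate how the species count of the same list changes with the provider used to resolve it.

```r
# Toy data, entirely made up: how two providers treat the names A, B and C.
provider_1 <- data.frame(
  name     = c("A", "B", "C"),
  accepted = c("A", "A", "A"),   # provider I: B and C are synonyms of A
  stringsAsFactors = FALSE
)
provider_2 <- data.frame(
  name     = c("A", "B"),
  accepted = c("A", "B"),        # provider II: A and B are distinct species
  stringsAsFactors = FALSE
)

# A species list containing both A and B
my_list <- data.frame(name = c("A", "B"), stringsAsFactors = FALSE)

via_1 <- merge(my_list, provider_1, by = "name")
via_2 <- merge(my_list, provider_2, by = "name")

n_species_1 <- length(unique(via_1$accepted))  # A and B collapse into one
n_species_2 <- length(unique(via_2$accepted))  # A and B stay separate
print(c(provider_1 = n_species_1, provider_2 = n_species_2))
```

Comparing the number of distinct accepted names before and after a crosswalk is a cheap way to notice when species have been merged or split along the way.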

Unfortunately it always ends up being a laborious process getting the last handful of names to resolve before you can really conclude that names on your species list have no match in the corresponding database.

Here's the same as above in rfishbase.

## Repeat (more concisely) using fishbase
fb_taxa <- species(fields=c("SpecCode", "Species"))
fb_name <- itis %>% 
  left_join(synonyms(server="fishbase"), by = c("scientificName"="synonym")) %>%
  select(SpecCode) %>% distinct() %>% filter(!is.na(SpecCode)) %>%
  left_join(fb_taxa) %>% distinct()

nrow(fb_name)
#  1570 fb names

cboettig commented 3 years ago

p.s. I agree with https://github.com/ropensci/rfishbase/issues/212 that validate_names() ought to return a table; the current behavior is unsafe. Hence I have used a tabular approach via the synonyms() table above instead.
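
A toy illustration of why a table is safer than a bare vector, in base R. The synonym table below is made up and merely stands in for the kind of table synonyms() returns. It shows one plausible way a result can end up longer than its input (a name matching more than one record); that is an assumption for illustration, not a diagnosis of the 3801 vs 3799 mismatch reported earlier in this thread.

```r
# Made-up synonym table: "Old name one" points to two different SpecCodes.
syn <- data.frame(
  synonym  = c("Old name one", "Old name one", "Good name two"),
  SpecCode = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

input <- data.frame(
  name = c("Old name one", "Unknown name", "Good name two"),
  stringsAsFactors = FALSE
)

# Left join: every input name is kept, and matches sit next to their input.
resolved <- merge(input, syn, by.x = "name", by.y = "synonym", all.x = TRUE)
print(resolved)

# The result has more rows than the input, so binding results back to the
# input by position would misalign rows; the table keeps the pairing explicit.
nrow(resolved) > nrow(input)
```

With the input name carried along in each row, ambiguous matches and unmatched names (NA in SpecCode) can be inspected directly instead of being silently shifted.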

reedmiller17 commented 3 years ago

Hi, This is really fantastic! Thank you so much for your effort sorting this out.

As I'm digging into it, I'm realizing that there are a good deal of species without assessed Vulnerability scores, which might pose a challenge to my method overall...

Best, ~Reed

reedmiller17 commented 3 years ago

Hi again! Returning to this, the beginning of the code you shared on August 11th no longer works: itis has 0 rows and 0 variables. I recently updated my R version (see session info below), so perhaps that's the issue? I'm running rfishbase 3.1.9. Thanks again, ~Reed

library(rfishbase)
library(tidyverse)
library(taxalight)

sp <- read_tsv("https://github.com/ropensci/rfishbase/files/6963958/FAO.FishStat.Marine.specieslist.txt")
#> Rows: 2197 Columns: 9
#> -- Column specification --------------------------------------------------------
#> Delimiter: "\t"
#> chr (8): ISSCAAP_DIVISION, ISSCAAP_GROUP, FAOSTAT_GROUP_OF_SPECIES, CPC_CLAS...
#> dbl (1): COUNT
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

nrow(sp) # 2197 names
#> [1] 2197

## Let's use taxalight to expand this list to include synonyms recognized by ITIS
itis <- tl(sp$SCIENTIFIC_NAME, "itis") %>% as_tibble()

dim(itis)[1]
#> [1] 0
## But how many unique taxa do these names correspond to?
itis %>% select(acceptedNameUsageID) %>% distinct() %>% nrow()
#> Error: Can't subset columns that don't exist.
#> x Column `acceptedNameUsageID` doesn't exist.
# 1772

## Lets resolve all matched names in ITIS against SLB's synonym table directly instead:
slb_taxa <- synonyms(server="sealifebase") 
nrow(slb_taxa)
#> [1] 166799
slb_resolved <- itis %>% left_join(slb_taxa, by = c("scientificName"="synonym"))
#> Error: Join columns must be present in data.
#> x Problem with `scientificName`.

Here is the session info:

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reprex_2.0.1

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13 knitr_1.35      magrittr_2.0.1  R6_2.5.1        rlang_0.4.11   
 [6] fastmap_1.1.0   fansi_0.5.0     highr_0.9       tools_4.1.1     xfun_0.26      
[11] utf8_1.2.2      cli_3.0.1       clipr_0.7.1     withr_2.4.2     htmltools_0.5.2
[16] ellipsis_0.3.2  yaml_2.2.1      digest_0.6.28   tibble_3.1.4    lifecycle_1.0.1
[21] crayon_1.4.1    processx_3.5.2  callr_3.7.0     vctrs_0.3.8     fs_1.5.0       
[26] ps_1.6.0        glue_1.4.2      evaluate_0.14   rmarkdown_2.11  compiler_4.1.1 
[31] pillar_1.6.3    pkgconfig_2.0.3

cboettig commented 3 years ago

@reedmiller17 that's definitely unexpected! I can't reproduce this; the above still works for me.

Let's try purging the taxalight database and starting fresh:

fs::dir_delete(taxalight::tl_dir())
taxalight::tl_import("itis")

and try your script again?