iDigBio / research-project-ideas

Project ideas and discussion for research using iDigBio data and resources
MIT License

Look for divergent BIN / Locality / Species in BOLD datasets #12

Open godfoder opened 6 years ago

godfoder commented 6 years ago

Per Claude Nozeres (@claudenozeres): There are probably some good use cases for mining the BOLD dataset, looking for disjoint groups using clustering or other algorithms.

Exemplar: A well-studied species of fish had two clearly distinct BINs, one clustered in North America and one in Europe. It should be possible to look for these cases programmatically on a periodic basis, and possibly to flag them to either BOLD or the collection holders.
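A rough sketch of what such a check could look like, in Python. The record layout (species name, BIN, region) and all values are assumptions invented for illustration, not real BOLD data or the BOLD export format. It flags species whose BINs share no region at all, which is a much cruder test than real spatial clustering.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (species name, BIN id, region).
# All names, BIN ids and regions are invented for illustration.
LOCALITY_RECORDS = [
    ("Genus alpha", "BIN-1", "North America"),
    ("Genus alpha", "BIN-1", "North America"),
    ("Genus alpha", "BIN-2", "Europe"),
    ("Genus beta", "BIN-3", "Europe"),
    ("Genus beta", "BIN-4", "Europe"),
]


def disjoint_bin_pairs(records):
    """Return {species: [(bin_a, bin_b), ...]} for BIN pairs that share no region."""
    # species -> BIN -> set of regions where that BIN was recorded
    regions = defaultdict(lambda: defaultdict(set))
    for species, bin_id, region in records:
        regions[species][bin_id].add(region)

    flagged = {}
    for species, by_bin in regions.items():
        pairs = [
            (a, b)
            for a, b in combinations(sorted(by_bin), 2)
            if by_bin[a].isdisjoint(by_bin[b])
        ]
        if pairs:
            flagged[species] = pairs
    return flagged


if __name__ == "__main__":
    print(disjoint_bin_pairs(LOCALITY_RECORDS))
```

With the toy records, only "Genus alpha" is flagged, because its two BINs never occur in the same region. A real run would probably want coordinates and a distance threshold rather than coarse region labels.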

claudenozeres commented 6 years ago

Here is an example. I am still exploring the extent of this, manually so far, either on the web or in R with the BOLD package.

  1. Get species (download all, or by phylum if too big, like Arthropoda ;o).
  2. Compare & contrast public data specimens for a) the species name submitted and b) their BIN (Barcode Index Number, BOLD's sequence cluster). One would expect a 1:1 relation. Obviously, many will not match, since mistaken species names are sometimes submitted. This can now be explored in two ways.
  3. What is the majority species name per BIN? It is likely to be the correct name. This is the 'several names to 1 BIN' analysis.
  4. Conversely, a name spread over several BINs can sometimes reveal cryptic species. This is the 'several BINs to 1 name' analysis.
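The two analyses in the list above ('several names to 1 BIN' and 'several BINs to 1 name') could be sketched like this in Python. The records, names and BIN ids are made up for illustration, and the tuple layout is an assumption rather than the actual BOLD download format.

```python
from collections import Counter, defaultdict

# Hypothetical records: (specimen id, submitted species name, BIN id).
# Names and BIN ids are invented for illustration, not real BOLD data.
RECORDS = [
    ("s1", "Genus alpha", "BIN-1"),
    ("s2", "Genus alpha", "BIN-1"),
    ("s3", "Genus beta", "BIN-1"),   # minority name inside BIN-1
    ("s4", "Genus gamma", "BIN-2"),
    ("s5", "Genus gamma", "BIN-3"),  # same name, second BIN
]


def majority_names(records):
    """'Several names to 1 BIN': map each mixed BIN to its most common name."""
    counts = defaultdict(Counter)
    for _specimen, name, bin_id in records:
        counts[bin_id][name] += 1
    return {
        bin_id: counter.most_common(1)[0][0]
        for bin_id, counter in counts.items()
        if len(counter) > 1  # only BINs holding more than one name
    }


def split_names(records):
    """'Several BINs to 1 name': names found in more than one BIN."""
    bins = defaultdict(set)
    for _specimen, name, bin_id in records:
        bins[name].add(bin_id)
    return {name: sorted(b) for name, b in bins.items() if len(b) > 1}


if __name__ == "__main__":
    print(majority_names(RECORDS))
    print(split_names(RECORDS))
```

Note that `most_common` picks the first-seen name when counts are tied, so tied BINs would still need a manual look.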

Try it here with this simple example: Myxine glutinosa, http://www.boldsystems.org/index.php/Public_SearchTerms

Then I manually review WoRMS/OBIS/GBIF/BHL/EOL to find out what other names are reported and how they are distributed. In this case, a forgotten but valid name appears in the literature (Myxine limosa).

There are several fishes, and dozens of invertebrates, like this that I've noticed in manual searches and while browsing the records (in Excel, but no more: now to use new tools!).