ropensci / rfishbase

R interface to the fishbase.org database
https://docs.ropensci.org/rfishbase

validate_names returns array of suggestions errantly #38

Closed sartonic closed 9 years ago

sartonic commented 9 years ago

I am using validate_names on a list of over 1000 scientific names, which are mostly correct, but may have small spelling errors occasionally. I've included a subset of the species below.

> species = c("Ablennes hians", "Acanthopagrus schlegeli", "Acanthopagrus berda", "Auxis thazard", "Auxis rochei")

Of these species, I can (after the fact) verify that A. schlegeli, A. thazard, and A. rochei are misspelled. However, when I run validate_names, several interesting things happen...

> validate_names(species)

[[1]]
[1] "Ablennes hians"

[[2]]
[1] "Acanthopagrus schlegelii schlegelii"

[[3]]
[1] "Acanthopagrus berda"     "Acanthopagrus vagus"     "Acanthopagrus pacificus"

[[4]]
[1] "Auxis rochei rochei"   "Auxis thazard thazard"

[[5]]
[1] "Auxis rochei rochei"

The function returns an array of arrays. Within that... 1) A. hians had no spelling issues, and was returned correctly; 2) A. schlegeli had a spelling issue and was returned fixed; 3) A. berda had no spelling issues, but an array of alternatives was also returned; 4) A. thazard was misspelled, and was returned fixed, but placed second in an array of alternatives; 5) A. rochei was misspelled, and was returned fixed.

It seems like there are two issues here. The first is that even when there is an exact string match, an array of suggestions is returned (case 3 above), when I would have thought that just the original string would be returned. Is this intentional? Second, when a species name does need to be corrected, the closest match isn't always returned first in the array of suggestions (case 4 above).

For me, ideally only the best match would be returned, and at the very least the best match would be returned first in an array of alternatives. This is because, now, if I use this array of arrays as an input to another function, like species_info, only the first entry of each array is used as the species name parameter. This leads to duplication issues, and left out A. thazard entirely, as you can see below.

> speciesList = validate_names(species)
> species_info(speciesList, fields = c("SpecCode", "Genus", "Species", "FBname"))

Source: local data frame [5 x 4]

  SpecCode         Genus               Species             FBname
1      972      Ablennes                 hians    Flat needlefish
2     6531 Acanthopagrus schlegelii schlegelii Blackhead seabream
3     5526 Acanthopagrus                 berda  Goldsilk seabream
4       93         Auxis         rochei rochei        Bullet tuna
5       93         Auxis         rochei rochei        Bullet tuna
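
The duplication can be reproduced without querying FishBase. Here is a minimal sketch in base R, assuming that downstream functions keep only the first element of each list entry, as described above. The list is copied from the validate_names output earlier in this comment; `first_only`, `duplicated_names`, and `dropped` are hypothetical names used only for illustration.

```r
# List copied from the validate_names() output shown above.
speciesList <- list(
  "Ablennes hians",
  "Acanthopagrus schlegelii schlegelii",
  c("Acanthopagrus berda", "Acanthopagrus vagus", "Acanthopagrus pacificus"),
  c("Auxis rochei rochei", "Auxis thazard thazard"),
  "Auxis rochei rochei"
)

# Keep only the first suggestion for each input name.
first_only <- vapply(speciesList, function(x) x[1], character(1))

# Names that now appear more than once, and suggestions that were lost.
duplicated_names <- unique(first_only[duplicated(first_only)])
dropped <- setdiff(unlist(speciesList), first_only)
```

Here `duplicated_names` contains "Auxis rochei rochei" and `dropped` includes "Auxis thazard thazard", which matches what the species_info output shows.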

cboettig commented 9 years ago

Thanks, sounds like a bug. Can you take a look at the tables returned by synonyms() instead? They will have more information about the matches and which name FishBase considers valid. I'll take a closer look when I get a chance.

sartonic commented 9 years ago

Thanks! So running synonyms on that set of species looks like this:

> synonyms(species)

Source: local data frame [8 x 11]

       SynGenus SynSpecies Valid Misspelling        Synonymy     Combination SpecCode SynCode CoL_ID    TSN WoRMS_ID
1      Ablennes      hians  TRUE       FALSE  senior synonym new combination      972   22975     NA 165548   159246
2 Acanthopagrus  schlegeli FALSE        TRUE  senior synonym new combination     6531   53924     NA     NA   401896
3 Acanthopagrus      berda  TRUE       FALSE  senior synonym new combination     5526   25680     NA 647906   218588
4 Acanthopagrus      berda FALSE       FALSE misapplied name      misapplied    65558  164245     NA     NA       NA
5 Acanthopagrus      berda FALSE       FALSE misapplied name      misapplied    65896  164860     NA     NA       NA
6         Auxis    thazard FALSE       FALSE misapplied name      misapplied       93   10117     NA     NA       NA
7         Auxis    thazard FALSE       FALSE  senior synonym new combination       94   22739     NA 172456   127016
8         Auxis     rochei FALSE       FALSE  senior synonym new combination       93   22738     NA 172455   127015

cboettig commented 9 years ago

@sartonic Thanks, I've just pushed a fix. You should now see:

> validate_names(species)
[1] "Ablennes hians"                      "Acanthopagrus schlegelii schlegelii"
[3] "Acanthopagrus berda"                 "Auxis thazard thazard"              
[5] "Auxis rochei rochei"                
Warning messages:
1: FishBase says that 'Acanthopagrus berda' can also be misapplied to other species
                    but is returning only the best match.  
                    See synonyms('Acanthopagrus berda') for details 
2: FishBase says that 'Auxis thazard' can also be misapplied to other species
                    but is returning only the best match.  
                    See synonyms('Auxis thazard') for details 

So the problem here is that FishBase recognizes certain names as being "misapplied", as you see in the synonyms table. For instance, FishBase is telling us that A. thazard is sometimes misapplied to the species that it recognizes as Auxis rochei rochei, as well as being correctly applied to the species it recognizes as Auxis thazard thazard (FishBase.org does not believe in treating subspecies as a separate taxonomic level).

So obviously this is not something that we can completely solve with code, since some people use the name Auxis thazard to refer to what FishBase considers Auxis thazard thazard, but others use that very same name to refer to what FishBase considers to be Auxis rochei rochei.

As you see, I've opted for just ignoring misapplied names, which should return a nice character vector instead. But one could argue that this is not the best behavior. In the original list you saw, FishBase is saying Acanthopagrus berda could mean any one of those three species listed.
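
As a rough illustration of what "ignoring misapplied names" means, here is a sketch in base R. It is not the code inside rfishbase; it simply filters a hand-copied subset of the synonyms table from earlier in this thread, and `syn`, `kept`, and `best` are hypothetical names.

```r
# Rows copied from the synonyms(species) table earlier in this thread
# (only the columns needed here).
syn <- data.frame(
  SynGenus   = c("Ablennes", "Acanthopagrus", "Acanthopagrus", "Acanthopagrus",
                 "Acanthopagrus", "Auxis", "Auxis", "Auxis"),
  SynSpecies = c("hians", "schlegeli", "berda", "berda", "berda",
                 "thazard", "thazard", "rochei"),
  Synonymy   = c("senior synonym", "senior synonym", "senior synonym",
                 "misapplied name", "misapplied name", "misapplied name",
                 "senior synonym", "senior synonym"),
  SpecCode   = c(972, 6531, 5526, 65558, 65896, 93, 94, 93),
  stringsAsFactors = FALSE
)

# Ignore rows where FishBase flags the name as misapplied.
kept <- syn[syn$Synonymy != "misapplied name", ]

# Exactly one SpecCode per input name remains.
best <- setNames(kept$SpecCode, paste(kept$SynGenus, kept$SynSpecies))
```

With the misapplied rows removed, "Auxis thazard" maps only to SpecCode 94 and "Acanthopagrus berda" only to 5526, so each input name resolves to a single species.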

This is why catalogs use code numbers instead of names -- if you asked for "The species with FishBase SpecCode 94", or equivalently, the species with "TSN code 172456" or "WoRMS_ID 127016", there would be no ambiguity, even though the species may go by different Latin names to different people or even different Latin names in the different databases.
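
To make the contrast concrete, here is a small base R sketch using three rows copied from the synonyms table above. The column name `name` and the variables `codes_for_name` and `code_for_tsn` are hypothetical and used only for illustration.

```r
# Three rows copied from the synonyms table in this thread.
syn <- data.frame(
  name     = c("Auxis thazard", "Auxis thazard", "Auxis rochei"),
  SpecCode = c(93, 94, 93),
  TSN      = c(NA, 172456, 172455),
  stringsAsFactors = FALSE
)

# A name can point at more than one species ...
codes_for_name <- unique(syn$SpecCode[syn$name == "Auxis thazard"])

# ... while a catalog code identifies exactly one (which() skips the NA).
code_for_tsn <- unique(syn$SpecCode[which(syn$TSN == 172456)])
```

Looking up by name yields two SpecCodes (93 and 94), while looking up by TSN yields only 94.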

Anyway, let me know if you think the above fix and the associated warning message is reasonable. Thanks!

sartonic commented 9 years ago

I think that is a great fix, thank you very much!

I just tried it on my dataset, and I will just mention that the output is still an array of arrays, because when it can't find a species name in the fishbase database, it returns an empty array. This isn't an issue for me, since other functions like species_info() will just ignore those empty arrays, but just wanted to point it out in case you think it might be problematic for other reasons.

Thanks!

cboettig commented 9 years ago

Ah, good point. For consistency's sake I've pushed a change that should just flatten the list and drop missing entries. It will also warn when it cannot match a species.

I don't really like chatty warning messages, but it seems worse to silently return only m < n names when given n names to validate (and there's always suppressWarnings() for the annoyed user).
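
Roughly, the idea looks like the base R sketch below. This is an illustration under stated assumptions rather than the actual rfishbase code: `flatten_matches` is a hypothetical helper, the per-name results are assumed to arrive as a list in which character(0) marks an unmatched name, and "Notta realfish" is an invented name.

```r
# Hypothetical per-name results: character(0) marks a name with no match.
queries <- c("Ablennes hians", "Notta realfish", "Auxis rochei")
matches <- list("Ablennes hians", character(0), "Auxis rochei rochei")

flatten_matches <- function(queries, matches) {
  missing <- lengths(matches) == 0
  # Warn once per unmatched name so that m < n is never silent.
  for (q in queries[missing]) {
    warning("No match found for species '", q, "'", call. = FALSE)
  }
  # Drop the empty entries and return a plain character vector.
  unlist(matches[!missing])
}

# Users who find the warnings too chatty can silence them explicitly.
validated <- suppressWarnings(flatten_matches(queries, matches))
```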

Feedback welcome as always; as you see, the design is still pretty rough here.

Cheers,

sartonic commented 9 years ago

Yes, I think it is good to clarify whenever the output m is less than the input n. I actually have a tiny follow-up question based on that same idea.

Let's say I run validate_names on a list of 864 scientific names and the output of validate_names is 667. Running the validated names through species_info() for fields = c("SpecCode", "Genus", "Species", "FBname") returns 667 entries, but running the validated names through ecology() for fields=c("SpecCode", "FoodTroph", "FoodSeTroph", "DietTroph", "DietSeTroph") returns only 628 entries. What accounts for that difference? I would have thought perhaps ecology() was excluding species for which all of the fields (except SpecCode) were NA, but there are a few examples in the output where there are NAs for each field.

Also I just downloaded and reinstalled rfishbase to get the latest push and now the function seems to be broken:

> validatedSpecies = validate_names(mySpecies)
Error in function (type, msg, asError = TRUE)  : 
  Failed connect to fishbase.ropensci.org:80; Connection refused
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> testValidate = validate_names("Ablennes  hians")
Warning messages:
1: In check_and_parse(resp) : server error: (502) Bad Gateway
2: In error_checks(parsed, resp = resp) :
  Failed to parse or empty query results for http://fishbase.ropensci.org/synonyms?SynSpecies=%20hians&SynGenus=Ablennes&limit=50&fields=SynGenus%2CSynSpecies%2CValid%2CMisspelling%2CColStatus%2CSynonymy%2CCombination%2CSpecCode%2CSynCode%2CCoL_ID%2CTSN%2CWoRMS_ID
3: No match found for species 'Ablennes  hians' 

cboettig commented 9 years ago

Re: ecology() vs species_info(), every species FishBase knows about has an entry in the species_info() table (it's somewhat analogous to the summary/home page you see for the species on the website); but not every species has an entry in the ecology() table -- for some species FishBase just doesn't have trophic ecology information. On the website you'll see a list of tables on every summary page that is called "additional information", and all missing tables are just grayed out.

ecology() is pretty good; other tables are even more sparse.
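
If you want to see which species account for the difference, comparing SpecCodes between the two results should do it. Here is a sketch in base R with toy stand-ins for the two tables; the SpecCodes come from this thread, but which of them lack an ecology row is invented for illustration, and `info`, `eco`, and `missing_codes` are hypothetical names.

```r
# Toy stand-in for the species_info() result: one row per validated species.
info <- data.frame(SpecCode = c(972, 6531, 5526, 94, 93))

# Toy stand-in for the ecology() result: a row may exist even when every
# requested field is NA, and some species have no row at all.
eco <- data.frame(SpecCode  = c(972, 5526, 93),
                  FoodTroph = c(NA_real_, NA_real_, NA_real_))

# Species present in species_info() but absent from ecology().
missing_codes <- setdiff(info$SpecCode, eco$SpecCode)
```

The length of `missing_codes` equals the gap between the two row counts, and rows that are all NA still count as present in the ecology table.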

Re: the connection being broken -- that's not actually anything to do with the recent rfishbase commit. As I mentioned, this version is still in development and that was due to me restarting the web server that provides the fishbase API (hence "connection refused" error -- I know it's pretty opaque, automated errors always are). You can use the functions heartbeat() or ping() to see if the server is alive. The server will be more stable once we have a stable release for CRAN, but for now we're still fiddling with it to add more features and endpoints, etc.

cboettig commented 9 years ago

p.s. the server should be back up now ;-)

sartonic commented 9 years ago

Oh that makes sense, thanks!

cboettig commented 9 years ago

@sartonic Sounds like this is all resolved now, so I'll close this issue. Feel free to open any new issues if there's anything you'd like to see added or changed!