Thanks, sounds like a bug. Can you take a look at the table returned by synonyms() instead? It has more information about the matches and which name FishBase considers valid. I'll take a closer look when I get a chance.
On Tue, Mar 31, 2015, 6:16 PM sartonic notifications@github.com wrote:
I am using validate_names on a list of over 1000 scientific names, which are mostly correct, but may have small spelling errors occasionally. I've included a subset of the species below.
species = c("Ablennes hians", "Acanthopagrus schlegeli", "Acanthopagrus berda", "Auxis thazard", "Auxis rochei")
Of these species, I can (after-the-fact) verify that A. schlegeli, A. thazard, and A. rochei are misspelled. However, when I run validate_names, several interesting things happen...
validate_names(species)
[[1]]
[1] "Ablennes hians"

[[2]]
[1] "Acanthopagrus schlegelii schlegelii"

[[3]]
[1] "Acanthopagrus berda" "Acanthopagrus vagus" "Acanthopagrus pacificus"

[[4]]
[1] "Auxis rochei rochei" "Auxis thazard thazard"

[[5]]
[1] "Auxis rochei rochei"
The function returns an array of arrays. Within that... 1) A. hians had no spelling issues, and was returned correctly; 2) A. schlegeli had a spelling issue and was returned fixed; 3) A. berda had no spelling issues, but an array of alternatives was also returned; 4) A. thazard was misspelled, and was returned fixed, but placed second in an array of alternatives; 5) A. rochei was misspelled, and was returned fixed.
It seems like there are two issues here. The first is that even when there is an exact string match, an array of suggestions is returned (#3), when I would have thought that just the original string would be returned. Is this intentional? Second, when a species name does need to be corrected, the closest match isn't always returned first in the array of suggestions (#4).
For me, ideally only the best match would be returned, and at the very least the best match would be returned first in an array of alternatives. This is because, now, if I use this array of arrays as an input to another function, like species_info, only the first entry of each array is used as the species name parameter. This leads to duplication issues, and left out A. thazard entirely, as you can see below.
speciesList = validate_names(species)
species_info(speciesList, fields = c("SpecCode", "Genus", "Species", "FBname"))
Source: local data frame [5 x 4]
  SpecCode Genus         Species               FBname
1      972 Ablennes      hians                 Flat needlefish
2     6531 Acanthopagrus schlegelii schlegelii Blackhead seabream
3     5526 Acanthopagrus berda                 Goldsilk seabream
4       93 Auxis         rochei rochei         Bullet tuna
5       93 Auxis         rochei rochei         Bullet tuna
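The duplication described above can be reproduced without contacting the server. The following is a minimal base R sketch, assuming the list printed by validate_names(species) is typed in by hand; the variable names are illustrative and not part of rfishbase.

```r
# Output of validate_names(species) as reported above, entered by hand.
validated <- list(
  "Ablennes hians",
  "Acanthopagrus schlegelii schlegelii",
  c("Acanthopagrus berda", "Acanthopagrus vagus", "Acanthopagrus pacificus"),
  c("Auxis rochei rochei", "Auxis thazard thazard"),
  "Auxis rochei rochei"
)

# Keeping only the first entry of each element mimics a downstream function
# that uses one name per list element.
first_only <- vapply(validated, function(x) x[1], character(1))

# Names that occur more than once after that step.
duplicated_names <- unique(first_only[duplicated(first_only)])

# Names that were offered as candidates but never reach the downstream call.
dropped_names <- setdiff(unlist(validated), first_only)

print(duplicated_names)
print(dropped_names)
```

Here "Auxis rochei rochei" shows up twice and "Auxis thazard thazard" is among the dropped candidates, which matches the species_info() output shown above.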
Thanks! So running synonyms on that set of species looks like this:
> synonyms(species)
Source: local data frame [8 x 11]
  SynGenus      SynSpecies Valid Misspelling Synonymy        Combination     SpecCode SynCode CoL_ID    TSN WoRMS_ID
1 Ablennes      hians      TRUE  FALSE       senior synonym  new combination      972   22975     NA 165548   159246
2 Acanthopagrus schlegeli  FALSE TRUE        senior synonym  new combination     6531   53924     NA     NA   401896
3 Acanthopagrus berda      TRUE  FALSE       senior synonym  new combination     5526   25680     NA 647906   218588
4 Acanthopagrus berda      FALSE FALSE       misapplied name misapplied         65558  164245     NA     NA       NA
5 Acanthopagrus berda      FALSE FALSE       misapplied name misapplied         65896  164860     NA     NA       NA
6 Auxis         thazard    FALSE FALSE       misapplied name misapplied            93   10117     NA     NA       NA
7 Auxis         thazard    FALSE FALSE       senior synonym  new combination       94   22739     NA 172456   127016
8 Auxis         rochei     FALSE FALSE       senior synonym  new combination       93   22738     NA 172455   127015
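One way to read the table above is to count how many distinct SpecCodes each input name points to. This is only a sketch in base R: the data frame is copied by hand from the synonyms() output (four columns only), and the `name` column and the variable names are illustrative.

```r
# Selected columns of the synonyms() output above, entered by hand.
syn <- data.frame(
  SynGenus = c("Ablennes", "Acanthopagrus", "Acanthopagrus", "Acanthopagrus",
               "Acanthopagrus", "Auxis", "Auxis", "Auxis"),
  SynSpecies = c("hians", "schlegeli", "berda", "berda",
                 "berda", "thazard", "thazard", "rochei"),
  Synonymy = c("senior synonym", "senior synonym", "senior synonym",
               "misapplied name", "misapplied name", "misapplied name",
               "senior synonym", "senior synonym"),
  SpecCode = c(972, 6531, 5526, 65558, 65896, 93, 94, 93),
  stringsAsFactors = FALSE
)
syn$name <- paste(syn$SynGenus, syn$SynSpecies)

# Number of distinct SpecCodes each queried name maps to.
codes_per_name <- tapply(syn$SpecCode, syn$name, function(x) length(unique(x)))

# Names that map to more than one species are the ambiguous ones.
ambiguous <- names(codes_per_name)[codes_per_name > 1]
print(ambiguous)
```

The two names flagged this way, Acanthopagrus berda and Auxis thazard, are the same two that produced multi-element entries in the original validate_names() output.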
@sartonic Thanks, I've just pushed a fix. You should now see:
> validate_names(species)
[1] "Ablennes hians" "Acanthopagrus schlegelii schlegelii"
[3] "Acanthopagrus berda" "Auxis thazard thazard"
[5] "Auxis rochei rochei"
Warning messages:
1: FishBase says that 'Acanthopagrus berda' can also be misapplied to other species
but is returning only the best match.
See synonyms('Acanthopagrus berda') for details
2: FishBase says that 'Auxis thazard' can also be misapplied to other species
but is returning only the best match.
See synonyms('Auxis thazard') for details
So the problem here is that FishBase recognizes certain names as being "misapplied", as you see in the synonyms table. For instance, FishBase is telling us that A. thazard is sometimes misapplied to the species that it recognizes as Auxis rochei rochei, as well as being correctly applied to the species it recognizes as Auxis thazard thazard (FishBase.org does not believe in treating subspecies as a separate taxonomic level).
So obviously this is not something that we can completely solve with code, since some people use the name Auxis thazard to refer to what FishBase considers Auxis thazard thazard, but others use that very same name to refer to what FishBase considers to be Auxis rochei rochei.
As you see, I've opted for just ignoring misapplied names, which should return a nice character vector instead. But one could argue that this is not the best behavior. In the original list you saw, FishBase is saying Acanthopagrus berda could mean any one of those three species listed.
This is why catalogs use code numbers instead of names: if you asked for "the species with FishBase SpecCode 94", or equivalently the species with "TSN code 172456" or "WoRMS_ID 127016", there would be no ambiguity, even though the species may go by different Latin names to different people, or even by different Latin names in the different databases.
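To make the rule concrete, here is a rough sketch in base R. It assumes that "ignoring misapplied names" amounts to dropping rows whose Synonymy is "misapplied name" and keeping the SpecCode that remains; the actual rfishbase code may well do this differently. The rows are copied by hand from the synonyms table, and `best_code` is a hypothetical helper, not an rfishbase function.

```r
# Rows for the three Auxis / Acanthopagrus berda queries, entered by hand.
syn <- data.frame(
  name = c("Acanthopagrus berda", "Acanthopagrus berda", "Acanthopagrus berda",
           "Auxis thazard", "Auxis thazard", "Auxis rochei"),
  Synonymy = c("senior synonym", "misapplied name", "misapplied name",
               "misapplied name", "senior synonym", "senior synonym"),
  SpecCode = c(5526, 65558, 65896, 93, 94, 93),
  stringsAsFactors = FALSE
)

# Hypothetical helper: SpecCode for a name once misapplied rows are ignored.
best_code <- function(name, table) {
  keep <- table$name == name & table$Synonymy != "misapplied name"
  unique(table$SpecCode[keep])
}

queries <- c("Acanthopagrus berda", "Auxis thazard", "Auxis rochei")
codes <- vapply(queries, function(n) best_code(n, syn), numeric(1))
print(codes)
```

Once each name is reduced to a single SpecCode, later lookups can be keyed on the code rather than on the Latin name, which sidesteps the ambiguity.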
Anyway, let me know if you think the above fix and the associated warning message is reasonable. Thanks!
I think that is a great fix, thank you very much!
I just tried it on my dataset, and I will just mention that the output is still an array of arrays, because when it can't find a species name in the fishbase database, it returns an empty array. This isn't an issue for me, since other functions like species_info() will just ignore those empty arrays, but just wanted to point it out in case you think it might be problematic for other reasons.
Thanks!
Ah, good point. For consistency's sake I've pushed a change that should just flatten the list and drop missing entries. It will also warn when it cannot match a species.
I don't really like chatty warning messages, but it seems worse to silently return only m < n names when given n names to validate (and there's always suppressWarnings() for the annoyed user).
Feedback welcome as always; as you see, the design is still pretty rough here.
Cheers,
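The flatten-and-warn behavior described in this comment could look roughly like the following in base R. This is a sketch under assumptions, not the code that was pushed: `flatten_validated` is a hypothetical helper, and the input list is a made-up stand-in in which one name has no match.

```r
# Hypothetical helper, not the rfishbase implementation: flatten a list of
# validated names, drop empty entries, and warn about unmatched inputs.
flatten_validated <- function(input, validated) {
  n_matches <- lengths(validated)
  unmatched <- input[n_matches == 0]
  if (length(unmatched) > 0) {
    warning("No match found for: ", paste(unmatched, collapse = ", "))
  }
  # unlist() drops zero-length elements, so the result may be shorter
  # than the input.
  unlist(validated)
}

# Stand-in data: the second name is deliberately not a real species.
input <- c("Ablennes hians", "Not a fish", "Auxis rochei")
validated <- list("Ablennes hians", character(0), "Auxis rochei rochei")

flat <- suppressWarnings(flatten_validated(input, validated))
print(flat)
```

Because the result can be shorter than the input, the warning is the only place where the caller learns which names were dropped.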
Yes, I think it is good to clarify whenever the output m is less than the input n. I actually have a tiny follow-up question based on that same idea.
Let's say I run validate_names on a list of 864 scientific names and the output of validate_names is 667. Running the validated names through species_info() for fields = c("SpecCode", "Genus", "Species", "FBname") returns 667 entries, but running the validated names through ecology() for fields = c("SpecCode", "FoodTroph", "FoodSeTroph", "DietTroph", "DietSeTroph") returns only 628 entries. What accounts for that difference? I would have thought perhaps ecology() was excluding species for which all of the fields (except SpecCode) were NA, but there are a few examples in the output where there are NAs for each field.
Also I just downloaded and reinstalled rfishbase to get the latest push and now the function seems to be broken:
> validatedSpecies = validate_names(mySpecies)
Error in function (type, msg, asError = TRUE) :
Failed connect to fishbase.ropensci.org:80; Connection refused
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> testValidate = validate_names("Ablennes hians")
Warning messages:
1: In check_and_parse(resp) : server error: (502) Bad Gateway
2: In error_checks(parsed, resp = resp) :
Failed to parse or empty query results for http://fishbase.ropensci.org/synonyms?SynSpecies=%20hians&SynGenus=Ablennes&limit=50&fields=SynGenus%2CSynSpecies%2CValid%2CMisspelling%2CColStatus%2CSynonymy%2CCombination%2CSpecCode%2CSynCode%2CCoL_ID%2CTSN%2CWoRMS_ID
3: No match found for species 'Ablennes hians'
Re: ecology() vs species_info(), every species FishBase knows about has an entry in the species_info() table (it's somewhat analogous to the summary/home-page you see for the species on the website); but not every species has an entry in the ecology() table -- for some species FishBase just doesn't have trophic ecology information. On the website you'll see a list of tables on every summary page that is called "additional information", and all missing tables are just grayed out.
ecology() is pretty good; other tables are even more sparse.
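To find out which species account for a gap like 667 versus 628, the SpecCodes of the two results can be compared directly. The sketch below uses small hand-made stand-ins for the two tables: which codes appear in the toy ecology table is invented purely for illustration, and FoodTroph is left as NA so that no values are made up.

```r
# Stand-in for species_info() output: one row per validated species.
info <- data.frame(SpecCode = c(972, 6531, 5526, 94, 93))

# Stand-in for ecology() output: only some species have a row at all.
# A row can exist and still hold NA in every field.
eco <- data.frame(
  SpecCode = c(972, 5526, 93),
  FoodTroph = NA_real_,
  in_ecology = TRUE
)

# Species with a species_info() row but no ecology() row.
missing_codes <- setdiff(info$SpecCode, eco$SpecCode)
print(missing_codes)

# A left join keeps every species; in_ecology is NA where no row exists.
joined <- merge(info, eco, by = "SpecCode", all.x = TRUE)
print(joined)
```

The row count of the second table depends on whether a record exists, not on whether its fields are filled in, which is why rows with NA in every field can still appear.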
Re: the connection being broken -- that's not actually anything to do with the recent rfishbase commit. As I mentioned, this version is still in development, and the failure was due to me restarting the web server that provides the FishBase API (hence the "connection refused" error -- I know it's pretty opaque, automated errors always are). You can use the functions heartbeat() or ping() to see if the server is alive. The server will be more stable once we have a stable release for CRAN, but for now we're still fiddling with it to add more features and endpoints, etc.
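While the server is being restarted, a script can simply retry a failed call a few times. The following is a generic base R sketch rather than anything provided by rfishbase: `with_retry` is a hypothetical helper, and `flaky` is a stand-in for a server call that fails twice before succeeding.

```r
# Hypothetical helper: call f() up to `tries` times, pausing `wait` seconds
# between attempts, and rethrow the last error if every attempt fails.
with_retry <- function(f, tries = 3, wait = 5) {
  last_error <- NULL
  for (i in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (i < tries) Sys.sleep(wait)
  }
  stop(last_error)
}

# Stand-in for a server call that fails twice and then succeeds.
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("Connection refused")
  "ok"
}

out <- with_retry(flaky, tries = 3, wait = 0)
print(out)
```

With rfishbase loaded, the same wrapper could presumably be used as with_retry(function() validate_names(species)), though a longer wait between attempts would make more sense against a real server.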
p.s. the server should be back up now ;-)
Oh that makes sense, thanks!
@sartonic Sounds like this is all resolved now, so I'll close this issue. Feel free to open any new issues if there's anything you'd like to see added or changed!