schwilklab / taxon-name-utils

Code and data for plant name synonym expansion and name matching
MIT License

What name cleaning was done on GBIF names list? #5

Closed dschwilk closed 10 years ago

dschwilk commented 10 years ago

@willpearse: I have a question about the gbif-occurrences-names.txt file. I see that it contains only strictly binomial names (with two exceptions using "cv."; see #4). Is this really what the GBIF "backbone taxonomy" list looks like? Are these exact matches to keys? I ask because of this snippet from an email from Jan Legind regarding the recent query from the plants and fire group:

"For instance, the DB has the species name Abelia chinensis, but not Abelia chinensis var. ionandra. However, using the taxonomy backbone lookup we can resolve the name to 'Abelia chinensis ionandra R. Br.' which appears in the database."

Is he saying that Abelia chinensis ionandra R. Br. should be in the list? It seems so. But that name and others of that form are not in the names file you pushed.
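For reference, here is a minimal sketch of the kind of backbone lookup Jan seems to be describing. It assumes the current GBIF v1 `species/match` endpoint with its `name` and `kingdom` parameters (the thread itself dates from the v0.9 API), and the helper names `build_match_url` and `match_name` are made up for illustration. It is not code from this repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the GBIF v1 name-matching endpoint (the thread predates v1).
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name, kingdom="Plantae"):
    """Build the query URL for a single name lookup (hypothetical helper)."""
    return GBIF_MATCH_URL + "?" + urlencode({"name": name, "kingdom": kingdom})


def match_name(name, kingdom="Plantae", timeout=30):
    """Fetch the backbone match for `name` and return the decoded JSON dict.

    Needs network access; the fields in the response (for example
    `scientificName` or `matchType`) should be checked against the API docs.
    """
    with urlopen(build_match_url(name, kingdom), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only the URL is built here, so nothing is sent over the network.
url = build_match_url("Abelia chinensis var. ionandra")
```

Calling `match_name("Abelia chinensis var. ionandra")` would show what the backbone resolves the name to, which is one way to check whether names of that form belong in the list.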

dschwilk commented 10 years ago

Before I move any further, I want to get this sorted and make sure we have a good and complete GBIF names list. I was about to look into this before we joined forces, so this is great, but I don't know much about it! I've been reading the GBIF developer blog and poking around the web APIs.

willpearse commented 10 years ago

This is perhaps not the best way to share this, but nothing more complex than this:

    import collections

    counts = collections.Counter()
    with open("/home/will/Downloads/occurrence.txt", "r") as handle:
        for line in handle:
            line = line.split("\t")
            try:
                # Tally the value in column 223 of each tab-separated record
                counts[line[223]] += 1
            except IndexError:
                print("!!!Badly formed line at GBIFID %s, ignoring..." % line[0])

The last line is 'badly formed' but I hardly think that's an issue...

...also kill the empty species (?...?)

del counts[""]

...i.e., very little cleaning beyond what is already in the raw data dump of the occurrences dataset (which is many GB). For our purposes, we were more interested in working with the dump of the occurrences because we're counting the number of GBIF records (long story).
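A slightly more defensive version of the same count might look like the sketch below. It assumes the dump has a header row and that the column of interest is called `species`; both are assumptions, since the snippet above only uses the positional index 223. The function name `count_names` is made up for illustration.

```python
import collections
import csv
import io


def count_names(handle, column="species"):
    """Count occurrences per name in a tab-separated dump with a header row.

    `column` is an assumed header name; adjust it to match the real file.
    Returns (counts, number_of_short_rows_skipped).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    idx = header.index(column)  # raises ValueError if the column is absent
    counts = collections.Counter()
    skipped = 0
    for row in reader:
        if len(row) <= idx:  # badly formed (short) line
            skipped += 1
            continue
        counts[row[idx]] += 1
    del counts[""]  # drop records with an empty name; safe if none exist
    return counts, skipped


# Tiny in-memory sample standing in for the real multi-GB file.
sample = "gbifID\tspecies\n1\tAbelia chinensis\n2\tAbelia chinensis\n3\t\n4\n"
counts, skipped = count_names(io.StringIO(sample))
```

Looking the column up by name avoids depending on the column order of a particular download, at the cost of assuming the header is present.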

Does this help?

W

On 06/03/2014 03:48 PM, Dylan Schwilk wrote:

Reopened #5 https://github.com/schwilklab/taxon-name-utils/issues/5.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/5#event-127664754.

dschwilk commented 10 years ago

Ok, I guess I am just making sure that that data dump is what we want. How did you get that data?

willpearse commented 10 years ago

I downloaded it from GBIF - I just selected all the Plantae records and clicked download. I don't think you need to be logged in for: http://api.gbif.org/v0.9/occurrence/download/request/0003999-140429114108248.zip

willpearse commented 10 years ago

...and you can see the search list here:

http://www.gbif.org/occurrence/search?TAXON_KEY=6

dschwilk commented 10 years ago

D'oh! Ok, thanks. Sorry I'm an idiot today. I was just confused by Jan's comment. I think the problem is that they have also implemented some name-matching tools (some available through the web API). But we want to bypass that, and that looks like what we are doing, so I think all is good. I've been looking at other distance measures, and Jaro-Winkler distance looks like it can work on specific epithets (it is useless on full names since it overweights the first part of a string). I'm going to run the full matching script on the tanknames -> gbif names tonight. Should be done by morning.
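To illustrate the point about Jaro-Winkler, here is a minimal pure-Python sketch of the measure, written as a similarity (1.0 means identical). The prefix bonus rewards up to four shared leading characters, which is why two full names in the same genus score high almost regardless of the epithet. Comparing only the epithets, and only when the genera agree, is an assumption made for this example and not necessarily what the matching script in this repository does. All function names are hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to `max_prefix` chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def epithet_similarity(name1, name2):
    """Compare specific epithets of two binomials; 0.0 if the genera differ."""
    genus1, epithet1 = name1.split()[:2]
    genus2, epithet2 = name2.split()[:2]
    if genus1 != genus2:
        return 0.0
    return jaro_winkler(epithet1, epithet2)
```

For example, `epithet_similarity("Abelia chinensis", "Abelia chinenssis")` scores a likely misspelling of the epithet, while a genus mismatch is rejected outright instead of being judged by string similarity.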

willpearse commented 10 years ago

No worries whatsoever! I probably should have been more clear to begin with...!

That's fantastic news, thanks for pushing on with this. I'm sorry I've been stuck under an admin pile today; tomorrow, tomorrow!...

W