Proposed outline of steps for matching database data to a canonical names list

schwilklab / taxon-name-utils

Code and data for plant name synonym expansion and name matching

MIT License

4 stars 0 forks source link

Proposed outline of steps for matching database data to a canonical names list #2

Closed dschwilk closed 7 years ago

dschwilk commented 10 years ago

This might be all obvious, but it helps me to step through things. This is just the overview and does not get into the details and decisions involved in the approximate matching itself (step 2.A).

The problem and nomenclature

One wants to line up trait or distribution data with a set of taxa under study. But we must deal with name synonymy and spelling differences across databases.

The "canonical names list" is the list of taxa for which we want trait data. For example, this list might be all of the taxa in the Tank plant phylogeny. Let's call this list "clist". We have a trait or location database (eg GBIF) and we want to obtain data from that database for every taxon in clist. Let's call the list of keys (taxon names) in that database "dlist".

Note: I am using the term "list" to mean an ordered array of character strings. So this could be a vector of strings, a file with one name per line, etc.

Proposed steps

We can expand clist by including every synonym (according to some synonym table). synonymize.py -a expand does this for The Plant List synonym lookup. I'll call the expanded list "elist".
We want a data record (or set of records) for each name in elist. Many will return no result. But first we need to deal with misspellings and database errors by fuzzy matching, so for each name in dlist, find the best match in elist (or no match). Save a lookup table from dlist to elist names so that each database record can have its dlist name replaced with the elist name. See fuzzy_match_name_list in fuzzy_match.py for an implementation of this. And gbif_lookup.py for an example run.
Now, assume after (2) we have a flat file database in which the keys are the names in elist. All of the records that match the synonyms of a given canonical name can be merged together using synonymize.py -a merge. If the result of (2) above is a flat file, then simply replace the name column (elist) with the results of synonymize.py -a merge -c clist_file elist_file. The result will be a list of the same length as elist but with repeated names (same number of unique names as in clist).

willpearse commented 10 years ago

This is very thorough, thanks for this.

I've already written something that does quite a lot of this; the thing that's slowing it down is how long it takes to fuzzy-search a big set of names. I think I know how to speed it up (split everything into genera, search the genera (fuzzily) then search within genera) but I haven't gotten around to writing it.

I'll do something in the next few days (perhaps even tonight), then push it up in a new branch. I'm not sure how much of synonymize.py I'll re-use, but I'll re-structure so everything is at least in your style for consistency.

On 05/30/2014 04:27 PM, Dylan Schwilk wrote:

This might be all obvious, but it helps me to step through things. This is just the overview and does not get into the details and decisions involved in the approximate matching itself (step 2.A).
The problem and nomenclature
One wants to line up trait or distribution data with a set of taxa under study. But we must deal with name synonymy and spelling differences across databases.

The "canonical names list" is the list of taxa for which we want trait data. For example, this list might be all of the taxa in the Tank plant phylogeny http://datadryad.org/resource/doi:10.5061/dryad.63q27/3. Let's call this list "clist". We have a trait or location database (eg GBIF) and we want to obtain data from that database for every taxon in clist. Let's call the list of keys (taxon names) in that database "dlist".

Note: I am using the term "list" to mean an ordered array of character strings. So this could be a vector of strings, a file with one name per line, etc.
Proposed steps
1.
We can expand clist by including every synonym (according to some
synonym table). |synonymize.py -a expand| does this for The Plant
List <http://www.theplantlist.org/> synonym lookup. I'll call the
expanded list "elist".
2.
We want a data record (or set of records) for each name in elist.
Many will return no result. But first we need to deal with
misspellings and database errors by fuzzy matching, so

A. For each name in elist, find 0-n approximate matches in dlist.
It is important that each name in dlist is matched only once! So
this may involve some checking depending on the fuzzy matching.
B. Store the indices or keys of each dlist match as a list
associated with each name in elist (if one has access to the
actual database, then store the data itself, if one is creating a
name list to send out as a query, store the name/index/pointer to
the dlist name.
3.
Now, assume we have a flat file database in which the keys are the
names in elist. All of the records that match the synonyms of a
given canonical name can be merged together using |synonymize.py
-a merge|. If the result of 2.B above is a flat file, then simply
replace the name column (elist) with the results of |synonymize.py
-a merge -c clist_file elist_file|. The result will be a list of
the same length as elist but with repeated names (same number of
unique names as in clist).
4.
If we are preparing a query (eg for GBIF), then reduce the
resulting data down to only rows for which there was a record in
dlist and send out those dlist names. Then lookup the data
according to the lookup table created in step 3.
— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2.

dschwilk commented 10 years ago

Yes, 2 step approach is probably best (genera and then within genus). I'm also planning to work on this.

What do you mean by: "I'm not sure how much of synonymize.py I'll re-use"? I'm ok with radical refactoring as well, I'm not too attached -- but I have been thinking of the TPL synonym step as fundamentally distinct -- something that is only applied to the canonical names. It does not need to be done to the other side of the lookup as that would be redundant and mess up reversibility. I'm sure I'll understand more once I see the approximate matching stuff you push.

One note regarding speed:

Each item in dlist should only be used once, so dlist can shrink.

First, exact matches are pulled. This reduces the number of unmatched names in dlist
Then dlist shrinks as fuzzy matches to elist items hit.

Of course, may be negligible in really big dlists (like GBIF)!

Cheers,

Dylan

On 06/02/2014 03:50 PM, Will Pearse wrote:

This is very thorough, thanks for this.

I've already written something that does quite a lot of this; the thing that's slowing it down is how long it takes to fuzzy-search a big set of names. I think I know how to speed it up (split everything into genera, search the genera (fuzzily) then search within genera) but I haven't gotten around to writing it.

I'll do something in the next few days (perhaps even tonight), then push it up in a new branch. I'm not sure how much of synonymize.py I'll re-use, but I'll re-structure so everything is at least in your style for consistency.

On 05/30/2014 04:27 PM, Dylan Schwilk wrote:

This might be all obvious, but it helps me to step through things. This is just the overview and does not get into the details and decisions involved in the approximate matching itself (step 2.A).

The problem and nomenclature

One wants to line up trait or distribution data with a set of taxa under study. But we must deal with name synonymy and spelling differences across databases.

The "canonical names list" is the list of taxa for which we want trait data. For example, this list might be all of the taxa in the Tank plant phylogeny http://datadryad.org/resource/doi:10.5061/dryad.63q27/3. Let's call this list "clist". We have a trait or location database (eg GBIF) and we want to obtain data from that database for every taxon in clist. Let's call the list of keys (taxon names) in that database "dlist".

Note: I am using the term "list" to mean an ordered array of character strings. So this could be a vector of strings, a file with one name per line, etc.

Proposed steps

1.

We can expand clist by including every synonym (according to some synonym table). |synonymize.py -a expand| does this for The Plant List http://www.theplantlist.org/ synonym lookup. I'll call the expanded list "elist".

2.

We want a data record (or set of records) for each name in elist. Many will return no result. But first we need to deal with misspellings and database errors by fuzzy matching, so

A. For each name in elist, find 0-n approximate matches in dlist. It is important that each name in dlist is matched only once! So this may involve some checking depending on the fuzzy matching. B. Store the indices or keys of each dlist match as a list associated with each name in elist (if one has access to the actual database, then store the data itself, if one is creating a name list to send out as a query, store the name/index/pointer to the dlist name.

3.

Now, assume we have a flat file database in which the keys are the names in elist. All of the records that match the synonyms of a given canonical name can be merged together using |synonymize.py -a merge|. If the result of 2.B above is a flat file, then simply replace the name column (elist) with the results of |synonymize.py -a merge -c clist_file elist_file|. The result will be a list of the same length as elist but with repeated names (same number of unique names as in clist).

4.

If we are preparing a query (eg for GBIF), then reduce the resulting data down to only rows for which there was a record in dlist and send out those dlist names. Then lookup the data according to the lookup table created in step 3.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-44889331.

willpearse commented 10 years ago

Sorry, on a re-read it looks like I'll use the output from synonymize.py

so there's no need for concern. Sorry!

I definitely agree that pulling out exact matches first is a good way to go; I also see no harm in pulling out things once we've found them. Maybe for GBIF it won't help too much, but it definitely won't hurt :D

Cheers,

Will

On 06/02/2014 04:10 PM, Dylan Schwilk wrote:

Yes, 2 step approach is probably best (genera and then within genus). I'm also planning to work on this.

What do you mean by: "I'm not sure how much of synonymize.py I'll re-use"? I'm ok with radical refactoring as well, I'm not too attached -- but I have been thinking of the TPL synonym step as fundamentally distinct -- something that is only applied to the canonical names. It does not need to be done to the other side of the lookup as that would be redundant and mess up reversibility. I'm sure I'll understand more once I see the approximate matching stuff you push.

One note regarding speed:

Each item in dlist should only be used once, so dlist can shrink.

First, exact matches are pulled. This reduces the number of unmatched names in dlist

Then dlist shrinks as fuzzy matches to elist items hit.

Of course, may be negligible in really big dlists (like GBIF)!

Cheers,

Dylan

On 06/02/2014 03:50 PM, Will Pearse wrote:

This is very thorough, thanks for this.

I've already written something that does quite a lot of this; the thing that's slowing it down is how long it takes to fuzzy-search a big set of names. I think I know how to speed it up (split everything into genera, search the genera (fuzzily) then search within genera) but I haven't gotten around to writing it.

I'll do something in the next few days (perhaps even tonight), then push it up in a new branch. I'm not sure how much of synonymize.py I'll re-use, but I'll re-structure so everything is at least in your style for consistency.

On 05/30/2014 04:27 PM, Dylan Schwilk wrote:

This might be all obvious, but it helps me to step through things. This is just the overview and does not get into the details and decisions involved in the approximate matching itself (step 2.A).

The problem and nomenclature

One wants to line up trait or distribution data with a set of taxa under study. But we must deal with name synonymy and spelling differences across databases.

The "canonical names list" is the list of taxa for which we want trait data. For example, this list might be all of the taxa in the Tank plant phylogeny http://datadryad.org/resource/doi:10.5061/dryad.63q27/3. Let's call this list "clist". We have a trait or location database (eg GBIF) and we want to obtain data from that database for every taxon in clist. Let's call the list of keys (taxon names) in that database "dlist".

Note: I am using the term "list" to mean an ordered array of character strings. So this could be a vector of strings, a file with one name per line, etc.

Proposed steps

1.

We can expand clist by including every synonym (according to some synonym table). |synonymize.py -a expand| does this for The Plant List http://www.theplantlist.org/ synonym lookup. I'll call the expanded list "elist".

2.

We want a data record (or set of records) for each name in elist. Many will return no result. But first we need to deal with misspellings and database errors by fuzzy matching, so

A. For each name in elist, find 0-n approximate matches in dlist. It is important that each name in dlist is matched only once! So this may involve some checking depending on the fuzzy matching. B. Store the indices or keys of each dlist match as a list associated with each name in elist (if one has access to the actual database, then store the data itself, if one is creating a name list to send out as a query, store the name/index/pointer to the dlist name.

3.

Now, assume we have a flat file database in which the keys are the names in elist. All of the records that match the synonyms of a given canonical name can be merged together using |synonymize.py -a merge|. If the result of 2.B above is a flat file, then simply replace the name column (elist) with the results of |synonymize.py -a merge -c clist_file elist_file|. The result will be a list of the same length as elist but with repeated names (same number of unique names as in clist).

4.

If we are preparing a query (eg for GBIF), then reduce the resulting data down to only rows for which there was a record in dlist and send out those dlist names. Then lookup the data according to the lookup table created in step 3.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-44889331.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-44891710.

dschwilk commented 10 years ago

Ok, so I just merged in an implementation of the matching algorithm, see f17076d. It runs pretty fast and seems to work well.

willpearse commented 10 years ago

Fantastic! Thank you! I've got a GenBank dataset too - this seems fast, I might try that if you're interested? On 5 Jun 2014 17:06, "Dylan Schwilk" notifications@github.com wrote:

Ok, so I just merged in an implementation of the matching algorithm, see f17076d https://github.com/schwilklab/taxon-name-utils/commit/f17076d. It runs pretty fast and seems to work well.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45281258 .

dschwilk commented 10 years ago

Hi Will.

Of course!

So I think this is basically there. The expanded tanktree names -> gbif matching takes about 40 min on my computer. It would be faster using pypy, but using the Levenshtein module for jaro-winkler messes up that option.

Outstanding considerations:

Tuning. My current hard coded choices: Candidate matches are all within 2 edit (levenshtein) distances. But among those, the highest Jaro-Winkler score is chosen with a threshold of 0.96. In other words, if the highest jw is <= 0.96, then there is "no match". JW similarity weights matches early in the string higher which is correct for the sorts of errors we see. I've done some casual checking of matches against the plant list and this value seems to be eliminate most false positives, but I have little idea of the false neg rate and need t look at the unmatched names in more detail (eg are there a bunch of hyphenated names in the unmatched list?)
How to deal with three part names (var., subsp. f.) The plant list has these and that can be important for synonyms --- we are already ignoring authorities (and we must in TPL expansion because that data Beth scraped has that removed already). But I looked a bit at the raw GBIF download of all plantae and it does seem that there are three part names. So I guess we have three options:

option 1: reduce every name to 2-part by ignoring the var or subsp in both data sets. I have run the expanded tanknames -> gbif lookup this way. This means that if TPL synonyms say that Agenus aspecies var. avar is a synonym of Anothergen anothersp then there is an implied synonym that Agenus aspecies is a synonym of Anothergen anothersp. This will increase hits, but may imply incorrect synonyms.

option2: Omit all three part synonyms in TPL only. I've also run the code this way and that is the way gbif_lookup.py is setup now see line 20. This may avoid false positives but will result in fewer hits.

option 3. Actually try matching on three part names when they exist. For gbif, this would require a new extraction of all the unique names from the big plantae download. How big is that file extracted (I can see it is greater than 75G!)? How long did that take to pull the names and what tricks were required? I'm assuming the python zipfile module provides an iterator?

Opinions?

-Dylan

On 06/05/2014 06:03 PM, Will Pearse wrote:

Fantastic! Thank you! I've got a GenBank dataset too - this seems fast, I might try that if you're interested? On 5 Jun 2014 17:06, "Dylan Schwilk" notifications@github.com wrote:

Ok, so I just merged in an implementation of the matching algorithm, see f17076d https://github.com/schwilklab/taxon-name-utils/commit/f17076d. It runs pretty fast and seems to work well.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45281258 .

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45286112.

dschwilk commented 10 years ago

I guess if we do option 1, we can always filter some gbif records out based on a second pass looking at the gbif file "scientificname" But it does seem inefficient.

dschwilk commented 10 years ago

Update: It looks like relaxing the Jaro-Winkler threshold is helpful. I dropped it to >= 0.95 and got 375 more hits which mostly look to be correct. But there are some false positives ("spicata" / "spica" , "alpinus"/"alpigenus" which shoulkd be separate. But so it goes. We are getting well into noise territory here for these big datasets.

My suggestion:

do a looser search and save the jw distnaces
then cull matches in which both names are in the TPL "Accepted" list and the jw similarity is lower than some high threshold (eg 0.97). Some of the double accepteds in tpl are clearly misspellings eg Astragalus campylorrhynchus vs A. campylorhynchus --- both are listed as accepted. But the jw similarity is 0.99 so we keep.

This is probably all best done in R after all the matching.

willpearse commented 10 years ago

...you make a very good point. For anything of this size, I think there's an awful lot to be said for having three sets: "confident", "possible", "no idea". If we make "possible" sufficiently small (~1000 or so) then it's quite plausible that someone can just check them all!

On 06/06/2014 11:55 AM, Dylan Schwilk wrote:

Update: It looks like relaxing the Jaro-Winkler threshold is helpful. I dropped it to >= 0.95 and got 375 more hits which mostly look to be correct. But there are some false positives ("spicata" / "spica" , "alpinus"/"alpigenus" which shoulkd be separate. But so it goes. We are getting well into noise territory here for these big datasets.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45359448.

willpearse commented 10 years ago

Hello,

Cool. My feelings are:

I think altering the hard-coded tuning is a good idea (2-->5, perhaps) for what I called before the "uncertain" category. That way we have some things we're really sure of, and a list of potentials for not-so-sure. I also think doing something special for hyphens is a good idea, because in my recent experience people tend to mess those up a lot and it causes problems. If we're not going to do something specific for latin mis-matches (genders, etc., which would be hard) then I think hyphens are a fair first step.
I think we should try matching on three-part names (options 3). I didn't do anything clever with the raw download and it still loaded in a few minutes; once it's in it's not too bad to manipulate as long as you use the right container for it.

How does that sound to you?

On 06/06/2014 09:47 AM, Dylan Schwilk wrote:

Hi Will.

Of course!

So I think this is basically there. The expanded tanktree names -> gbif matching takes about 40 min on my computer. It would be faster using pypy, but using the Levenshtein module for jaro-winkler messes up that option.

Outstanding considerations:

Tuning. My current hard coded choices: Candidate matches are all within 2 edit (levenshtein) distances. But among those, the highest Jaro-Winkler score is chosen with a threshold of 0.96. In other words, if the highest jw is <= 0.96, then there is "no match". JW similarity weights matches early in the string higher which is correct for the sorts of errors we see. I've done some casual checking of matches against the plant list and this value seems to be eliminate most false positives, but I have little idea of the false neg rate and need t look at the unmatched names in more detail (eg are there a bunch of hyphenated names in the unmatched list?)

How to deal with three part names (var., subsp. f.) The plant list has these and that can be important for synonyms --- we are already ignoring authorities (and we must in TPL expansion because that data Beth scraped has that removed already). But I looked a bit at the raw GBIF download of all plantae and it does seem that there are three part names. So I guess we have three options:

option 1: reduce every name to 2-part by ignoring the var or subsp in both data sets. I have run the expanded tanknames -> gbif lookup this way. This means that if TPL synonyms say that Agenus aspecies var. avar is a synonym of Anothergen anothersp then there is an implied synonym that Agenus aspecies is a synonym of Anothergen anothersp. This will increase hits, but may imply incorrect synonyms.

option2: Omit all three part synonyms in TPL only. I've also run the code this way and that is the way gbif_lookup.py is setup now see line 20. This may avoid false positives but will result in fewer hits.

option 3. Actually try matching on three part names when they exist. For gbif, this would require a new extraction of all the unique names from the big plantae download. How big is that file extracted (I can see it is greater than 75G!)? How long did that take to pull the names and what tricks were required? I'm assuming the python zipfile module provides an iterator?

Opinions?

-Dylan

On 06/05/2014 06:03 PM, Will Pearse wrote:

Fantastic! Thank you! I've got a GenBank dataset too - this seems fast, I might try that if you're interested? On 5 Jun 2014 17:06, "Dylan Schwilk" notifications@github.com wrote:

Ok, so I just merged in an implementation of the matching algorithm, see f17076d https://github.com/schwilklab/taxon-name-utils/commit/f17076d. It runs pretty fast and seems to work well.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45281258 .

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45286112.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45344807.

dschwilk commented 10 years ago

... Now we are hitting the limits of pure fuzzy matching without any domain specific intelligence (other than splitting into genus/species and the jw prefix weighting). One could get fancy and build some knowledge of latin nomenclature and/or soundex in.

In any case, the best strategy is use fuzzy_match.py to get a list and plan to overmatch. For my data, the number of non exact matches in the output is only in the 2-3K range (Here is where synonymize.py helps, as the expanded list is already accounting for some misspellings that are known synonyms)

I just ran the gbif->tanknames matching with a jw cut off of 0.94 which is right where "sylvestris" can still match with "silvestris" (usually weighting prefixes helps, but not in that case). That leaves about exactly 1000 with 0.94 <= jw < 0.97. Which is the "maybe area", (I think!)

For now, I tried to see what else I could do automatically:

I wrote some R code to identify "suspected bad matches" : those in which both names are TPL "accepted" names. But I don't want to throw out all of those because there are clearly some errors in TPL (eg "Haplocarpha rueppelii" and "Haplocarpha rueppellii" both being accepted names, lots of examples in TPL).

So for now, I consider it a "bad match" IF (both names are TPL "accepted" AND (genera don't match OR JW score < 0.96) ). That criterion eliminated about 160 matches.

But it might be better to draw the line further to one side or the other and then do some more manual checking because all of these automated methods still have false positives and negatives. And in many cases there is no easy even manual way to verify.

And the dangerous side of this: I want to go learn about all of these cute plants whose names I'm reading! NO TIME.

On 06/06/2014 03:00 PM, Will Pearse wrote:

...you make a very good point. For anything of this size, I think there's an awful lot to be said for having three sets: "confident", "possible", "no idea". If we make "possible" sufficiently small (~1000 or so) then it's quite plausible that someone can just check them all!

W

On 06/06/2014 11:55 AM, Dylan Schwilk wrote:

Update: It looks like relaxing the Jaro-Winkler threshold is helpful. I dropped it to >= 0.95 and got 375 more hits which mostly look to be correct. But there are some false positives ("spicata" / "spica" , "alpinus"/"alpigenus" which shoulkd be separate. But so it goes. We are getting well into noise territory here for these big datasets.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45359448.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45378200.

dschwilk commented 10 years ago

Oops, sorry passing emails.

regarding your ideas:

I think altering the hard-coded tuning is a good idea (2-->5, perhaps) for what I called before the "uncertain" category. That way we have some things we're really sure of, and a list of potentials for not-so-sure. I also think doing something special for hyphens is a good idea, because in my recent experience people tend to mess those up a lot and it causes problems. If we're not going to do something specific for latin mis-matches (genders, etc., which would be hard) then I think hyphens are a fair first step.

Hyphens: Yups, definitely. I just have not done it yet. Does need to be done. The "unmatched" gbif list has a lot of hyphen names in it. Current code handles the missing hyphen fine but will miss a match if one name drops the last part.

levenshtein distnaces. Hm, 5 edits is a LOT and that will make a big automaton! But we could switch to the other trie based memoizing method which may scale better for larger edit distances. Can you give some examples of matches where edit distance of greater than 3 is needed?

Jaro Winkler similarities: down to 0.9 is porbably doable, as long as we plan on lots of culling afterwards.

I think we should try matching on three-part names (options 3). I didn't do anything clever with the raw download and it still loaded in a few minutes; once it's in it's not too bad to manipulate as long as you use the right container for it.

I have code ready to pull other fields as well, each line read is pretty fast but I don't even know how many records are in it.

What do you suggest for design of the fuzzy match algorithm then once we open the possibility of more than two name parts? And if the third part can't be matched is that a non-match or do we fallback to 2 part and potentially overmatch?

How does that sound to you?

W

On 06/06/2014 09:47 AM, Dylan Schwilk wrote:

Hi Will.

Of course!

So I think this is basically there. The expanded tanktree names -> gbif matching takes about 40 min on my computer. It would be faster using pypy, but using the Levenshtein module for jaro-winkler messes up that option.

Outstanding considerations:

Tuning. My current hard coded choices: Candidate matches are all within 2 edit (levenshtein) distances. But among those, the highest Jaro-Winkler score is chosen with a threshold of 0.96. In other words, if the highest jw is <= 0.96, then there is "no match". JW similarity weights matches early in the string higher which is correct for the sorts of errors we see. I've done some casual checking of matches against the plant list and this value seems to be eliminate most false positives, but I have little idea of the false neg rate and need t look at the unmatched names in more detail (eg are there a bunch of hyphenated names in the unmatched list?)

How to deal with three part names (var., subsp. f.) The plant list has these and that can be important for synonyms --- we are already ignoring authorities (and we must in TPL expansion because that data Beth scraped has that removed already). But I looked a bit at the raw GBIF download of all plantae and it does seem that there are three part names. So I guess we have three options:

option 1: reduce every name to 2-part by ignoring the var or subsp in both data sets. I have run the expanded tanknames -> gbif lookup this way. This means that if TPL synonyms say that Agenus aspecies var. avar is a synonym of Anothergen anothersp then there is an implied synonym that Agenus aspecies is a synonym of Anothergen anothersp. This will increase hits, but may imply incorrect synonyms.

option2: Omit all three part synonyms in TPL only. I've also run the code this way and that is the way gbif_lookup.py is setup now [see line

20](https://github.com/schwilklab/taxon-name-utils/blob/master/scripts/gbif_lookup.py#L20). This may avoid false positives but will result in fewer hits.

option 3. Actually try matching on three part names when they exist. For gbif, this would require a new extraction of all the unique names from the big plantae download. How big is that file extracted (I can see it is greater than 75G!)? How long did that take to pull the names and what tricks were required? I'm assuming the python zipfile module provides an iterator?

Opinions?

-Dylan

On 06/05/2014 06:03 PM, Will Pearse wrote:

Fantastic! Thank you! I've got a GenBank dataset too - this seems fast, I might try that if you're interested? On 5 Jun 2014 17:06, "Dylan Schwilk" notifications@github.com wrote:

Ok, so I just merged in an implementation of the matching algorithm, see f17076d https://github.com/schwilklab/taxon-name-utils/commit/f17076d. It runs pretty fast and seems to work well.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45281258 .

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45286112.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45344807.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45381028.

AmyZanne commented 10 years ago

This is super exciting! Glad to see you have been able to make such good progress. I think it's important to recognize we may end up with some slop (good names that get thrown out or bad names that get accepted). I am happy to help with hand scrubbing if the list is modest but expect that this may be untenable when cleaning up such a huge list of names. In case it's helpful, Will C spent a fair bit of time coming up with our cleaning routines, in conjunction with Ginger Jui. I copy below what he wrote for the Nature paper.

And, I agree with Dylan. The names have a powerful siren call, making me want to figure out who they are, what they look like, where they grow...

Best, Amy

To bring species’ binomials to a common taxonomy among datasets, names were matched against accepted names in The Plant List ( http://www.theplantlist.org/). Any binomials not found in this list were matched against the International Plant Names Index (IPNI; http://www.ipni.org/) and Tropicos (http://www.tropicos.org/); potential synonymy in binomials arising from the three lists was investigated using The Plant List tools. Binomials remaining unmatched were compared first to The Plant List and next to IPNI with an approximate matching algorithm. For binomials with accepted generic names but unmatched binomials, we searched for specific epithet misspellings within the genus followed by a broadened search to all plants to check if the generic name was incorrect. We then searched for unmatched genera. For this list of binomials with unmatched genera, we searched the full list of genera. This led to many erroneous matches. We found that including specific epithet in the approximate matching algorithm with the full list of binomials improved determination of the correct genus. With the steps above and a strict approximate grepping-matching threshold (roughly corresponding to one letter substitution or a gender error in the specific epithet) and when there was only one match returned, the false positive rate was low (<1%) and could be automated. When the threshold was relaxed to look for names that still did not match, the false positive rate rose to unacceptable levels. For these species and for those that returned multiple matches, we examined and made potential substitutions on a case-by-case basis.

On Fri, Jun 6, 2014 at 3:59 PM, Dylan Schwilk notifications@github.com wrote:

Oops, sorry passing emails.

regarding your ideas:

I think altering the hard-coded tuning is a good idea (2-->5, perhaps) for what I called before the "uncertain" category. That way we have some things we're really sure of, and a list of potentials for not-so-sure. I also think doing something special for hyphens is a good idea, because in my recent experience people tend to mess those up a lot and it causes problems. If we're not going to do something specific for latin mis-matches (genders, etc., which would be hard) then I think hyphens are a fair first step.

Hyphens: Yups, definitely. I just have not done it yet. Does need to be done. The "unmatched" gbif list has a lot of hyphen names in it. Current code handles the missing hyphen fine but will miss a match if one name drops the last part.

levenshtein distnaces. Hm, 5 edits is a LOT and that will make a big automaton! But we could switch to the other trie based memoizing method which may scale better for larger edit distances. Can you give some examples of matches where edit distance of greater than 3 is needed?

Jaro Winkler similarities: down to 0.9 is porbably doable, as long as we plan on lots of culling afterwards.

I think we should try matching on three-part names (options 3). I didn't do anything clever with the raw download and it still loaded in a few minutes; once it's in it's not too bad to manipulate as long as you use the right container for it.

I have code ready to pull other fields as well, each line read is pretty fast but I don't even know how many records are in it.

What do you suggest for design of the fuzzy match algorithm then once we open the possibility of more than two name parts? And if the third part can't be matched is that a non-match or do we fallback to 2 part and potentially overmatch?

How does that sound to you?

W

On 06/06/2014 09:47 AM, Dylan Schwilk wrote:

Hi Will.

Of course!

So I think this is basically there. The expanded tanktree names -> gbif matching takes about 40 min on my computer. It would be faster using pypy, but using the Levenshtein module for jaro-winkler messes up that option.

Outstanding considerations:

Tuning. My current hard coded choices: Candidate matches are all within 2 edit (levenshtein) distances. But among those, the highest Jaro-Winkler score is chosen with a threshold of 0.96. In other words, if the highest jw is <= 0.96, then there is "no match". JW similarity weights matches early in the string higher which is correct for the sorts of errors we see. I've done some casual checking of matches against the plant list and this value seems to be eliminate most false positives, but I have little idea of the false neg rate and need t look at the unmatched names in more detail (eg are there a bunch of hyphenated names in the unmatched list?)

How to deal with three part names (var., subsp. f.) The plant list has these and that can be important for synonyms --- we are already ignoring authorities (and we must in TPL expansion because that data Beth scraped has that removed already). But I looked a bit at the raw GBIF download of all plantae and it does seem that there are three part names. So I guess we have three options:

option 1: reduce every name to 2-part by ignoring the var or subsp in both data sets. I have run the expanded tanknames -> gbif lookup this way. This means that if TPL synonyms say that Agenus aspecies var. avar is a synonym of Anothergen anothersp then there is an implied synonym that Agenus aspecies is a synonym of Anothergen anothersp. This will increase hits, but may imply incorrect synonyms.

option2: Omit all three part synonyms in TPL only. I've also run the code this way and that is the way gbif_lookup.py is setup now [see line

20]( https://github.com/schwilklab/taxon-name-utils/blob/master/scripts/gbif_lookup.py#L20 ). This may avoid false positives but will result in fewer hits.

option 3. Actually try matching on three part names when they exist. For gbif, this would require a new extraction of all the unique names from the big plantae download. How big is that file extracted (I can see it is greater than 75G!)? How long did that take to pull the names and what tricks were required? I'm assuming the python zipfile module provides an iterator?

Opinions?

-Dylan

On 06/05/2014 06:03 PM, Will Pearse wrote:

Fantastic! Thank you! I've got a GenBank dataset too - this seems fast, I might try that if you're interested? On 5 Jun 2014 17:06, "Dylan Schwilk" notifications@github.com wrote:

Ok, so I just merged in an implementation of the matching algorithm, see f17076d https://github.com/schwilklab/taxon-name-utils/commit/f17076d. It runs pretty fast and seems to work well.

— Reply to this email directly or view it on GitHub

< https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45281258

.

— Reply to this email directly or view it on GitHub

< https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45286112 .

— Reply to this email directly or view it on GitHub < https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45344807 .

— Reply to this email directly or view it on GitHub < https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45381028 .

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45383713 .

Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/

willpearse commented 10 years ago

I've replied in-line, sorry to be off over the weekend!

On 06/06/2014 03:59 PM, Dylan Schwilk wrote:

Oops, sorry passing emails.

regarding your ideas:

I think altering the hard-coded tuning is a good idea (2-->5, perhaps) for what I called before the "uncertain" category. That way we have some things we're really sure of, and a list of potentials for not-so-sure. I also think doing something special for hyphens is a good idea, because in my recent experience people tend to mess those up a lot and it causes problems. If we're not going to do something specific for latin mis-matches (genders, etc., which would be hard) then I think hyphens are a fair first step.

Hyphens: Yups, definitely. I just have not done it yet. Does need to be done. The "unmatched" gbif list has a lot of hyphen names in it. Current code handles the missing hyphen fine but will miss a match if one name drops the last part.

levenshtein distnaces. Hm, 5 edits is a LOT and that will make a big automaton! But we could switch to the other trie based memoizing method which may scale better for larger edit distances. Can you give some examples of matches where edit distance of greater than 3 is needed? Perhaps 5 was a bit over the top. In working with Amy before, I've sent her lists of 'the closest 5' or some-such, which she's then looked at and used her botanical judgement. I'm happy to be ignored on this one; it's not something I hold dear.

Jaro Winkler similarities: down to 0.9 is porbably doable, as long as we plan on lots of culling afterwards.

I think we should try matching on three-part names (options 3). I didn't do anything clever with the raw download and it still loaded in a few minutes; once it's in it's not too bad to manipulate as long as you use the right container for it.

I have code ready to pull other fields as well, each line read is pretty fast but I don't even know how many records are in it.

What do you suggest for design of the fuzzy match algorithm then once we open the possibility of more than two name parts? And if the third part can't be matched is that a non-match or do we fallback to 2 part and potentially overmatch? I think a fall-back to a two-name is an important step; particularly if we're working with a phylogeny since most phylogenies aren't going to have subspecies on them. I can't think of a smart way to implement that neatly other than a fall-through; something like if no three-way match then split and search on the first two.

How does that sound to you?

W

On 06/06/2014 09:47 AM, Dylan Schwilk wrote:

Hi Will.

Of course!

So I think this is basically there. The expanded tanktree names -> gbif matching takes about 40 min on my computer. It would be faster using pypy, but using the Levenshtein module for jaro-winkler messes up that option.

Outstanding considerations:

Tuning. My current hard coded choices: Candidate matches are all within 2 edit (levenshtein) distances. But among those, the highest Jaro-Winkler score is chosen with a threshold of 0.96. In other words, if the highest jw is <= 0.96, then there is "no match". JW similarity weights matches early in the string higher which is correct for the sorts of errors we see. I've done some casual checking of matches against the plant list and this value seems to be eliminate most false positives, but I have little idea of the false neg rate and need t look at the unmatched names in more detail (eg are there a bunch of hyphenated names in the unmatched list?)

How to deal with three part names (var., subsp. f.) The plant list has these and that can be important for synonyms --- we are already ignoring authorities (and we must in TPL expansion because that data Beth scraped has that removed already). But I looked a bit at the raw GBIF download of all plantae and it does seem that there are three part names. So I guess we have three options:

option 1: reduce every name to 2-part by ignoring the var or subsp in both data sets. I have run the expanded tanknames -> gbif lookup this way. This means that if TPL synonyms say that Agenus aspecies var. avar is a synonym of Anothergen anothersp then there is an implied synonym that Agenus aspecies is a synonym of Anothergen anothersp. This will increase hits, but may imply incorrect synonyms.

option2: Omit all three part synonyms in TPL only. I've also run the code this way and that is the way gbif_lookup.py is setup now [see line

20](https://github.com/schwilklab/taxon-name-utils/blob/master/scripts/gbif_lookup.py#L20).

This may avoid false positives but will result in fewer hits.

option 3. Actually try matching on three part names when they exist. For gbif, this would require a new extraction of all the unique names from the big plantae download. How big is that file extracted (I can see it is greater than 75G!)? How long did that take to pull the names and what tricks were required? I'm assuming the python zipfile module provides an iterator?

Opinions?

-Dylan

On 06/05/2014 06:03 PM, Will Pearse wrote:

Fantastic! Thank you! I've got a GenBank dataset too - this seems fast, I might try that if you're interested? On 5 Jun 2014 17:06, "Dylan Schwilk" notifications@github.com wrote:

Ok, so I just merged in an implementation of the matching algorithm, see f17076d https://github.com/schwilklab/taxon-name-utils/commit/f17076d. It runs pretty fast and seems to work well.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45281258

.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45286112.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45344807.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45381028.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45383713.

AmyZanne commented 10 years ago

It's good to know I get fed lists of 5.

I like the fall back to two. I know we cut out any subspecies for the Tank tree just leaving binomials so overmatching should be ok since that matches what we have elsewhere as long as we stick to that tree.

On Mon, Jun 9, 2014 at 9:19 AM, Will Pearse notifications@github.com wrote:

I've replied in-line, sorry to be off over the weekend!

W

On 06/06/2014 03:59 PM, Dylan Schwilk wrote:

Oops, sorry passing emails.

regarding your ideas:

I think altering the hard-coded tuning is a good idea (2-->5, perhaps) for what I called before the "uncertain" category. That way we have some things we're really sure of, and a list of potentials for not-so-sure. I also think doing something special for hyphens is a good idea, because in my recent experience people tend to mess those up a lot and it causes problems. If we're not going to do something specific for latin mis-matches (genders, etc., which would be hard) then I think hyphens are a fair first step.

Hyphens: Yups, definitely. I just have not done it yet. Does need to be done. The "unmatched" gbif list has a lot of hyphen names in it. Current code handles the missing hyphen fine but will miss a match if one name drops the last part.

levenshtein distnaces. Hm, 5 edits is a LOT and that will make a big automaton! But we could switch to the other trie based memoizing method which may scale better for larger edit distances. Can you give some examples of matches where edit distance of greater than 3 is needed? Perhaps 5 was a bit over the top. In working with Amy before, I've sent her lists of 'the closest 5' or some-such, which she's then looked at and used her botanical judgement. I'm happy to be ignored on this one; it's not something I hold dear.

Jaro Winkler similarities: down to 0.9 is porbably doable, as long as we plan on lots of culling afterwards.

I think we should try matching on three-part names (options 3). I didn't do anything clever with the raw download and it still loaded in a few minutes; once it's in it's not too bad to manipulate as long as you use the right container for it.

I have code ready to pull other fields as well, each line read is pretty fast but I don't even know how many records are in it.

What do you suggest for design of the fuzzy match algorithm then once we open the possibility of more than two name parts? And if the third part can't be matched is that a non-match or do we fallback to 2 part and potentially overmatch? I think a fall-back to a two-name is an important step; particularly if we're working with a phylogeny since most phylogenies aren't going to have subspecies on them. I can't think of a smart way to implement that neatly other than a fall-through; something like if no three-way match then split and search on the first two.

How does that sound to you?

W

On 06/06/2014 09:47 AM, Dylan Schwilk wrote:

Hi Will.

Of course!

So I think this is basically there. The expanded tanktree names -> gbif matching takes about 40 min on my computer. It would be faster using pypy, but using the Levenshtein module for jaro-winkler messes up that option.

Outstanding considerations:

Tuning. My current hard coded choices: Candidate matches are all within 2 edit (levenshtein) distances. But among those, the highest Jaro-Winkler score is chosen with a threshold of 0.96. In other words, if the highest jw is <= 0.96, then there is "no match". JW similarity weights matches early in the string higher which is correct for the sorts of errors we see. I've done some casual checking of matches against the plant list and this value seems to be eliminate most false positives, but I have little idea of the false neg rate and need t look at the unmatched names in more detail (eg are there a bunch of hyphenated names in the unmatched list?)

How to deal with three part names (var., subsp. f.) The plant list has these and that can be important for synonyms --- we are already ignoring authorities (and we must in TPL expansion because that data Beth scraped has that removed already). But I looked a bit at the raw GBIF download of all plantae and it does seem that there are three part names. So I guess we have three options:

option 1: reduce every name to 2-part by ignoring the var or subsp in both data sets. I have run the expanded tanknames -> gbif lookup this way. This means that if TPL synonyms say that Agenus aspecies var. avar is a synonym of Anothergen anothersp then there is an implied synonym that Agenus aspecies is a synonym of Anothergen anothersp. This will increase hits, but may imply incorrect synonyms.

option2: Omit all three part synonyms in TPL only. I've also run the code this way and that is the way gbif_lookup.py is setup now [see line

20]( https://github.com/schwilklab/taxon-name-utils/blob/master/scripts/gbif_lookup.py#L20 ).

This may avoid false positives but will result in fewer hits.

option 3. Actually try matching on three part names when they exist. For gbif, this would require a new extraction of all the unique names from the big plantae download. How big is that file extracted (I can see it is greater than 75G!)? How long did that take to pull the names and what tricks were required? I'm assuming the python zipfile module provides an iterator?

Opinions?

-Dylan

On 06/05/2014 06:03 PM, Will Pearse wrote:

Fantastic! Thank you! I've got a GenBank dataset too - this seems fast, I might try that if you're interested? On 5 Jun 2014 17:06, "Dylan Schwilk" notifications@github.com wrote:

Ok, so I just merged in an implementation of the matching algorithm, see f17076d https://github.com/schwilklab/taxon-name-utils/commit/f17076d. It runs pretty fast and seems to work well.

— Reply to this email directly or view it on GitHub

< https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45281258

.

— Reply to this email directly or view it on GitHub

< https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45286112 .

— Reply to this email directly or view it on GitHub

< https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45344807 .

— Reply to this email directly or view it on GitHub

< https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45381028 .

— Reply to this email directly or view it on GitHub < https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45383713 .

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45495304 .

Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/

dschwilk commented 10 years ago

Hi folks,

Will: I have yet to deal with testing for hyphenated names in which the second part was dropped. I am assuming that we expect the drop to be in dist, not elist? Or should we treat it symmetrically? I can see no easy and fast way to deal with this as it needs to happen before anything else and will be a rate-limiting step.

That aside, has any one else done any experimenting with the code and lists I produced? Here are my observations.

Levenshtein edit distances of 2 I believe are generally sufficient within specific epithets. This captures all gender changes or consonant doubling. Going to edit distance 3 combined with a Jaro-Winkler cutoff is also fine, but a bit slower. I have written a function to explicitly test if a match is one of simple gender switching (https://github.com/schwilklab/taxon-name-utils/blob/schwilk-work/scripts/fuzzy_match.py#L25). This code is not necessary for the matching itself because such switches are captured by the two step process already employed. Rather, this could be harnessed as part of the post matching culling as such matches could be given a high degree of certainty.
Jaro Winkler similarities less than about 0.94 become dominated by false positives. Certainly below 0.93 the majority are false positives (close to 100%, in fact when edit distances up to 3 are allowed).
For the Plants and Fire project tank tree -> gbif lookup I think sticking with binomials is fine for the reason Amy stated. Any "overmatch" in that case is at least in the correct binomial, It is also possible to check these later as I will keep the tank tree name, gbif name and matched TPL name as three separate columns in our final data. I think matching on 3 part names is worth working on, but I'm willing to move ahead with the Fire and Plants bigphylo project without it.

I will post a second comment with my current prosed matching algorithm for the Fire and Plants project.

dschwilk commented 10 years ago

My revised outline algorithm for matching the expanded tank names to gbif based on my experimentation:

Match genus (highest Jaro-Winkler (JW) similarity among all candidates in elist that are within 1 edit distance of dlist name, no match if JW distance > 0.95
Match specific epithets within matched genus, but use edit distance of 2 (perhaps 3) for candidates and a JW cut off of 0.94

These steps should result in a higher false positive than false negative rate. My quick estimations put the false positive rate around 5% of non-exact matches . These can be culled. My proposed culling algorithm:

Keep all simple gender mismatches (code written)
Proposed: keep all simply consonant doubling? Perhaps not necessary as these should no be removed by steps below.
Mark all cases in which both names are TPL "Accepted Names" as suspect and cull if JW similarity does not meet a high threshold ( eg > 0.96 or 0.97)
Final manual pass on remaining matches with JW similarity below some threshold (0.96?)

willpearse commented 10 years ago

My quick question would be - what percentage are exact matches? If the below has an error of 5% on the non-exact matches, then this is fantastic, but if it's 5% on the whole thing then it's still good but maybe needs a quick check.

Since this seems fine, I think the best approach is just to run it on the whole thing and see what we find. Something that looks good/bad a priori might look very different on the real thing.

On 06/10/2014 09:55 AM, Dylan Schwilk wrote:

My revised outline algorithm for matching the expanded tank names to gbif based on my experimentation:

Match genus (highest Jaro-Winkler (JW) similarity among all candidates in elist that are within 1 edit distance of dlist name, no match if JW distance > 0.95

Match specific epithets within matched genus, but use edit distance of 2 (perhaps 3) for candidates and a JW cut off of 0.94

The initial step seems low to me, but as this is based on experimentation I'm very happy with this.

These steps should result in a higher false positive than false negative rate. My quick estimations put the false positive rate around 5% of non-exact matches . These can be culled. My proposed culling algorithm:

Keep all simple gender mismatches (code written) 2 ? Proposed: keep all simply consonant doubling? Perhaps not necessary as these should no be removed by steps below.

Mark all cases in which both names are TPL "Accepted Names" as suspect and cull if JW similarity does not meet a high threshold ( eg > 0.96 or 0.97)

Final manual pass on remaining matches with JW similarity below some threshold (0.96?)

...seems a good approach to me.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45624295.

dschwilk commented 10 years ago

Hi,

I just pushed some code to the schwilk-work branch that is a complete set of steps for what I described above. I also pushed the output of the fuzzy matching which now saves the Jaro winkler distances and includes a boolean column for gender switches. To answer you questions on statistics regarding the matches:

First, ignore that statement about 5%. more accurate numbers are below:

Total matches gbif -> expanded tanknames: 47718
That means that 262671 gbif names are left unmatched (but that makes sense, this is just against the tank tree)
Exact matches: 45182 (94.7%)
fuzzy matches: 2536 (5.3%)
Number of fuzzy matches that are both TPL names (not just "accepted"): 392
Number of fuzzy matches identified as "suspect" (specific epithet jw or genus jw < 0.96): 751
number of above deemed automatically removable (all suspects for which both names are tpl names, but exclude clear simple gender changes -- TPL is not perfect!): 183
Fuzzy matches worth checking manually (possible false postive suspects not in line above): 547. It looks like many of these with a jw < 0.95 are probably false positives. I think this is doable.

I can do the exact same thing with slightly looser initial criteria for edit distances and jw sim see what we get. Maybe just loosen the genus step as you suggest.

willpearse commented 10 years ago

Frankly, this is fantastic. I think 547 is more than doable; if the list of them is outputted somewhere at the end of the script I am happy to set aside tomorrow to going through them as I've hardly been pulling my weight with code!

This is fantastic!

Will

On 06/10/2014 03:19 PM, Dylan Schwilk wrote:

Hi,

I just pushed some code to the schwilk-work branch that is a complete set of steps for what I described above. I also pushed the output of the fuzzy matching which now saves the Jaro winkler distances and includes a boolean column for gender switches. To answer you questions on statistics regarding the matches:

First, ignore that statement about 5%. more accurate numbers are below:

Total matches gbif -> expanded tanknames: 47718

That means that 262671 gbif names are left unmatched (but that makes sense, this is just against the tank tree)

Exact matches: 45182 (94.7%)

fuzzy matches: 2536 (5.3%)

Number of fuzzy matches that are both TPL names (not just "accepted"): 392

Number of fuzzy matches identified as "suspect" (specific epithet jw or genus jw < 0.96): 751

number of above deemed automatically removable (all suspects for which both names are tpl names, but exclude clear simple gender changes -- TPL is not perfect!): 183

Fuzzy matches worth checking manually (possible false postive suspects not in line above): 547. It looks like many of these with a jw < 0.95 are probably false positives. I think this is doable.

I can do the exact same thing with slightly looser initial criteria for edit distances and jw sim see what we get. Maybe just loosen the genus step as you suggest.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45666050.

dschwilk commented 10 years ago

I pushed the cleaned up version to the data/name-lists folder: https://github.com/schwilklab/taxon-name-utils/blob/schwilk-work/data/name-lists/gbif_tank_lookup_140610_cleaned.csv. By sorting on the various columns one can see which names to check. The columns are:

gbif: gbif name
tank: tank tree name
genus_jw: Jaro-Winkler similarity score for genus match
se_je: same for specific epithet match
gswitch: identified as a gender switch in the se, very likely a good match! Note that this is "True" and "False" not "TRUE" or "FALSE" so care in reading in R
bothtpl: Both names are TPL names of some sort (binomial)
suspect: low jw scores
remove: marked as able to remove: suspect AND ! gswitch AND bothtpl

It is also worth manually perusing the ones marked for automatic removal that have high JW scores (There are a few clear false negatives in there)

I've tried to "overmatch" but keep the final manual checking reasonable. There may still be some true matches that are never output (false negatives), but I am confident it is a low percentage based on visually inspecting all matches with low JW scores when I tried some runs with edit distances of 3 and JW cutoffs of 0.9.

In a perfect world, it would be nice to have a different matric than the Jaro-Winkler for that step. One which weighted prefixes as does jw but also understood phonetics (ae for i, i vs y, etc). But JW does pretty well as long as we overmatch a bit.

dschwilk commented 10 years ago

To be clear, the file I refer to above is simply the result of running: https://github.com/schwilklab/taxon-name-utils/blob/schwilk-work/scripts/gbif_lookup.py

and then running this R code on the result of that: https://github.com/schwilklab/taxon-name-utils/blob/schwilk-work/scripts/clean_gbif2tankname.R

All the paths should work. Sorry it is a bit ugly and hackish still.

willpearse commented 10 years ago

I'm moving through in a fairly unsystematic fashion, but I'm very happy with what I'm seeing! A few things:

remove==TRUE - look like good removals to me. I really like the bothtpl==TRUE check; that convinces me in these cases that I can't make a definitive call. I'm actually quite happy keeping all of them dropped and letting that be it.
bothtpl==TRUE & suspect==FALSE - opinion starts to creep in here (which is no bad thing), but there are some cases (e.g., GBIF-Croton setigerus, Tank-Croton setiger) where they are both in TPL and are both listed with 2-star or 3-star confidence. Thus I'm slightly less happy merging them as the same species in the two datasets (even though, frankly, they look like spelling mistakes to me). Do we have information on the confidence with which names are ascribed to the TPL entries? If we do, this step could be automated with some criterion.

On 06/10/2014 03:42 PM, Dylan Schwilk wrote:

I pushed the cleaned up version to the data/name-lists folder: https://github.com/schwilklab/taxon-name-utils/blob/schwilk-work/data/name-lists/gbif_tank_lookup_140610_cleaned.csv. By sorting on the various columns one can see which names to check. The columns are:

gbif: gbif name

tank: tank tree name

genus_jw: Jaro winker similarity score for genus match

se_je: same for specific epithet match

gswitch: identified as a gender switch in the se, very likely a good match! Note that this is "True" and "False" not "TRUE" or "FALSE" so care in reading in R

bothtpl: Both names are TPL names of some sort (binomial)

suspect: low jw scores

remove: marked as able to remove: suspect AND ! gswitch AND bothtpl

It is also worth manually perusing the ones marked for automatic removal that have high JW scores (There are a few clear false negatives in there)

I've tried to "overmatch" but keep the final manual checking reasonable. There may still be some true matches that are never output (false negatives), but I am confident it is a low percentage based on visually inspecting all matches with low JW scores when I tried some runs with edit distances of 3 and JW cutoffs of 0.9.

In a perfect world, it would be nice to have a different matric than the Jaro-Winkler for that step. One which weighted prefixes as does jw but also understood phonetics (ae for i, i vs y, etc). But JW does pretty well as long as we overmatch a bit.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45668753.

dschwilk commented 10 years ago

Yeah, I've been thinking about how to automate more. I improved the is_gender_switch function a bit.

Unfortunately, the data Beth scraped does not have the confidence stars. Actually, I realized a problem just now checking "names_unique.csv": this file has all names with status "Accepted" or "Unresolved" while those with status "Synonym" are only in the synonym table. Not a huge deal, but we need to decide which names to check against in the R cleaning script that produced the bothtpl column. In the data you are perusing, "bothtpl" was checked against _names_unique.csv": both unresolved and accepted names, but I had thought it was all TPL names. Of course synonyms of our actual clist names are already in elist. I should probably make up a all_tpl name list. Not hard.

ejforrestel commented 10 years ago

If need be, we can get more data from TPL. I just initially retrieved the names on the synonym list for each accepted species. It would be easy to match those up to the unresolved list as well.

You guys are making awesome progress on this!

Beth

On Wed, Jun 11, 2014 at 12:48 PM, Dylan Schwilk notifications@github.com wrote:

Yeah, I've been thinking about how to automate more. I improved the is_gender_switch function a bit.

Unfortunately, the data Beth scraped does not have the confidence stars. Actually, I realized a problem just now checking "names_unique.csv": this file has all names with status "Accepted" or "Unresolved" while those with status "Synonym" are only in the synonym table. Not a huge deal, but we need to decide which names to check against in the R cleaning script that produced the bothtpl column. In the data you are perusing, "bothtpl" was checked against _names_unique.csv": both unresolved and accepted names, but I had thought it was all TPL names. Of course synonyms of our actual clist names are already in elist. I should probably make up a all_tpl name list. Not hard.

Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45767991 .

dschwilk commented 10 years ago

Hi Beth,

For names, we are set because you provided all the names. It is just that the "Accepted" and "Synonym"s are in the synonym ragged array (TPL1_1_synonymy_list) and the "Unresolved" and "Accepted"s are names_unique.csv. You had told me this I am sure, I just forgot when I was using names_unique.csv!

How hard would it be (long would it take?) to pull the confidence levels? Actually, for the future, it would be cool to have authors too. But no need for now. This TPL data is a huge resource to have locally.

willpearse commented 10 years ago

Hello Beth too,

Is there any chance of you putting the crawler script up somewhere? Maybe that way we could grab it and not hassle you! :p

Will

On 06/11/2014 12:05 PM, Dylan Schwilk wrote:

Hi Beth,

For names, we are set because you provided all the names. It is just that the "Accepted" and "Synonym"s are in the synonym ragged array (|TPL1_1_synonymy_list|) and the "Unresolved" and "Accepted"s are |names_unique.csv|. You had told me this I am sure, I just forgot when I was using |names_unique.csv|!

How hard would it be (long would it take?) to pull the confidence levels? Actually, for the future, it would be cool to have authors too. But no need for now. This TPL data is a huge resource to have locally.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45770143.

dschwilk commented 10 years ago

Beth put it up in the repo on the Plants and Fire organization, I can move it over to this repo, I assume that is ok with you Beth? There is some overlap of code and data now between those repos, but I think it makes sense to keep all of the general name matching in this current repo.

-Dylan

On 06/11/2014 12:07 PM, Will Pearse wrote:

Hello Beth too,

Is there any chance of you putting the crawler script up somewhere? Maybe that way we could grab it and not hassle you! :p

Will

On 06/11/2014 12:05 PM, Dylan Schwilk wrote:

Hi Beth,

For names, we are set because you provided all the names. It is just that the "Accepted" and "Synonym"s are in the synonym ragged array (|TPL1_1_synonymy_list|) and the "Unresolved" and "Accepted"s are |names_unique.csv|. You had told me this I am sure, I just forgot when I was using |names_unique.csv|!

How hard would it be (long would it take?) to pull the confidence levels? Actually, for the future, it would be cool to have authors too. But no need for now. This TPL data is a huge resource to have locally.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45770143.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45770340.

ejforrestel commented 10 years ago

The script should be up there with the TPL names. And it shouldn't be too difficult to modify it if need be, it is just a matter of parsing the html code further using regex. I warn you, I am not a programmer, so the code may be a bit hack-ish!

I would run it here but I am maxed out at the moment with mr bayes runs, but I am happy to do so in the not too distant future.

Beth

On Wed, Jun 11, 2014 at 1:07 PM, Will Pearse notifications@github.com wrote:

Hello Beth too,

Is there any chance of you putting the crawler script up somewhere? Maybe that way we could grab it and not hassle you! :p

Will

On 06/11/2014 12:05 PM, Dylan Schwilk wrote:

Hi Beth,

For names, we are set because you provided all the names. It is just that the "Accepted" and "Synonym"s are in the synonym ragged array (|TPL1_1_synonymy_list|) and the "Unresolved" and "Accepted"s are |names_unique.csv|. You had told me this I am sure, I just forgot when I was using |names_unique.csv|!

How hard would it be (long would it take?) to pull the confidence levels? Actually, for the future, it would be cool to have authors too. But no need for now. This TPL data is a huge resource to have locally.

Reply to this email directly or view it on GitHub < https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45770143 .

Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45770340 .

ejforrestel commented 10 years ago

Oh yeah, go for it! I forgot that this was a separate repo.

Best, Beth

On Wed, Jun 11, 2014 at 1:16 PM, Dylan Schwilk notifications@github.com wrote:

Beth put it up in the repo on the Plants and Fire organization, I can move it over to this repo, I assume that is ok with you Beth? There is some overlap of code and data now between those repos, but I think it makes sense to keep all of the general name matching in this current repo.

-Dylan

On 06/11/2014 12:07 PM, Will Pearse wrote:

Hello Beth too,

Is there any chance of you putting the crawler script up somewhere? Maybe that way we could grab it and not hassle you! :p

Will

On 06/11/2014 12:05 PM, Dylan Schwilk wrote:

Hi Beth,

For names, we are set because you provided all the names. It is just that the "Accepted" and "Synonym"s are in the synonym ragged array (|TPL1_1_synonymy_list|) and the "Unresolved" and "Accepted"s are |names_unique.csv|. You had told me this I am sure, I just forgot when I was using |names_unique.csv|!

How hard would it be (long would it take?) to pull the confidence levels? Actually, for the future, it would be cool to have authors too. But no need for now. This TPL data is a huge resource to have locally.

Reply to this email directly or view it on GitHub < https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45770143 .

Reply to this email directly or view it on GitHub < https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45770340 .

Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-45771447 .

dschwilk commented 10 years ago

@willpearse, @AmyZanne : any update on the manual checking? I've been swamped in lab with processing material this week and am headed out to the Chisos Mountains tomorrow.

I'd like to settle on the final lokup and then pull the actual gbif records for those taxa. It should be really fast to do. I should be able to do that next week when I'm back from the field.

AmyZanne commented 10 years ago

Thanks for the email. Sorry, I haven't kept up with much of late. I won't get to anything until sometime starting in August. Sorry. I'll try to check in then or feel free to ping me again. Will, have you made any headway? I'm in DC for next 5 days so happy to have a quick chat if that helps. Then I'll be largely out of touch until early Aug.

Best, Amy

On Thu, Jun 19, 2014 at 5:02 PM, Dylan Schwilk notifications@github.com wrote:

@willpearse https://github.com/willpearse, @AmyZanne https://github.com/AmyZanne : any update on the manual checking? I've been swamped in lab with processing material this week and am headed out to the Chisos Mountains tomorrow.

I'd like to settle on the final lokup and then pull the actual gbif records for those taxa. It should be really fast to do. I should be able to do that next week when I'm back from the field.

— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-46616242 .

Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/

dschwilk commented 10 years ago

Hi guys,

I've been busy with field and lab work so I did not get to it this week.

If Will has gone through the questionable matches, then great. Otherwise, I plan to make one more run through the matching in which I tag all names that are in TPL and then manually check. THen I can extract the GBIF records for the fire and plants project. It is not that much total work, but I am in the field every Friday-Sunday then lab work Monday-Thursday for June-July so I have not had a chance. I may be able to sneak it in next week.

-Dylan

On 06/25/2014 07:54 PM, AmyZanne wrote:

Thanks for the email. Sorry, I haven't kept up with much of late. I won't get to anything until sometime starting in August. Sorry. I'll try to check in then or feel free to ping me again. Will, have you made any headway? I'm in DC for next 5 days so happy to have a quick chat if that helps. Then I'll be largely out of touch until early Aug.

Best, Amy

On Thu, Jun 19, 2014 at 5:02 PM, Dylan Schwilk notifications@github.com wrote:

@willpearse https://github.com/willpearse, @AmyZanne https://github.com/AmyZanne : any update on the manual checking? I've been swamped in lab with processing material this week and am headed out to the Chisos Mountains tomorrow.

I'd like to settle on the final lokup and then pull the actual gbif records for those taxa. It should be really fast to do. I should be able to do that next week when I'm back from the field.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-46616242

.
Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/
— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-47177018.

AmyZanne commented 10 years ago

Ok, let's see what Will says. Otherwise let me know once you get through your next round of TPL query. Divide and concur can be good. We did that for our tempo mode group via a shared google doc. I'm off to Oz next week to scope out field site and set up pilot studies with Will C so will be out of touch while there until late Jul. Then I teach a two week intensive course. If I have down time though happy to help out. Happy field work Dylan!

Best, Amy

On Thu, Jun 26, 2014 at 9:46 AM, Dylan Schwilk notifications@github.com wrote:

Hi guys,

I've been busy with field and lab work so I did not get to it this week.

If Will has gone through the questionable matches, then great. Otherwise, I plan to make one more run through the matching in which I tag all names that are in TPL and then manually check. THen I can extract the GBIF records for the fire and plants project. It is not that much total work, but I am in the field every Friday-Sunday then lab work Monday-Thursday for June-July so I have not had a chance. I may be able to sneak it in next week.

-Dylan

On 06/25/2014 07:54 PM, AmyZanne wrote:
Thanks for the email. Sorry, I haven't kept up with much of late. I won't get to anything until sometime starting in August. Sorry. I'll try to check in then or feel free to ping me again. Will, have you made any headway? I'm in DC for next 5 days so happy to have a quick chat if that helps. Then I'll be largely out of touch until early Aug.

Best, Amy

On Thu, Jun 19, 2014 at 5:02 PM, Dylan Schwilk <notifications@github.com

wrote:

@willpearse https://github.com/willpearse, @AmyZanne https://github.com/AmyZanne : any update on the manual checking? I've

been swamped in lab with processing material this week and am headed out to the Chisos Mountains tomorrow.

I'd like to settle on the final lokup and then pull the actual gbif records for those taxa. It should be really fast to do. I should be

able to do that next week when I'm back from the field.

— Reply to this email directly or view it on GitHub

< https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-46616242

.
Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/
— Reply to this email directly or view it on GitHub < https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-47177018 .
— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-47227173 .

Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/

willpearse commented 10 years ago

Sorry to be coming to this late, I've been travelling and I wasn't able to keep up.

Early July (i.e., 4th onwards) is better for me for checking. What I was looking at before looked OK, and I was very happy with what it was flagging as acceptable and unacceptable.

On 06/26/2014 01:54 AM, AmyZanne wrote:

Thanks for the email. Sorry, I haven't kept up with much of late. I won't get to anything until sometime starting in August. Sorry. I'll try to check in then or feel free to ping me again. Will, have you made any headway? I'm in DC for next 5 days so happy to have a quick chat if that helps. Then I'll be largely out of touch until early Aug.

Best, Amy

On Thu, Jun 19, 2014 at 5:02 PM, Dylan Schwilk notifications@github.com wrote:

@willpearse https://github.com/willpearse, @AmyZanne https://github.com/AmyZanne : any update on the manual checking? I've been swamped in lab with processing material this week and am headed out to the Chisos Mountains tomorrow.

I'd like to settle on the final lokup and then pull the actual gbif records for those taxa. It should be really fast to do. I should be able to do that next week when I'm back from the field.

— Reply to this email directly or view it on GitHub

https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-46616242 .
Dr. Amy Zanne
Department of Biological Sciences
2023 G St. NW
George Washington University
Washington, DC 20052

Office: 352 Lisner Hall
Office Phone: (202) 994-8751
Lab: 409 Bell Hall
Lab Phone: (202) 994-9613
Fax: (202) 994-6100
Website: http://www.phylodiversity.net/azanne/
— Reply to this email directly or view it on GitHub https://github.com/schwilklab/taxon-name-utils/issues/2#issuecomment-47177018.

dschwilk commented 7 years ago

All implemented.