globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names

Parsing Names in a Hierarchical Fashion #128

Open jtmiller28 opened 1 year ago

jtmiller28 commented 1 year ago

So I thought I'd leave the framework for a work-in-progress idea I have for resolving large datasets that include complex names here. Let me know if there are any complications with my reasoning, or thoughts about it as a whole @jhpoelen @seltmann

For some background, the underlying issue I have been constantly encountering when resolving names is that the complexity of the name provided by the source doesn't always match the complexity in the catalogue. This shows up in infraspecific epithets, varieties, subspecies, etc., and has a lot to do with individual decisions about species concepts. I believe this is a significant issue when trying to create a reusable dataset, since it forces the creator of the dataset to pick and choose the level of taxonomic specificity they think is appropriate, which may not match the catalogue's registered names and can thereby result in a loss of resolved names by that choice alone.

As an example: for my analysis I aim to create a dataset that includes all levels of designation, provided that they are accurate. The underlying problem is that a lot of names include pieces that are not found in the World Flora Online catalogue and fail to resolve, since there is no fuzzy matching scheme. The dangers of fuzzy matching applied to a large set of data are apparent from my past resolutions, so I believe it should be avoided unless there is appropriate suspicion about the names.

An alternative approach suggested by one of my labmates is to construct a for-loop that uses the resolver to give the lowest (taxonomic) level of designation possible given the parsed levels of the name.

Here, I'll show how, when using a parser, you can get a name that won't resolve correctly given my choice of the WFO catalogue.

echo -e "\tZea mays subsp. mays raza onaveño" | nomer append gn-parse
returns: Zea mays subsp. mays raza onaveño SAME_AS Zea mays ssp. mays raza onaveno

however...
echo -e "\tZea mays ssp. mays raza onaveño" | nomer append wfo
returns: Zea mays ssp. mays raza onaveño NONE Zea mays ssp. mays raza onaveño

while
echo -e "\tZea mays subsp. mays" | nomer append wfo
returns: Zea mays subsp. mays HAS_ACCEPTED_NAME WFO:0001096778 Zea mays subsp. mays subspecies Angiosperms | Poales | Poaceae | Zea | Zea mays | Zea mays subsp. mays WFO:9949999999 | WFO:9000000415 | WFO:7000000483 | WFO:4000041167 | WFO:0000907754 | WFO:0001096778 phylum | order | family | genus | species | subspecies http://www.worldfloraonline.org/taxon/wfo-0001096778

The issue here is that the cultivar is not a recognized name in the naming scheme used by wfo. After talking with some plant biologists, cultivars are not relevant from a species-concept perspective, so we are losing data due to the added complexity of the name. Essentially our choices are: limit the complexity of ALL names to species level and retain this record; or identify cases of cultivars and remove them from the dataset, or somehow notate them so they parse out (I believe there is an option for this in gn-parser). However, it seems more logical to apply an approach that circumvents all of these types of issues rather than patch-work through each one we encounter as name complexity increases.

The overall problem is that the complexity of names forces us to arbitrarily choose the level a name should be parsed to, which can result in the loss of data. The goal should be to resolve names in our dataset to the lowest (taxonomic) level possible and label at which level this resolution happens.

The concept would go as follows: run names through a parsing service and retrieve all of the variations of the name produced by parsing. This would result in the parsed names according to the scheme (including subsp., var., and cultivars), the authorship, and finally two columns for genus and specificEpithet.

The outputs of gbif parsing show:
canonicalNameWithMarker = names including subsp., var., etc.
canonicalName = names without the marker but still including the infraspecific epithet, e.g. Zea mays subsp. mays -> Zea mays mays
genus = parsed-out genus
specificEpithet = parsed-out specificEpithet
authorship = parsed-out authorship
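For reference, a minimal sketch of batch-parsing a list of verbatim names with the gbif-parse matcher, mirroring the single-name examples above (the file names are placeholders, and the exact pipeline used here may differ):

```bash
# Sketch only: parse every verbatim name in one pass.
# nomer expects tab-separated input of the form <externalId>\t<name>;
# the id column is left empty here, as in the echo -e "\t..." examples above.
awk '{ print "\t" $0 }' verbatim_names.txt \
  | nomer append gbif-parse \
  > parsed_names.tsv
```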

We can then run this parsed-name table through the following for-loop scheme. The goal is to find the name at the lowest level of taxonomic resolution, plus authorship if possible. Failing that, we take successive steps back until resolution occurs and label at what level the name was resolved (if at all).

boolean found = false
if (authorship exists)
    found = try canonicalNameWithMarker + authorship
if (!found)
    found = try canonicalNameWithMarker
if (!found AND canonicalNameWithMarker != canonicalName)
    found = try canonicalName
if (!found)
    found = try genusOrAbove + specificEpithet
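As a rough illustration of this fallback against the wfo matcher (not an exact implementation: the helper, the parsed field values, and the candidate ordering are hypothetical, and it assumes nomer reports the relation NONE when no match is found, as in the wfo example above):

```bash
# Sketch of the hierarchical fallback for one parsed record.
resolve_name() {
  local candidate="$1" result
  result=$(printf '\t%s\n' "$candidate" | nomer append wfo)
  # any relation other than NONE counts as a successful resolution
  if ! printf '%s\n' "$result" | grep -q $'\tNONE\t'; then
    printf '%s\n' "$result"
    return 0
  fi
  return 1
}

# hypothetical parsed fields for one record, as produced by the parsing step
canonicalNameWithMarker="Zea mays subsp. mays"
canonicalName="Zea mays mays"
genus="Zea"
specificEpithet="mays"
authorship=""   # often missing or inconsistent across providers

# candidate names ordered from most to least specific, per the pseudocode above
candidates=()
if [ -n "$authorship" ]; then
  candidates+=("$canonicalNameWithMarker $authorship")
fi
candidates+=("$canonicalNameWithMarker")
if [ "$canonicalNameWithMarker" != "$canonicalName" ]; then
  candidates+=("$canonicalName")
fi
candidates+=("$genus $specificEpithet")

for candidate in "${candidates[@]}"; do
  if resolve_name "$candidate"; then
    echo "resolved using: $candidate" >&2
    break
  fi
done
```

Each record would then carry both the resolved name and the step at which it resolved, which is what the labeling idea above is meant to capture.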

This is still a rough framework, and I'm sure it has simplifications I am currently overlooking. However, I think the overall goal is something that might improve the construction of datasets that aggregate names, as choosing a single type of parsed name currently seems to cause issues in the dataset. Rather than losing data by being overly specific, it seems to make sense to me that finding resolution at just the species level would still provide relevant information about the occurrence.

Let me know your thoughts and especially criticisms!

jhpoelen commented 1 year ago

@jtmiller28 thanks for sharing your ideas. I think your carefully described taxonomic parsing approach makes a lot of sense.

As I was thinking about this over the last couple of days, I was wondering about how this proposed parsing workflow would change the taxonomic perspective on a dataset.

In a way, depending on the provided name, you'd apply a different taxonomic perspective: for some names, you'd find a match using try canonicalNameWithMarker + authorship, whereas for other names, you'd fall back to try genusOrAbove + specificEpithet. I assume that different matching error estimates (e.g., false positive rates) apply to the different matching algorithms. How would you estimate the matching errors for your assembled parsing strategies?

jtmiller28 commented 1 year ago

Good point @jhpoelen, I see what you mean: having different levels of resolution present in a dataset might be risky. I would argue that since the providers of the names, once aggregated together, cannot supply cohesive names across the board (some elect to use subsp., cultivars, and varieties, some don't include authorship, etc.), this method is more applicable than losing data by choosing an arbitrary level of resolution. The chance of failed resolution also increases depending on the catalogue of choice, as we have seen notation/standards differ for name parts beyond genus and specificEpithet.

The risk is still apparent, however; for example, if a provider listed a name down to a variety but resolution fails at that level in WFO matching, it will only be caught at the genus + specificEpithet level. This wouldn't be as the provider intended, but is it better to remove the record completely, as would occur if no match is found? I would argue that, provided the scope is still within the species level, there is still relevant biology to be described by taking a step back and recovering the name at genus + specificEpithet versus removing the name.

I will keep this in mind and hopefully devise a good tagging system for labeling the point at which resolution occurs for each name in the matching scheme. Maybe additional logic is required, such as assessing whether canonicalName, canonicalNameWithMarker, and canonicalName + authorship are equal, to better track when name resolution happens downstream of the originally intended name.

Thanks for your thoughts! I'll keep this in mind and hopefully have some numbers on these downstream parsed names for you in the near future.

jtmiller28 commented 1 year ago

Update with some Quantitative results:

I proceeded with this idea and found some results of interest. The logic was to proceed step by step through the parsed names offered by gbif-parse, using lower levels of complexity for the resolution of each name. Using COALESCE in SQL, a new table was constructed pulling the first non-null result per name and tracking at which step this resolution occurs. [screenshot of the coalesce step omitted]
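For reference, a minimal sketch of what that coalesce step could look like in SQLite (the table and column names, such as match_results and resolved_with_authorship, are hypothetical and may differ from the actual query):

```bash
sqlite3 names.db <<'SQL'
-- pick the first non-null match per name and record which strategy produced it
SELECT
  verbatim_name,
  COALESCE(resolved_with_authorship,
           resolved_with_marker,
           resolved_canonical,
           resolved_genus_epithet)              AS resolved_name,
  CASE
    WHEN resolved_with_authorship IS NOT NULL THEN 'canonicalNameWithMarker + authorship'
    WHEN resolved_with_marker     IS NOT NULL THEN 'canonicalNameWithMarker'
    WHEN resolved_canonical       IS NOT NULL THEN 'canonicalName'
    WHEN resolved_genus_epithet   IS NOT NULL THEN 'genus + specificEpithet'
    ELSE 'unresolved'
  END                                           AS resolution_step
FROM match_results;
SQL
```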

Applying this to my dataset of ~140,000 unique plant names to align, the following percentages represent how many names were resolved using this coalescing method. One thing of note: names considered unparsable by gbif-parse, such as unrecognizable complex hybrids, were removed and are not accounted for here. [screenshot of resolution percentages omitted]

Most interestingly, the majority of names (>75%) are resolved in their most complex form, with or without authorship. A remaining ~9% of the data is resolved by removing the marker or simplifying the name to just genus + specificEpithet.

Diving deeper into authorship, ~80% of the unique names in this dataset had some form of associated authorship; however, these results make it apparent that authorship was a causative factor in failed name resolution. This is likely due to heterogeneity in authorship standards concerning punctuation, order of authors, etc., when using exact matching schemes. One easy repair that was incorporated into this analysis was whitespace cleanup, since my preliminary work showed that fixing whitespace recovered 5% of the subset of names used to pilot the script. Further work on fixing authorship would be a rigorous task in itself, considering that we would have to align all variations of authorship in aggregated datasets to be synonymous with the catalogue's standard (a mini version of what we are already doing with the scientificName!).
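For reference, a minimal sketch of that whitespace cleanup (the exact rule used in the analysis isn't stated; this one collapses runs of whitespace and trims the ends of an authorship string before exact matching):

```bash
# hypothetical messy authorship string -> "(L.) Kuntze"
printf '%s\n' " (L.)  Kuntze " | sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//'
```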

My biggest concern in terms of name alignment quality with this method is that reducing name complexity could result in inappropriate name matching. A case for this could be that a lower level of taxonomic complexity (e.g. genus + specificEpithet) maps to a synonymous name the provider did not intend. I have yet to find an example of this, but it seems an appropriate concern considering that this method can simplify name complexity relative to the provider's intentions.

Measuring the amount of change from verbatim -> resolved names across the whole dataset using the World Flora Online catalogue: verbatim -> accepted = 44.9% of names changed

And looking only at how many species changed (something in the genus and/or specificEpithet changed): verbatim -> accepted = 29% of names changed

Overall these percentages seem rather high; however, GBIF's name resolution (based on their catalogue system) also imposes large amounts of change (38.1% for full names and 27% for species names).

These comparisons aren't very informative, however, since I am using wfo instead of their catalogue. A better comparison would be to use gbif's catalogue through Nomer and then compare the results.

So far this looks promising for dealing with complex aggregations of names: 87.9% of the names in my dataset were resolved this way before introducing fuzzy matching, which is great progress considering the complexity of infraspecific information in plant names.