inaturalist / iNaturalistMLWork


Generate mapping of model taxa to current taxa #33

Closed pleary closed 4 months ago

pleary commented 4 months ago

iNaturalist taxa can be modified at any time. Taxa may be split into multiple taxa (preserving the original or not), merged with other taxa, replaced 1-1 with another taxon, or removed with no replacement. The taxonomy files associated with models represent taxa at a fixed point in time.

The goal of this ticket is to have a script (ideally in the rails repo, possibly the node API repo) that will generate these mappings given the path to a taxonomy file and the date that file was generated. It should produce a CSV file of taxon mappings, based on taxon changes made since the taxonomy file was generated. Each mapping pairs the original taxon_id with its replacement taxon_ids, which may include the original, may include multiple taxa, or may indicate there is no mapping because the taxon is inactive with no replacement. The output should also include the other fields in taxonomy.csv that we expect for various calculations (parent_taxon_id, rank_level, name). Potentially those would need to be in a separate file.
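To make the three cases concrete, here is a minimal Python sketch of what writing such a mapping file could look like. It is only an illustration: the column names `model_taxon_id` and `replacement_taxon_id`, the one-row-per-pair layout, and the empty replacement column for removed taxa are all assumptions, not a format this ticket specifies.

```python
import csv
import io


def write_mappings(mappings, out):
    """Write one row per (original, replacement) pair.

    `mappings` maps an original taxon_id to a list of replacement taxon_ids.
    An empty list means the taxon is inactive with no replacement; it is
    written as a single row with an empty replacement column (an assumed
    convention, not one defined by the ticket).
    """
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical column names.
    writer.writerow(["model_taxon_id", "replacement_taxon_id"])
    for taxon_id in sorted(mappings):
        replacements = mappings[taxon_id]
        if not replacements:
            writer.writerow([taxon_id, ""])
        for replacement_id in replacements:
            writer.writerow([taxon_id, replacement_id])


# Made-up taxon_ids covering the three cases described above.
example = {
    101: [101, 202],  # split, original preserved
    102: [203],       # replaced 1-1
    103: [],          # removed with no replacement
}
buffer = io.StringIO()
write_mappings(example, buffer)
csv_text = buffer.getvalue()
```

Using one row per pair keeps the file flat, so a split simply becomes several rows with the same original taxon_id.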

An edge case to consider is when there is a new taxon mapping, but the parent of that new taxon isn't already represented in taxonomy.csv, and we'd want metadata about the parent (its parent_taxon_id, rank_level, and name). If it's confusing to include both taxon mappings and metadata about new taxa and their novel ancestors in a single file, multiple files would be OK.
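One way to find those novel ancestors is to walk up from each new taxon until reaching something the model taxonomy already contains. The sketch below assumes a hypothetical `parent_of` lookup (taxon_id to parent taxon_id, `None` at the root) built from the current taxonomy; the function name and data shapes are illustrative only.

```python
def novel_ancestors(known_ids, parent_of, new_taxon_ids):
    """Collect ancestors of new taxa that are absent from the model taxonomy.

    known_ids:     set of taxon_ids present in taxonomy.csv
    parent_of:     hypothetical dict of taxon_id -> parent taxon_id
                   (None at the root) in the current taxonomy
    new_taxon_ids: taxon_ids introduced by the mappings
    """
    novel = []
    # The new taxa themselves are already accounted for, so skip them too.
    seen = set(known_ids) | set(new_taxon_ids)
    for taxon_id in new_taxon_ids:
        current = parent_of.get(taxon_id)
        # Climb until we hit the root or an ancestor we already know about.
        while current is not None and current not in seen:
            novel.append(current)
            seen.add(current)
            current = parent_of.get(current)
    return novel


# Made-up ids: taxon 202 is new, and its parent 50 is not in taxonomy.csv.
known_ids = {1, 10}
parent_of = {202: 50, 50: 10, 10: 1, 1: None}
missing = novel_ancestors(known_ids, parent_of, [202])
```

Each id in `missing` would then need its own parent_taxon_id, rank_level, and name recorded alongside the new taxa.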

As an example of why we should consider the time the taxonomy.csv file was generated, consider a taxon that was split into two (itself and a new taxon). If this change was committed before the taxonomy file was generated, that information would be incorporated into taxonomy.csv, and we would not want the second split taxon listed as a mapping created for that taxonomy.csv.
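A small sketch of that cutoff, in Python: only changes committed after the taxonomy file was generated are kept. The record shape (dicts with a `committed_at` timestamp) is a hypothetical stand-in for however taxon changes are actually stored.

```python
from datetime import datetime


def changes_after(changes, generated_at):
    """Keep only taxon changes committed after the taxonomy file was generated."""
    return [c for c in changes if c["committed_at"] > generated_at]


# Made-up data: the split predates the taxonomy file, the swap does not.
generated_at = datetime(2023, 6, 1)
changes = [
    {"id": 1, "type": "split", "committed_at": datetime(2023, 5, 1)},
    {"id": 2, "type": "swap", "committed_at": datetime(2023, 7, 1)},
]
relevant = changes_after(changes, generated_at)
```

Here the split is dropped because its result is already reflected in taxonomy.csv, and only the later swap would produce a mapping.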

This CSV file can later be generated regularly (daily, hourly, etc.) and used in static implementations that don't have database access to perform these queries dynamically (the python API, mobile apps, etc.).

pleary commented 4 months ago

Script was added in https://github.com/inaturalist/inaturalist/commit/c1448759622dedc7533424d21b5ef1ba93584d2c . It generates two files: synonyms.csv and synonym_taxonomy.csv. The former lists the replacement taxon_ids for a given taxon in the model. There may be 0 (the taxon was removed), 1 (the taxon was replaced), or many (the taxon was split). If there are many replacements, the original taxon_id may be one of them.
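For a static consumer such as the python API, reading synonyms.csv might look roughly like the sketch below. The real column layout of the file is not described here, so the column names `model_taxon_id` and `replacement_taxon_id`, and the use of an empty replacement column for removed taxa, are assumptions made purely for illustration.

```python
import csv
import io


def load_synonyms(csv_file):
    """Group replacement taxon_ids by the model taxon they replace."""
    replacements = {}
    for row in csv.DictReader(csv_file):
        # Hypothetical column names; adjust to the actual file.
        targets = replacements.setdefault(int(row["model_taxon_id"]), [])
        if row["replacement_taxon_id"]:
            targets.append(int(row["replacement_taxon_id"]))
    return replacements


def describe(targets):
    """Classify a model taxon by how many replacements it has."""
    if not targets:
        return "removed"
    if len(targets) == 1:
        return "replaced"
    return "split"


# Made-up sample content in the assumed layout.
sample = io.StringIO(
    "model_taxon_id,replacement_taxon_id\n"
    "101,101\n"
    "101,202\n"
    "102,203\n"
    "103,\n"
)
synonyms = load_synonyms(sample)
kinds = {taxon_id: describe(targets) for taxon_id, targets in synonyms.items()}
```

A model taxon that does not appear in the file at all would be treated as unchanged under this reading.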

The latter, synonym_taxonomy.csv, includes the entire model taxonomy recreated with ancestry information at the time of generation, including all original model taxa as well as all new synonyms. It was easier to recreate the whole thing than to list out all the changes, which could be many and could affect taxa that didn't have synonym changes. The updated taxonomy can be used in place of the taxonomy generated at model data export time.