tmesaglio / Australian-plant-photos

synonomy system #7

Closed wcornwell closed 2 years ago

wcornwell commented 2 years ago

Are we correcting the iNat names to APC before the match? What's the best way to do this?

tmesaglio commented 2 years ago

let me think about this now. My immediate thought was yes, that would be the best, but I need to figure out the most efficient way to do it

wcornwell commented 2 years ago

cool. we just need it to be in code, so that we can scale up to Australia without too much trouble

wcornwell commented 2 years ago

btw, for something this complicated, it's typical to write pseudo-code first

tmesaglio commented 2 years ago

ok I have no clue if what I'm proposing here is actually possible in R, but this is what I think would be the most efficient method

  1. Extract the following columns from the original APC file: nameType, acceptedNameUsage, taxonomicStatus, scientificName, canonicalName and taxonRank
  2. Within taxonRank, filter to only include species, forma, varietas, subspecies; these are the ranks which could contain iNat synonyms.
  3. From here, we make two APC files. The first one we filter taxonomicStatus to accept all values except 'accepted'. This now = 'APC1', and is the synonyms file.
  4. Make a second copy of the filtered file from step 2, but filter taxonomicStatus to only include 'accepted'; this is now 'APC2'.
  5. Make dataframe from the iNat download; single column of observed species.
  6. Do some kind of matching exercise between the iNat and APC2 files, whereby, row by row, we check to see if the iNat name matches to a name in any of the rows/cells in the canonicalName column. This matching process creates a second column for the iNat dataframe; If there is a one-to-one match, it gives it a yes/1/positive indicator. If no match, then a no/0/negative indicator.
  7. We then filter to just the NOs/zeroes in the iNat file, and then do the same matching exercise, but comparing with APC1's canonicalName column. When it finds a match, it creates another column in the APC file, with each row equal to the accepted name for the synonym that got matched.
  8. That should just leave a very small number of entities we'd have to do manually, e.g. unregistered orthographic variants.
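Steps 1 to 5 above could be sketched roughly as follows. This is only an illustration under assumptions: the `apc` data frame is a tiny invented stand-in for the real APC file, the species names are made up, the status value 'synonym' is a placeholder for whatever non-accepted values the real file uses, and the rank strings are assumed to be spelled exactly as in step 2.

```r
library(dplyr)

# Invented toy stand-in for the APC file (the real one is far larger).
apc <- data.frame(
  nameType          = "scientific",
  acceptedNameUsage = c("Plantus alpha", "Plantus alpha", "Plantus", "Plantus beta"),
  taxonomicStatus   = c("accepted", "synonym", "accepted", "accepted"),
  scientificName    = c("Plantus alpha", "Oldus alpha", "Plantus", "Plantus beta"),
  canonicalName     = c("Plantus alpha", "Oldus alpha", "Plantus", "Plantus beta"),
  taxonRank         = c("species", "species", "genus", "species"),
  stringsAsFactors  = FALSE
)

# Steps 1-2: keep the six columns, then keep only the ranks of interest.
ranks <- c("species", "forma", "varietas", "subspecies")
apc_ranked <- apc %>%
  select(nameType, acceptedNameUsage, taxonomicStatus,
         scientificName, canonicalName, taxonRank) %>%
  filter(taxonRank %in% ranks)

# Step 3: everything that is not 'accepted' is the synonyms file (APC1).
# Note that filter() also drops rows where taxonomicStatus is NA.
apc1 <- apc_ranked %>% filter(taxonomicStatus != "accepted")

# Step 4: only the 'accepted' rows (APC2).
apc2 <- apc_ranked %>% filter(taxonomicStatus == "accepted")

# Step 5: single-column data frame of observed names (invented examples).
inat <- data.frame(
  name = c("Plantus alpha", "Oldus alpha", "Unknownus gamma"),
  stringsAsFactors = FALSE
)
```

Both APC1 and APC2 are built from the same rank-filtered file, so together they cover every row that survived step 2 (apart from any rows with a missing status).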

Does this all make sense, and is what I'm describing actually doable?

wcornwell commented 2 years ago

looks good and very possible in R!

Next step is to do a few (~5) by hand, so we can test whether the (yet to be written) code is working. Then we turn each step into a line of code.
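One way the hand-checked cases could be turned into an automatic check is sketched below. Everything here is hypothetical: `classify_names()` is a made-up helper standing in for the yet-to-be-written matching code, and the names are invented rather than real hand-checked iNat names.

```r
# Hypothetical helper: label each name by where (if anywhere) it is found.
classify_names <- function(names, accepted, synonyms) {
  ifelse(names %in% accepted, "accepted",
         ifelse(names %in% synonyms, "synonym", "no match"))
}

# Hand-checked cases: the name plus the answer worked out manually.
# (Invented names; in practice these would be ~5 real ones.)
hand_checked <- data.frame(
  name     = c("Plantus alpha", "Oldus alpha", "Unknownus gamma"),
  expected = c("accepted", "synonym", "no match"),
  stringsAsFactors = FALSE
)

result <- classify_names(
  hand_checked$name,
  accepted = c("Plantus alpha", "Plantus beta"),
  synonyms = c("Oldus alpha")
)

# Stops with an error if the code disagrees with the manual answers.
stopifnot(identical(result, hand_checked$expected))
```

If the code and the manual answers ever disagree, `stopifnot()` halts the script, which makes a regression obvious before scaling up.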

Bummer about the next bioblitz, should we reinstitute the Friday meeting?

tmesaglio commented 2 years ago

yeah I think reinstitute for this week at least

tmesaglio commented 2 years ago

I've now pushed an updated version of the 'synonym cleaning' script that has code for steps 1-5 above, plus the random sample code. Just need code for steps 6 and 7

wcornwell commented 2 years ago

I think 6 and 7 might be a situation for https://www.statology.org/dplyr-case_when/

tmesaglio commented 2 years ago

looking at a lot of tutorials, and almost all seem to be for numerical values; can this be used for text strings?

wcornwell commented 2 years ago

yes, just need a few more "vocab words"

i'd use method 2 here: https://www.geeksforgeeks.org/how-to-test-if-a-vector-contains-the-given-element-in-r/
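A minimal sketch of how `case_when()` can work on text strings, assuming the "method 2" in that link refers to the `%in%` operator. The vectors `accepted_names` and `synonym_names` and all the species names are invented for illustration; in the real script they would come from the canonicalName columns of the two APC files.

```r
library(dplyr)

# Invented example vectors standing in for the APC canonicalName columns.
accepted_names <- c("Plantus alpha", "Plantus beta")
synonym_names  <- c("Oldus alpha")

inat <- data.frame(
  name = c("Plantus alpha", "Oldus alpha", "Unknownus gamma"),
  stringsAsFactors = FALSE
)

# %in% returns TRUE/FALSE for each name, which is all case_when() needs,
# so text works just as well as numbers. Conditions are checked in order.
inat <- inat %>%
  mutate(match = case_when(
    name %in% accepted_names ~ "accepted",
    name %in% synonym_names  ~ "synonym",
    TRUE                     ~ "none"
  ))
```

Because the conditions are checked top to bottom, a name that appears in both vectors would be labelled by the first condition it satisfies.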

tmesaglio commented 2 years ago

ok I can now get it to simultaneously check all accepted names and synonyms and match them to the iNat names, giving a 'yes' if matched to accepted, a 'yes2' if matched to a synonym, and a 'no' if matched to nothing using this:

image

what I haven't figured out is how to, instead of putting a 'yes2', to insert the accepted name correlating with that synonym
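One possible approach, sketched under assumptions, is to look up the row of the synonym with `match()` and pull the accepted name from that row. The data below is invented, and the sketch assumes acceptedNameUsage is written in the same form as the names you want in the output (in the real APC file it may carry authorship that would need handling).

```r
library(dplyr)

# Invented stand-ins for the accepted names and the synonyms file (APC1).
accepted_names <- c("Plantus alpha", "Plantus beta")
apc1 <- data.frame(
  canonicalName     = c("Oldus alpha"),
  acceptedNameUsage = c("Plantus alpha"),
  stringsAsFactors  = FALSE
)

inat <- data.frame(
  name = c("Plantus alpha", "Oldus alpha", "Unknownus gamma"),
  stringsAsFactors = FALSE
)

out <- inat %>%
  mutate(apc_name = case_when(
    # Already an accepted name: keep it as is.
    name %in% accepted_names     ~ name,
    # A synonym: match() gives the row in apc1, which indexes the accepted name.
    name %in% apc1$canonicalName ~
      apc1$acceptedNameUsage[match(name, apc1$canonicalName)],
    # No match anywhere: leave as NA for manual checking.
    TRUE                         ~ NA_character_
  ))
```

Note that `match()` only returns the first matching row, so a synonym that maps to more than one accepted name would silently get the first one; those cases would still need a manual look. A `left_join()` on canonicalName would be another way to do the same lookup.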

tmesaglio commented 2 years ago

running this code for the entire Tassie dataset, we get:

1307 names perfectly matched
51 names for which the iNat name is treated as a synonym by APC
34 names for which there was no match with the APC

wcornwell commented 2 years ago

Looks like there are a fair few recent garden escapes in Tassy of plants that are native to other parts of Aus! Interesting that iNat is picking them up.

Let's discuss next steps tomorrow!

tmesaglio commented 2 years ago

We are just about all systems go for the whole dataset. Check out my latest push (synonym-cleaning v2), the code in which now produces a dataframe that looks like this: image image

tmesaglio commented 2 years ago

I also pushed an Excel file ('Tassie checks') that contains all the species in the above dataframe that I had to manually check, and my recommendations for action. A few of these I'd like to check with you guys before we upscale to all of Australia