traitecoevo / austraits.build

Source for AusTraits

Remove duplicate flora data & add authorship to flora data #791

Closed ehwenk closed 1 month ago

ehwenk commented 4 months ago

Two pieces of work on this branch that cause most data.csv files for datasets with flora data to be completely overwritten:

  1. remove all woodiness, growth form, and life history values from the "original" flora scrapings, since we have complete trait value datasets for these traits (the most common error here is "vines that climb to tree tops" being designated as trees, but there are others)
  2. remove all taxon_name x trait_name x dataset_id combinations that appear in both the "original" and "new" scraped datasets. There are indeed updated values for a number of numeric traits, and in the ~100 profiles I've looked up where old and new differ, I found only 1 mistake in the newer versions. That said, the differences are a small minority: for trait x taxon x dataset values present in both old and new flora extractions, 98+% are identical.
  3. retain all categorical data that is only in the "original" scrapings (except the three complete traits). I've spot-checked lots of values and haven't found any errors, and other than growth form, woodiness, and life history there isn't much overlap in the categorical traits scraped in the "original" and "new" flora datasets.
  4. for numeric traits, for trait x taxon x dataset combinations that are only in the "original" scrapings, I manually checked every data point (~8000 values across all floras) and manually corrected or dismissed incorrect values.
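
For readers who want to see the filtering logic in the steps above, here is a minimal base R sketch on invented toy data. It is not the code on this branch: the data frames `original` and `new`, the helper `key()`, the column `needs_manual_check`, the trait names in `complete_traits` and `numeric_traits`, and every value are hypothetical stand-ins. It also assumes that when a combination appears in both extractions, the "original" row is the one dropped.

    # Toy stand-ins for the two flora extractions; all names and values are invented.
    original <- data.frame(
      taxon_name = c("Acacia aneura", "Acacia aneura", "Acacia aneura",
                     "Banksia serrata", "Banksia serrata"),
      trait_name = c("woodiness", "leaf_length", "flower_colour",
                     "leaf_length", "seed_length"),
      dataset_id = "Flora_A",
      value      = c("woody", "50", "yellow", "120", "15"),
      stringsAsFactors = FALSE
    )
    new <- data.frame(
      taxon_name = c("Acacia aneura", "Banksia serrata"),
      trait_name = c("leaf_length", "leaf_length"),
      dataset_id = "Flora_A",
      value      = c("50", "125"),
      stringsAsFactors = FALSE
    )

    # Assumed trait names for the three "complete" traits and the numeric traits.
    complete_traits <- c("woodiness", "plant_growth_form", "life_history")
    numeric_traits  <- c("leaf_length", "seed_length")

    # Hypothetical helper: one string per taxon x trait x dataset combination.
    key <- function(df) paste(df$taxon_name, df$trait_name, df$dataset_id, sep = "|")

    # Step 1: drop the three complete traits from the original scrapings.
    original_kept <- original[!original$trait_name %in% complete_traits, ]

    # Step 2: drop original rows whose combination also exists in the new scrapings.
    original_kept <- original_kept[!key(original_kept) %in% key(new), ]

    # Steps 3-4: what remains exists only in the original scrapings. Categorical
    # rows are retained as-is; numeric rows are flagged for manual checking.
    original_kept$needs_manual_check <- original_kept$trait_name %in% numeric_traits

    # Combined table with at most one row per combination.
    combined <- rbind(original_kept[, names(new)], new)

In this toy case `original_kept` ends up holding only the flower_colour and seed_length rows, with the seed_length row flagged for a manual check.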

Overall, this has removed ~100,000 data points. These are almost entirely true duplicates:

nrow(austraits_develop$traits)
[1] 1813898
nrow(austraits_removed$traits)
[1] 1706226

dfalster commented 4 months ago

Nice work @ehwenk. This is quite a big PR, so we'll need to look at it together sometime.