Closed dfalster closed 5 years ago
Possible solutions are
@ehwenk and I prefer the 2nd option because
OK - I agree that a de-duplication step is a good option. As far as I can see, a lot of these arise because they are categorical traits (e.g. growth form, life history) that are fairly fixed across the published floras of Australia. I am not sure these really qualify as duplicates in the true sense; I think they are replicates across sources. The 'true' duplicates are much more likely to be the SLA measures, and those will really benefit from a de-duplication step.
Ok. I suspect with the floras there's a certain amount of repeating going on. But in any case, agreed on SLA. Some quick stats suggest maybe 381 / 4273 records are duplicates:
```r
> x$traits %>%
+   mutate(check = paste(trait_name, species_name, value), dup = duplicated(check)) %>%
+   arrange(check) %>%
+   filter(check %in% .$check[.$dup]) -> z
>
> z %>% filter(trait_name == "specific_leaf_area") %>% pull(dup) %>% sum()
[1] 381
>
> austraits$traits %>% filter(trait_name == "specific_leaf_area") %>% nrow()
[1] 4273
```
10% is not too bad, especially if we can implement an easy way to flag these. The issue will then be how users decide to attribute the observations, I guess.
@ehwenk has been looking for duplicates and suggested a more systematic check. Seems we have ~40k!!
To get this number I pasted together trait_name, species_name and value, and then looked for matches.
In total it looks like there are ~40,000 duplicate records.
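The paste-and-match check described above can be illustrated like this. The thread's real code is R/dplyr; this is an equivalent pandas sketch on made-up toy data, and none of these numbers come from AusTraits itself:

```python
# Toy illustration of the duplicate check: paste trait_name, species_name
# and value into one key, then flag repeated combinations.
# (Data below are invented for illustration, not from AusTraits.)
import pandas as pd

traits = pd.DataFrame({
    "trait_name":   ["specific_leaf_area", "specific_leaf_area",
                     "specific_leaf_area", "growth_form", "growth_form"],
    "species_name": ["Acacia dealbata", "Acacia dealbata",
                     "Banksia serrata", "Acacia dealbata", "Acacia dealbata"],
    "value":        ["12.1", "12.1", "8.4", "tree", "tree"],
})

# Same keying as the dplyr snippet: one pasted string per record
traits["check"] = (traits["trait_name"] + " "
                   + traits["species_name"] + " " + traits["value"])

# Like R's duplicated(): first occurrence False, later repeats True
traits["dup"] = traits["check"].duplicated()

n_dup = int(traits["dup"].sum())
print(n_dup)  # 2 repeated records in this toy example
```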
Here are the traits with the most overlap:
Here are the studies with some overlap with another study (or with themselves):
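For reference, overlap tables like these could in principle be computed by tabulating the flagged rows. A hedged pandas sketch on toy data (the `dataset_id` column as the study identifier is an assumption, not necessarily the AusTraits schema):

```python
# Hypothetical sketch: tabulate duplicate overlap per trait and per study.
# Toy data only; "dataset_id" as the study key is an assumption.
import pandas as pd

traits = pd.DataFrame({
    "trait_name":   ["specific_leaf_area", "specific_leaf_area",
                     "growth_form", "growth_form", "growth_form"],
    "species_name": ["sp1", "sp1", "sp2", "sp2", "sp3"],
    "value":        ["12.1", "12.1", "tree", "tree", "shrub"],
    "dataset_id":   ["Study_A", "Study_B", "Study_A", "Study_B", "Study_A"],
})

# Same keying as in the thread: paste trait, species and value together
traits["check"] = traits[["trait_name", "species_name", "value"]].agg(" ".join, axis=1)

# keep=False marks every member of a duplicated group, not just the repeats,
# so both copies of each duplicate pair are counted in the tabulations
dup_rows = traits[traits["check"].duplicated(keep=False)]

by_trait = dup_rows["trait_name"].value_counts()  # traits with most overlap
by_study = dup_rows["dataset_id"].value_counts()  # studies involved in overlap
print(by_trait.to_dict(), by_study.to_dict())
```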