Closed: DeirdreLoughnan closed this issue 3 years ago
I emailed the database manager yesterday; his response is below:
"Dear Darwin, TRY does not remove duplicates from submitted datasets. Duplicates are, however, marked in the field OrigObsDataID. Please check the release notes! We are currently working on R code to evaluate TRY data output; it is not ready for publication, however. With my compliments, Gerhard"
Thanks Darwin for contacting TRY.
Unfortunately, many of the rows of duplicated data in our dataset appear not to have OrigObsDataIDs!
@DarwinSodhi @DeirdreLoughnan Thanks to you both! It's nice to have such a quick reply. Perhaps many of these values are just replicates of the same species x same everything else? Does TRY have such data?
Also, @DeirdreLoughnan, how similar do values have to be to be considered dups? Looking here, I could not tell. Can you point me to the line number?
@lizzieinvancouver my understanding is that the rows have to be identical; the relevant code is lines 53 to 60. I copied this code from lines 23 to 43 of the code here
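For anyone following along, the check described above (a row counts as a duplicate only if it is identical across every field, as base R's `duplicated()` would flag it) can be sketched roughly like this; the column layout and values here are invented for illustration, not the actual TRY fields:

```python
# Sketch of whole-row duplicate removal, analogous to R's duplicated()/unique().
# Tuples stand in for data rows; fields and values are made up.
rows = [
    ("Acer rubrum", "SLA", 12.3, "dataset_A"),
    ("Acer rubrum", "SLA", 12.3, "dataset_A"),  # identical in every field -> dup
    ("Acer rubrum", "SLA", 15.1, "dataset_B"),  # differs in one field -> kept
]

seen = set()
deduped = []
for row in rows:
    if row not in seen:  # rows must match on EVERY field to count as dups
        seen.add(row)
        deduped.append(row)

n_removed = len(rows) - len(deduped)
print(n_removed)  # → 1
```

So a row that differs in even one field (a different value, a different dataset) survives the cleaning, which is why near-replicates of the same species would not be dropped.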
@DeirdreLoughnan You're right! I was thinking of a different cleaning script (clean_bbperctodays.R).
@DeirdreLoughnan Can you provide the total n rows of data to start with, the n rows lost in the cleaning of duplicates, and, of those, how many had a duplicate flag?
@lizzieinvancouver here are the values:
n rows of original data: 1258659
n rows lost by removing dups: 434905
n rows flagged as dups: 434905
@DeirdreLoughnan So, all the rows removed as duplicate were flagged as duplicates?
Yes, unless there is something I am missing, all the rows flagged by line 47 as being duplicates are the same as the rows being removed.
@DeirdreLoughnan Well, that's good news for TRY and for our coding abilities.
Sorry, I misinterpreted what you meant by flagged. If you meant the ones that have the same OrigObsDataIDs as the rows flagged as duplicates, then the numbers would be:
n rows of original data: 1258659
n rows lost by removing dups (i.e. flagged by us): 434905
n rows flagged as dups by TRY: 429877
So there is a difference of 5028 rows between what we flagged as duplicated and what TRY flagged.
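One way to see where a gap like that comes from is to take the set difference between the two flags directly. A rough sketch, with an invented mini-table (the field names beyond OrigObsDataID are assumptions, not the real TRY schema):

```python
# Compare rows we flag as duplicates (identical values) with rows TRY flags
# via a non-missing OrigObsDataID. Data are invented for illustration.
rows = [
    {"ObservationID": 1, "value": 12.3, "OrigObsDataID": None},
    {"ObservationID": 2, "value": 12.3, "OrigObsDataID": None},  # dup by value, unmarked by TRY
    {"ObservationID": 3, "value": 15.1, "OrigObsDataID": 3},     # first copy, marked by TRY
    {"ObservationID": 3, "value": 15.1, "OrigObsDataID": 3},     # dup by value, marked by TRY
]

# "Our" flag: any row identical to an earlier one (matching on value only, for brevity)
seen = set()
ours = set()
for r in rows:
    key = (r["value"],)
    if key in seen:
        ours.add(r["ObservationID"])
    seen.add(key)

# TRY's flag: OrigObsDataID is set
trys = {r["ObservationID"] for r in rows if r["OrigObsDataID"] is not None}

only_ours = ours - trys  # dups we catch that TRY did not mark
print(sorted(only_ours))  # → [2]
```

Rows like ObservationID 2 above, identical in every field but missing an OrigObsDataID, would account for dups we remove that TRY never flagged.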
@DeirdreLoughnan Ah, okay! Still, it's not as bad as I expected.
Thanks @DarwinSodhi for contacting TRY to ask them about the inclusion of duplicated data and their stance on data quality.
I look forward to hearing their response!