lizzieinvancouver / ospree

Budbreak review paper database

Contacting TRY about data quality #396

Closed DeirdreLoughnan closed 3 years ago

DeirdreLoughnan commented 3 years ago

Thanks @DarwinSodhi for contacting TRY to ask them about the inclusion of duplicated data and their stance on data quality.

I look forward to hearing their response!

DarwinSodhi commented 3 years ago

I emailed the database manager yesterday; his response is below:

"Dear Darwin, TRY does not remove duplicates from submitted datasets. Duplicates are, however, marked in the field OrigObsDataID. Please check the release notes! We are currently working on R code to evaluate TRY data output. It is not ready for publication, however. With my compliments, Gerhard"
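A minimal sketch of what Gerhard describes: duplicates stay in the data but carry a non-empty OrigObsDataID pointing back at the original observation. The field name is from his email; the toy rows and column names below are invented for illustration (the repo's actual cleaning code is in R):

```python
# Toy TRY-style records: OrigObsDataID is non-empty when the row is a
# marked duplicate of an earlier observation (per the TRY release notes).
rows = [
    {"ObservationID": "1", "OrigObsDataID": ""},   # original observation
    {"ObservationID": "2", "OrigObsDataID": "1"},  # marked duplicate of obs 1
    {"ObservationID": "3", "OrigObsDataID": ""},   # original observation
]

# Split rows into originals and marked duplicates.
originals = [r for r in rows if not r["OrigObsDataID"]]
marked_dups = [r for r in rows if r["OrigObsDataID"]]

print(len(originals), len(marked_dups))  # 2 1
```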

DeirdreLoughnan commented 3 years ago

Thanks Darwin for contacting TRY.

Unfortunately, many of the rows of duplicated data in our dataset appear not to have OrigObsDataIDs!

lizzieinvancouver commented 3 years ago

@DarwinSodhi @DeirdreLoughnan Thanks to you both! It's nice to have such a quick reply. Perhaps many of these values are just replicates of the same species x same everything else? Does TRY have such data?

Also, @DeirdreLoughnan how similar do values have to be to be considered dups? In looking here I could not tell. Can you point me to the line number?

DeirdreLoughnan commented 3 years ago

@lizzieinvancouver my understanding is that the data have to be identical; the relevant code is lines 53 to 60. I just copied this code from lines 23 to 43 of the code here
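For context, the "data have to be identical" criterion means a row is flagged only when it matches an earlier row in every field. A minimal Python sketch of that check (the repo's actual implementation is in R; the toy trait rows here are invented):

```python
# Flag rows that are exact duplicates of an earlier row, i.e. identical
# in every field -- the criterion used for removal in the cleaning code.
rows = [
    ("Quercus robur", "SLA", 12.3),
    ("Quercus robur", "SLA", 12.3),   # identical to the first row
    ("Fagus sylvatica", "SLA", 9.8),
]

seen = set()
dup_flags = []
for r in rows:
    dup_flags.append(r in seen)  # True only if an identical row came before
    seen.add(r)

deduped = [r for r, is_dup in zip(rows, dup_flags) if not is_dup]
print(dup_flags)     # [False, True, False]
print(len(deduped))  # 2
```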

lizzieinvancouver commented 3 years ago

@DeirdreLoughnan You're right! I was thinking of a different cleaning script (clean_bbperctodays.R).

@DeirdreLoughnan Can you provide the total n rows of data to start with, the n rows lost in the cleaning of duplicates, and, of those n rows, how many had a duplicate flag?

DeirdreLoughnan commented 3 years ago

@lizzieinvancouver here are the values:

n rows of original data: 1258659
n rows lost by removing dup: 434905
n rows flagged as dup: 434905

lizzieinvancouver commented 3 years ago

@DeirdreLoughnan So, all the rows removed as duplicate were flagged as duplicates?

DeirdreLoughnan commented 3 years ago

Yes, unless there is something I am missing, all the rows flagged by line 47 as being duplicates are the same as the rows being removed.

lizzieinvancouver commented 3 years ago

@DeirdreLoughnan Well, that's good news for TRY and for our coding abilities.

DeirdreLoughnan commented 3 years ago

Sorry I misinterpreted what you meant by flagged. If you meant the ones that have the same OrigObsDataID's as the rows flagged as duplicates, then the numbers would be:

n rows of original data: 1258659
n rows lost by removing dup (i.e., flagged by us): 434905
n rows flagged as dup by TRY: 429877

So there is a difference of 5028 rows between what we flag as being duplicated and what TRY flagged.
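The arithmetic behind these counts, using the numbers reported in this thread, can be checked in a couple of lines:

```python
# Counts reported in this thread.
n_original = 1258659       # rows of original TRY data
n_flagged_by_us = 434905   # rows removed as exact duplicates by our code
n_flagged_by_try = 429877  # rows carrying a duplicate OrigObsDataID

# Rows we flag as duplicates that TRY does not mark.
discrepancy = n_flagged_by_us - n_flagged_by_try
print(discrepancy)  # 5028

# Rows remaining after our duplicate removal.
n_remaining = n_original - n_flagged_by_us
print(n_remaining)  # 823754
```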

lizzieinvancouver commented 3 years ago

@DeirdreLoughnan Ah, okay! Still, it's not as bad as I expected.