junyinglim / TranscriptResolver


duplicate records #2

Closed JoyceGross closed 8 years ago

JoyceGross commented 8 years ago

I'm getting errors about some records appearing twice in the load file. Here are some examples:

Error .... Specimens Worksheet, Row 2895: bnhm_id "EMEC533611" was already used on Row 1872
Error .... Specimens Worksheet, Row 6533: bnhm_id "EMEC528658" was already used on Row 1935
Error .... Specimens Worksheet, Row 7483: bnhm_id "EMEC544886" was already used on Row 985
Error .... Specimens Worksheet, Row 8077: bnhm_id "EMEC528637" was already used on Row 2491
Error .... Specimens Worksheet, Row 9661: bnhm_id "EMEC533992" was already used on Row 3195
Error .... Specimens Worksheet, Row 11743: bnhm_id "EMEC530196" was already used on Row 5521
Error .... Specimens Worksheet, Row 12772: bnhm_id "EMEC515825" was already used on Row 9351

junyinglim commented 8 years ago

This is primarily because I extract the bnhm_ids from the image filenames (which is all I have from the Notes from Nature output).

These EMEC numbers are not unique and presumably refer to two different specimens:

EMEC533611 Lasioglossum ovaliceps.jpg
EMEC533611 Lasioglossum incompletum.jpg
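For reference, the extraction is essentially a prefix match on the filename. A minimal sketch (the function name and regex are illustrative, not the exact script code):

```python
import re

def extract_bnhm_id(filename):
    """Pull the EMEC catalog number off the front of an image filename,
    e.g. "EMEC533611 Lasioglossum ovaliceps.jpg" -> "EMEC533611"."""
    match = re.match(r"(EMEC\d+)", filename)
    return match.group(1) if match else None
```

Since both filenames above start with the same number, this yields two rows with bnhm_id "EMEC533611".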

Are these all the examples that your bulk upload error checker highlights?

Like the date issue, should I randomly remove one of the duplicates, remove both, or maybe even split them out into a separate sheet?

JoyceGross commented 8 years ago

OK, this is an error with the image files that was probably (but not necessarily) fixed in the Essig database; fixed or not, it persisted in the NfN copy of the images. Yes, images with the same numbers were uploaded, one in error. Since it isn't easy to figure out which record/file is correct, I lean towards removing both of them. I'll run this by Pete too, but I just don't see another way around it that isn't really messy and probably not worth the effort. The second-best option would be to randomly pick one record and remove the other. Will get back about this.

junyinglim commented 8 years ago

I've updated the transcriptClean script to simply remove both duplicates. You'll need the latest version of pandas for this (v0.17), which you can download from their website.

Let me know if that solves the issue.

poboyski commented 8 years ago

Please send the error log files to me and I will figure out which are the proper images and which are duplicates, then database them manually.

junyinglim commented 8 years ago

Duplicate records are now omitted from clean_transcript, and their bnhm_ids will be logged in a new file called error_transcript.csv
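In pandas this amounts to `duplicated(..., keep=False)`, which flags every copy of a repeated id rather than just the extras (the `keep` keyword arrived in pandas 0.17, hence the version requirement). A minimal sketch with made-up data; the column name `bnhm_id` and the file name follow this thread, everything else is illustrative:

```python
import pandas as pd

# Illustrative records; "EMEC000001" is a placeholder non-duplicate id.
df = pd.DataFrame({
    "bnhm_id": ["EMEC533611", "EMEC533611", "EMEC000001"],
    "filename": [
        "EMEC533611 Lasioglossum ovaliceps.jpg",
        "EMEC533611 Lasioglossum incompletum.jpg",
        "EMEC000001 example.jpg",
    ],
})

# keep=False marks every occurrence of a repeated bnhm_id
dup_mask = df.duplicated(subset="bnhm_id", keep=False)

# Log the offending ids, then drop all copies from the clean output
df.loc[dup_mask, ["bnhm_id"]].to_csv("error_transcript.csv", index=False)
clean = df[~dup_mask]
```

Here `clean` keeps only the placeholder row, and error_transcript.csv lists "EMEC533611" twice, once per removed record.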