junyinglim / TranscriptResolver


date issue #3

Closed JoyceGross closed 9 years ago

JoyceGross commented 9 years ago

Also getting some date errors (this is the complete list):

Error .... Specimens Worksheet, Row 3336: Day Collected entered with no Month Collected.
Error .... Specimens Worksheet, Row 4510: Day Collected entered with no Month Collected.
Error .... Specimens Worksheet, Row 7625: Day Collected entered with no Month Collected.
Error .... Specimens Worksheet, Row 23360: Day Collected entered with no Month Collected.
Error .... Specimens Worksheet, Row 24990: Day Collected entered with no Month Collected.
Error .... Specimens Worksheet, Row 26274: Day Collected entered with no Month Collected.
Error .... Specimens Worksheet, Row 39653: Day Collected entered with no Month Collected.

junyinglim commented 9 years ago

Can you provide the bnhm_id values so I can see what's wrong with the resolving process?

JoyceGross commented 9 years ago

Yes! Here they are. EMEC524580 EMEC529396 EMEC505417 EMEC529406 EMEC594319 EMEC529410 EMEC529070

junyinglim commented 9 years ago

I realize it might be easier if you just emailed me the "clean" file. I noticed your error messages provide a row number, so I can potentially cross-reference the rows myself. That might save you some time looking the IDs up!
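For illustration only, here is a minimal Python sketch of that cross-referencing step. It assumes the error messages keep the "Row N:" format quoted earlier in this thread, that the clean file has already been read into a list of dicts in file order, and that worksheet row 2 is the first data row (row 1 being the header). The function names, the `bnhm_id` key, and the row offset are all hypothetical and may need adjusting.

```python
import re

# Matches the "Row 3336:" part of a validator message.
ROW_PATTERN = re.compile(r"Row (\d+):")

def error_rows(error_lines):
    """Extract row numbers from messages that contain 'Row N:'."""
    return [int(m.group(1)) for line in error_lines
            if (m := ROW_PATTERN.search(line))]

def ids_for_rows(records, row_numbers, first_data_row=2, id_key="bnhm_id"):
    """Map worksheet row numbers to IDs.

    records is a list of dicts in file order. first_data_row is an
    assumption (header on row 1); change it if the worksheet differs.
    """
    ids = {}
    for row in row_numbers:
        index = row - first_data_row
        if 0 <= index < len(records):
            ids[row] = records[index].get(id_key, "")
    return ids
```

If the looked-up IDs do not match the ones reported for the same rows, the row offset is the first thing to check.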

JoyceGross commented 9 years ago

Good idea. Because of the large size of the file (~19MB) rather than email it I've put it here for you to download: http://essigdb.berkeley.edu/clean_transcript.xls (I created clean_transcript.xls from clean_transcript.csv. It's the same data as in clean_transcript.csv except I edited the UC Riverside holdingInstitution and changed CIS->UCIS.)

junyinglim commented 9 years ago

I've had a look at the images to see why our transcribers are having trouble with dates for these specimens, and it's pretty obvious - the label writing is APPALLING!!

Would you like me to chuck these incomplete dates out completely? It feels like we'll be chucking away the information that they did get right though. It also does seem like they are fairly rare; perhaps somebody doing the bulk uploading will have to fill in the missing information themselves by checking back to the image? It's up to you how we wanna deal with this.

JoyceGross commented 9 years ago

I say chuck those dates. It's ok if there is a month and not a day, but not ok if there is a day and not a month. So maybe just chuck the records with days but missing months? Since there aren't that many records with this problem I don't think we'll be throwing away too much.

junyinglim commented 9 years ago

OK, done. The latest version of transcriptClean should chuck the "day" field if the "month" field is empty. I'll close this issue once you've tested it.
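The rule itself is small enough to sketch. The following Python snippet is not the transcriptClean code (which is not shown in this thread); it is a minimal illustration of the rule as described, with hypothetical field names `day_collected` and `month_collected`.

```python
def drop_day_without_month(record,
                           day_key="day_collected",
                           month_key="month_collected"):
    """Return a copy of record with the day blanked if the month is missing.

    The key names are assumptions for illustration only.
    """
    cleaned = dict(record)
    day = str(cleaned.get(day_key) or "").strip()
    month = str(cleaned.get(month_key) or "").strip()
    if day and not month:
        # A day without a month is rejected by the loader, so drop the day.
        cleaned[day_key] = ""
    return cleaned
```

A record with a month but no day passes through unchanged, which matches the rule agreed above: month without day is fine, day without month is not.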

poboyski commented 9 years ago

Are the errors getting logged somewhere? It is fine to drop the dates with no months, but with the error logs I can assign someone to check them manually.

JoyceGross commented 9 years ago

The records won't load at all with a bad date. However, Jun may be able to write the bnhm_id to a log when encountering a bad date. Something like "EMEC529410 has a day but not a month. Collecting date for this record has been chucked."

poboyski commented 9 years ago

I think we all agree that it is not worthwhile for Jun to write exceptions for all quirky data. I am happy to have a log file that provides BNHM_ID and Problem_Description. For example, "EMEC529410, Date issue", "EMEC12345, County issue". Preferably the offending field will be chucked, but the rest of the record will get processed.

junyinglim commented 9 years ago

Any quirky date errors will be logged in a new file called error_transcript.csv. The offending record itself will not be chucked, only the field in question. See the transcriptClean changes for more details.
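As a rough sketch of that behaviour (again, not the actual transcriptClean implementation), the Python below blanks the day when the month is missing, keeps the rest of the record, and writes one log row per problem using the BNHM_ID and Problem_Description columns suggested above. The function name, the record keys, and the "Date issue" wording are assumptions for illustration.

```python
import csv

def clean_dates(records, log_file):
    """Blank day-without-month fields and log each affected record.

    records: iterable of dicts with hypothetical keys bnhm_id,
    day_collected, month_collected. log_file: a writable text file
    object, e.g. open("error_transcript.csv", "w", newline="").
    """
    writer = csv.writer(log_file)
    writer.writerow(["BNHM_ID", "Problem_Description"])
    cleaned = []
    for record in records:
        row = dict(record)
        day = str(row.get("day_collected") or "").strip()
        month = str(row.get("month_collected") or "").strip()
        if day and not month:
            row["day_collected"] = ""  # drop only the offending field
            writer.writerow([row.get("bnhm_id", ""), "Date issue"])
        cleaned.append(row)  # the record itself is always kept
    return cleaned
```

Passing the log as a file object keeps the function easy to test with `io.StringIO`; other problem types, such as the county issue mentioned above, could write to the same log with a different description.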