automatic MC extraction: quality control issue

gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine


automatic MC extraction: quality control issue #19

Closed myrmoteras closed 7 years ago

myrmoteras commented 7 years ago

@gsautter did you check the results from the automatically extracted materialsCitation? How many of the MC are correct, how many are wrong, how many are not found?

Within the MC: how many of the elements are properly detected? how many are wrong?

How can I get all the MC of one day?

Or how do you suggest we can check this?

Though it is a cool tool, it has big potential to wreck our reputation if we produce too many wrong MC. I want to be sure we deliver high quality data.

gsautter commented 7 years ago

You can get all the MCs from the stats, filtering on the day of first upload. You just have to select all the fields in Materials Citation Data.
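If the stats result is exported as CSV, the same per-day filter could also be applied offline. The following is only a rough sketch under that assumption: the column name `firstUploadDate` and the sample rows are hypothetical placeholders, not the actual field names or data from the stats.

```python
import csv
import io
from datetime import date


def citations_for_day(csv_text, day, date_field="firstUploadDate"):
    """Return the rows whose first-upload date falls on `day`.

    `date_field` is a hypothetical column name; adjust it to whatever
    the exported stats actually call the first-upload date.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    wanted = day.isoformat()
    # Compare only the leading YYYY-MM-DD part so full timestamps match too.
    return [row for row in reader if (row.get(date_field) or "")[:10] == wanted]


# Made-up sample export, for illustration only.
sample = """firstUploadDate,country,collector
2017-03-01,Brazil,Smith
2017-03-02,Peru,Jones
2017-03-01T10:15:00,Chile,Lee
"""

rows = citations_for_day(sample, date(2017, 3, 1))
print(len(rows))
```

The prefix comparison is a deliberate simplification: it assumes the dates are ISO formatted, which may not hold for the real export.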

gsautter commented 7 years ago

As to the quality, hard to tell. Weren't the dashboards supposed to check plausibility?

myrmoteras commented 7 years ago

How do you check whether the method works to a certain level?
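One possible way to put a number on this would be to hand-check a random sample of documents and label each MC as correct, wrong, or missed. The sketch below is not anything that exists in the tool; the label names and the example labels are assumptions made for illustration, and "wrong" is taken to mean "extracted, but not a valid MC".

```python
from collections import Counter


def summarize(labels):
    """Compute precision and recall from manual review labels.

    Each label is one of (hypothetical names):
      'correct' - extracted MC that matches the text
      'wrong'   - extracted MC that is not valid
      'missed'  - MC present in the text but not extracted
    """
    counts = Counter(labels)
    extracted = counts["correct"] + counts["wrong"]   # what the extractor produced
    present = counts["correct"] + counts["missed"]    # what is really in the text
    precision = counts["correct"] / extracted if extracted else 0.0
    recall = counts["correct"] / present if present else 0.0
    return {"precision": precision, "recall": recall}


# Illustrative labels only, not real review results.
result = summarize(["correct", "correct", "correct", "wrong", "missed"])
print(result)
```

The same tally could be kept per element (date, coordinates, collector, location name) to see which parts of an MC are detected properly.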

gsautter commented 7 years ago

The parts that work properly are surely the dates and coordinates, and for the most part the specimen counts, countries, and regions (as long as they are in the gazetteer), and elevations. Collector names also tend to work well if labeled, and so do collection codes if they are not too exotic. Location names are the most critical parts, as they are the most diverse.

gsautter commented 7 years ago

As to evaluation: We tested it on a pretty diverse set of documents, and deployed it only after we were satisfied with the results. I occasionally check when tracking down figures in the XML view, and it mostly looks good.

If you have any bad errors, just let me know so I can investigate.

gsautter commented 7 years ago

No bad news on this one in a long time ... guess it's time to lay it to rest.

myrmoteras commented 7 years ago

No bad news means that nobody really checked. We should in fact examine this data!

gsautter commented 7 years ago

Well, I do check whenever I open a page on TreatmentBank for whatever reason ... and didn't really find any problems thus far.

Also, an open ticket way down the long tail is a pretty bad reminder for a routine task that we should be doing. That belongs on some (daily) checklist, as checking up on the server is on mine.

I really think tickets in trackers like this one are intended for immediate issues that need solving, and the ticket can be closed afterwards. Any mis-parsed materials citations you find in a daily check would qualify there, for instance, as they mean I have to re-visit the parser. But a note intended to remind you to check for such issues on a daily basis does not.