Open 1313ou opened 2 weeks ago
It would be informative if you shared the code used to automate the fixes. For instance, how did you manage
Care has been taken to keep an initial upper-case letter where required, as in 'God's mercy'.
@arademaker, you'll find here the classification on which this formatting is based (see the second sheet for more explanations).
Basically, it's a Stanza-assisted 'hand' review of all examples. Stanza's deep dependency and constituency analyses (in the last two columns) are designed to flag certain conditions, but a great number of their findings are overridden by the review.
Dependencies are analyzed along these lines: find a verb phrase, check whether it has a subject, consider the mood feature, etc., to determine whether the input is a sentence.
A column is dedicated to directions that override the standard formatting behaviour (as in 'God's mercy').
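For readers who want a concrete picture of that kind of check, here is a minimal sketch. It is not the code used for this PR: the `Token` class, the `classify` function and the label strings are hypothetical, and the tokens are filled in by hand using Universal Dependencies conventions (`upos`, `deprel`, `feats`), which is roughly the information a parser such as Stanza provides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    """One parsed word, using Universal Dependencies field names (hypothetical container)."""
    id: int          # 1-based position in the example
    text: str
    upos: str        # universal part-of-speech tag, e.g. "VERB"
    head: int        # id of the governing token, 0 for the root
    deprel: str      # dependency relation to the head, e.g. "nsubj"
    feats: str = "_"  # morphological features, e.g. "Mood=Imp|VerbForm=Fin"


SUBJECT_RELS = {"nsubj", "nsubj:pass", "csubj", "csubj:pass", "expl"}


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Mood=Imp|VerbForm=Fin' into {'Mood': 'Imp', 'VerbForm': 'Fin'}."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def classify(tokens: List[Token]) -> str:
    """Rough classification: find the predicate, look for a subject, then check the mood."""
    root = next((t for t in tokens if t.head == 0), None)
    if root is None:
        return "non-verbal phrase"
    children = [t for t in tokens if t.head == root.id]
    # A predicate is verbal if the root is a verb or carries a copula ("he is tall").
    verbal = root.upos in {"VERB", "AUX"} or any(c.deprel == "cop" for c in children)
    if not verbal:
        return "non-verbal phrase"
    if any(c.deprel in SUBJECT_RELS for c in children):
        return "sentence"
    if parse_feats(root.feats).get("Mood") == "Imp":
        return "imperative"
    return "verb phrase"


# Hand-built parses, for illustration only.
sentence = [
    Token(1, "she", "PRON", 2, "nsubj"),
    Token(2, "treats", "VERB", 0, "root", "Mood=Ind|VerbForm=Fin"),
    Token(3, "the", "DET", 4, "det"),
    Token(4, "infection", "NOUN", 2, "obj"),
]
noun_phrase = [
    Token(1, "God", "PROPN", 3, "nmod:poss"),
    Token(2, "'s", "PART", 1, "case"),
    Token(3, "mercy", "NOUN", 0, "root"),
]
print(classify(sentence), "|", classify(noun_phrase))
```

A parser's output would still need the manual review described above; a sketch like this only produces a first guess for each example.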
The (limited) redundancy doesn't mean it's fool-proof: 49k examples are a lot to review and errors are bound to slip in. Fixes will be welcome.
This is a large PR and I would flag that it is automatically constructed, which is something that we advise against in our contribution guidelines. I would probably reject this from a new contributor, but as @1313ou has made many good PRs, I trust the quality of this contribution.
A quick check shows that 1,323/49,638 (2.7%) of examples end with a period and 11,337/49,638 (22.8%) of examples start with a capital letter. As such, it seems that we have an inconsistency that this PR would improve.
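A check of this kind can be reproduced along the following lines. This is only a sketch: the function names `share` and `punctuation_stats` are made up here, the four sample strings are placeholders, and how the examples are loaded from the actual data files is left out.

```python
from typing import Dict, List


def share(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to one decimal place."""
    return round(100 * count / total, 1) if total else 0.0


def punctuation_stats(examples: List[str]) -> Dict[str, int]:
    """Count examples that end with a period and examples that start with a capital."""
    stripped = [e.strip() for e in examples]
    return {
        "total": len(stripped),
        "ends_with_period": sum(1 for e in stripped if e.endswith(".")),
        "starts_with_capital": sum(1 for e in stripped if e[:1].isupper()),
    }


# Placeholder data; the real input would be every example in the resource.
sample = [
    "God's mercy",
    "treat the infection with antibiotics",
    "She squared the circle.",
    "square the circle",
]
stats = punctuation_stats(sample)
for key in ("ends_with_period", "starts_with_capital"):
    print(f"{key}: {stats[key]}/{stats['total']} ({share(stats[key], stats['total'])}%)")
```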
I would choose to accept this, but I will leave it open to the rest of the community.
You may have noticed that the punctuation and capitalization of examples are inconsistent.
This PR is meant to remedy this, based on a classification of examples. Classification was partly automated with syntactic dependencies as provided by spaCy, then Stanza, analyzed by an ad-hoc algorithm and cross-checked by 'hand' review. Unfortunately, deep models were largely ineffective: I didn't find one trained on dictionary data, and the others didn't perform well.
Examples are split into:
The dividing line is sometimes hard to draw between verb phrases and imperatives: 'treat the infection with antibiotics' could be an instruction (imperative) or just a verb phrase expressing collocations, usually object complements. Fortunately, words such as 'you', 'your' ... favor classification as an imperative, while 'one', 'one's' ... favor classification as a verb phrase. Sometimes an imperative context can hardly be imagined (as for 'square the circle').
As this is partly automated and the volume reviewed is huge, mistakes must have slipped through because of misjudgment, oversight or simply fatigue. This is inevitable.
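The pronoun cue can be sketched as a simple lookup. The function name `pronoun_cue`, the label strings and the exact word lists are assumptions made for illustration (only 'you', 'your', 'one' and 'one's' are mentioned above), and the two short test phrases are made up. Note that 'one' can also be a numeral, so this is a hint for the reviewer, not a decision.

```python
import re

# Cue words; entries beyond 'you', 'your', 'one', "one's" are assumed extensions.
IMPERATIVE_CUES = {"you", "your", "yours", "yourself"}
VERB_PHRASE_CUES = {"one", "one's", "oneself"}


def pronoun_cue(example: str) -> str:
    """Return which reading the pronouns in `example` favor, if any."""
    # Lower-case words, keeping an internal straight apostrophe ("one's").
    words = set(re.findall(r"[a-z]+(?:'[a-z]+)?", example.lower()))
    if words & IMPERATIVE_CUES:
        return "imperative"
    if words & VERB_PHRASE_CUES:
        return "verb phrase"
    return "undecided"


for phrase in ("mind your step", "lose one's temper", "treat the infection with antibiotics"):
    print(f"{phrase!r}: {pronoun_cue(phrase)}")
```

Examples that come back as "undecided", like the 'treat the infection' case, are the ones that still need a human call.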