openeventdata / UniversalPetrarch

Language-agnostic political event coding using universal dependencies
MIT License

Separate out "must pass" and accuracy assessment tests #54

Open ahalterman opened 5 years ago

ahalterman commented 5 years ago

We're in a netherworld right now of intermingled unit tests and accuracy assessment tests. Some tests measure whether the program is functioning at all, while others are more like measures of real-world performance. It would be really useful to separate these out: UP should have a test suite on which new changes must achieve a 100% pass rate before they can be merged, and a separate, GSR-based set that measures changes in expected performance.
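A minimal sketch of one way to make this split concrete, using pytest markers; everything here (the marker names, the stand-in `code_sentence` function) is an illustrative assumption, not the actual UniversalPetrarch API:

```python
# Hypothetical marker-based split; code_sentence is a toy stand-in
# for the real coder, not a UniversalPetrarch function.
import pytest

def code_sentence(text):
    # Toy stand-in: return one coded event for an obvious attack sentence.
    return [("REB", "GOV", "190")] if "attacked" in text else []

@pytest.mark.mustpass   # functional: CI requires 100% of these to pass
def test_coder_extracts_an_event():
    assert code_sentence("Rebels attacked the government.") != []

@pytest.mark.accuracy   # GSR-derived: tracked for performance, not gating
def test_gsr_style_record():
    assert code_sentence("Markets rose on Tuesday.") == []
```

CI could then gate merges on `pytest -m mustpass` while running `pytest -m accuracy` as a separate, non-blocking report (the two markers would need to be registered in `pytest.ini` to avoid warnings).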

PTB-OEDA commented 5 years ago

Do we have a breakdown of which test records fall into each class? Who handles making this split in the EN, ES, and AR records?


ahalterman commented 5 years ago

I think the tests here are the must-pass set, since that's the purpose they served in TABARI, Petr1, Petr2, etc. There are also some sentences in a separate file here that are must-pass. I also thought there were unit tests for the individual methods in UniversalPetrarch, but I'm not finding them (example from Mordecai). Any commit to the code should keep a 100% pass rate on all of these.
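For the missing method-level tests, a sketch in the Mordecai style, exercising a single function in isolation; `read_dictionary` and its behavior are hypothetical, not an actual UniversalPetrarch method:

```python
# Hypothetical method-level unit test: one function, no coder pipeline.
def read_dictionary(lines):
    # Stand-in parser: maps "PHRASE = CODE" lines to a dict, skipping
    # anything without an "=" (blank lines, comments).
    return dict(tuple(part.strip() for part in line.split("=", 1))
                for line in lines if "=" in line)

def test_read_dictionary_skips_blank_lines():
    lines = ["ATTACK = 190", "", "PROTEST = 140"]
    assert read_dictionary(lines) == {"ATTACK": "190", "PROTEST": "140"}
```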

"Accuracy assessment" would mostly the the GSRs, along with any sentences in the "Test Suite" that we've decided are no longer "must pass". The phrase info in #44 would also be in this category. Any change to the code should at a minimum not decrease accuracy on these records.

Regarding English vs. Spanish vs. Arabic, the first set should ideally be language-agnostic. The second set will of course be language specific, but we've already got all of those finished and separated out for each language.

philip-schrodt commented 5 years ago

In the KEDS/TABARI/PETR-1 lineage, everything in the validation suite (eventually about 250 cases for TABARI and PETR-1) was a "must-pass." Most of these pre-dated the widespread use of GitHub, and certainly of the combination of automated testing and commits, but during development and any subsequent changes the programs were expected to run cleanly through all of the cases with a 100% pass rate before those changes were considered okay. Most of the cases are, effectively, unit tests produced when various features (e.g. patterns, compounds) were being developed; the remainder are quirky cases we encountered over the years that caused the program to freeze or crash under odd syntactic situations. None of these were GSRs -- in fact a lot of them are completely artificial and are not even grammatically correct English. After the first years of the KEDS project (the work that resulted in the 1994 ISQ and AJPS papers), we didn't have sufficient funding to produce GSRs (plus there were the IP issues).

As best I can tell, Clayton started by generating formal unit tests -- that is, tests oriented to very specific functions -- in PETR-2, then transitioned to something closer to the TABARI validation approach (lots of artificial, though now grammatical, sentences clearly designed to test very specific functions), and finally there are a few Gigaword cases with real news articles. But, like everything in PETR-2, this was never completed in any sort of comprehensive fashion: PETR-2 is more of a proof-of-concept than a fully functioning coder.