jonorthwash / ud-annotatrix

GNU General Public License v3.0

Stress testing benchmarks #316

Closed keggsmurph21 closed 6 years ago

keggsmurph21 commented 6 years ago

using the script added in f29ad85, I downloaded all of the treebanks (312 in total, names and sizes listed here) from the Universal Dependencies home page and tried to parse them with the notatrix tool (this is the same pipeline used for /upload and for parsing the textarea).

on first try, 297 passed, 3 timed out (i.e. took longer than 60 seconds), and 12 failed for one reason or another. once i figure out why they failed, i'll add to this issue

timing out:

failing:

so maybe not so surprisingly, the timeout failures are the biggest corpora in the data set. this is where it would be nice to have the sqlite3 interface
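
A minimal sketch of the pass / timeout / fail classification described above, assuming each treebank is parsed in its own child process. The actual script from f29ad85 is not shown in this thread, so `runWithTimeout`, `summarize`, and the `parse-one.js` script named in the comment are hypothetical; only `child_process.spawnSync` and its `timeout` option are real Node APIs.

```javascript
const { spawnSync } = require("child_process");

// Run `node <nodeArgs...>` in a child process and classify the outcome.
// Hypothetical usage: runWithTimeout(["parse-one.js", "some-treebank.conllu"])
// where parse-one.js is an assumed script that exits non-zero on a parse error.
function runWithTimeout(nodeArgs, timeoutMs = 60000) {
  const started = Date.now();
  const result = spawnSync(process.execPath, nodeArgs, {
    timeout: timeoutMs, // child is killed if it runs longer than this
    encoding: "utf8",
  });
  const elapsedMs = Date.now() - started;

  // spawnSync reports a timeout through result.error with code ETIMEDOUT
  if (result.error && result.error.code === "ETIMEDOUT") {
    return { status: "timeout", elapsedMs };
  }
  // any other spawn error, a kill signal, or a non-zero exit counts as a failure
  if (result.error || result.status !== 0) {
    return { status: "failed", elapsedMs, stderr: result.stderr };
  }
  return { status: "passed", elapsedMs };
}

// Count how many runs ended in each state.
function summarize(outcomes) {
  const counts = { passed: 0, timeout: 0, failed: 0 };
  for (const outcome of outcomes) {
    counts[outcome.status] += 1;
  }
  return counts;
}
```

Running each file in a separate process also means that a crash or out-of-memory error on one corpus does not abort the rest of the batch.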

keggsmurph21 commented 6 years ago

so bm_crb-ud-dev.conllu looks like it was just a formatting issue with their data, so i submitted a pull request here

keggsmurph21 commented 6 years ago

for fr_gsd-ud-train.conllu it seems like the file just got corrupted while it was writing to disk ... weird

keggsmurph21 commented 6 years ago

fro_srcmf-ud-dev.conllu was an issue with the sentence splitter (fixed in notatrix commit 79df2be)
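
For context: the thread does not say what the splitter bug was, and the code below is not notatrix's implementation. CoNLL-U separates sentences with blank lines, so a minimal splitter might look like this sketch, which assumes that mixed line endings and trailing whitespace on the separator lines should be tolerated. The name `splitSentences` is made up for illustration.

```javascript
// Split raw CoNLL-U text into one string per sentence block.
function splitSentences(conllu) {
  return conllu
    .replace(/\r\n?/g, "\n") // normalize Windows / old Mac line endings
    .split(/\n\s*\n/) // a blank (or whitespace-only) line ends a sentence
    .map((block) => block.trim())
    .filter((block) => block.length > 0); // drop empty leading/trailing blocks
}
```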

keggsmurph21 commented 6 years ago

hy_armtdp-ud-train.conllu is working now, so it was probably the same issue as fro_srcmf-ud-dev.conllu

keggsmurph21 commented 6 years ago

ja_gsd-ud-new_train.conllu was an issue with their dataset, check out this pull request for details

keggsmurph21 commented 6 years ago

myv-ud-dev.conllu, see this pull request

keggsmurph21 commented 6 years ago

myv_ChetvergovJevgenij_Velenj-vajgeljtj_1992_UD-dev-2011.conllu, see this issue

martinpopel commented 6 years ago

I agree pushing the treebank maintainers to fix CoNLL-U format errors via GitHub issues may be helpful (and PRs are even more helpful, if the given treebank has "Contributing: here" in README.txt). Just for your info: most of these errors (all I checked) should be detected by validate.py and these errors are already reported at http://quest.ms.mff.cuni.cz/cgi-bin/zeman/unidep/validation-report.pl. Only treebanks with no format errors can be released in UD. If you want to work with the released treebanks, use the "master" branch instead of "dev" or download UD2.2 from http://hdl.handle.net/11234/1-2837 (or wait until November 15, 2018 for UD2.3). See also http://universaldependencies.org/release_checklist.html#validation

That said, if you spot any format errors which are not detected by validate.py (or content errors not detected by Udapi, but easy to detect automatically without many false alarms), please file a new issue/PR to improve these validators - or improve the UD guidelines to make it clear what is valid and what not.

keggsmurph21 commented 6 years ago

myv_jr-ud-dev.conllu is basically the same as the other myv*.conllu files

keggsmurph21 commented 6 years ago

@martinpopel well the reason i am using the dev branches is because i intend this to be a tool for developers. i'm working on handling parse errors better internally, but i figured i might as well push my discoveries upstream when they're obviously just spelling/formatting errors

i appreciate the feedback though :) i'll check out those other tools

keggsmurph21 commented 6 years ago

swl_sslc-ud-*.conllu all pass now, must have fixed whatever the issue was

keggsmurph21 commented 6 years ago

same for yue_hk-ud-test.conllu

keggsmurph21 commented 6 years ago

cs_pdt-ud-train-l.conllu eventually terminates (after 105 seconds), but it takes more than node's default memory capacity (512mb)

luckily, we can circumvent this issue by just running with a flag: node --max-old-space-size=XXX (where XXX is given in megabytes)
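
To check that the flag takes effect, one can compare the heap limit V8 reports with and without it. This is only a sketch: the helper names `heapLimitMb` and `heapLimitWithFlag` are made up, while `v8.getHeapStatistics()` and `--max-old-space-size` are standard Node features. The reported limit may differ somewhat from the flag value, since it is not restricted to old space.

```javascript
const { spawnSync } = require("child_process");
const v8 = require("v8");

// Heap limit of the current process, in megabytes.
function heapLimitMb() {
  return v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
}

// Start a child node process with the given old-space limit (in MB)
// and return the heap limit that child reports, in megabytes.
function heapLimitWithFlag(maxOldSpaceMb) {
  const result = spawnSync(
    process.execPath,
    [
      `--max-old-space-size=${maxOldSpaceMb}`,
      "-e",
      "console.log(require('v8').getHeapStatistics().heap_size_limit)",
    ],
    { encoding: "utf8" }
  );
  return Number(result.stdout.trim()) / (1024 * 1024);
}

console.log(`current limit: ${heapLimitMb().toFixed(0)} MB`);
console.log(`with flag set to 4096: ${heapLimitWithFlag(4096).toFixed(0)} MB`);
```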

keggsmurph21 commented 6 years ago

ja_bccwj-ud-train.conllu terminates in 98 seconds

keggsmurph21 commented 6 years ago

ru_syntagrus-ud-train.conllu terminates in 140 seconds