Closed keggsmurph21 closed 6 years ago
so bm_crb-ud-dev.conllu looks like it was just a formatting issue with their data, so i submitted a pull request here
for fr_gsd-ud-train.conllu it seems like the file just got corrupted while it was writing to disk ... weird
fro_srcmf-ud-dev.conllu was an issue with the sentence splitter (fixed in notatrix commit 79df2be)
hy_armtdp-ud-train.conllu is working now, so it was probably the same issue as fro_srcmf-ud-dev.conllu
ja_gsd-ud-new_train.conllu was an issue with their dataset, check out this pull request for details
myv-ud-dev.conllu, see this pull request
myv_ChetvergovJevgenij_Velenj-vajgeljtj_1992_UD-dev-2011.conllu, see this issue
I agree pushing the treebank maintainers to fix CoNLL-U format errors via GitHub issues may be helpful (and PRs are even more helpful, if the given treebank has "Contributing: here" in README.txt). Just for your info: most of these errors (all I checked) should be detected by validate.py and these errors are already reported at http://quest.ms.mff.cuni.cz/cgi-bin/zeman/unidep/validation-report.pl. Only treebanks with no format errors can be released in UD. If you want to work with the released treebanks, use the "master" branch instead of "dev" or download UD2.2 from http://hdl.handle.net/11234/1-2837 (or wait until November 15, 2018 for UD2.3). See also http://universaldependencies.org/release_checklist.html#validation
That said, if you spot any format errors which are not detected by validate.py (or content errors not detected by Udapi, but easy to detect automatically without many false alarms), please file a new issue/PR to improve these validators, or improve the UD guidelines to make it clear what is valid and what is not.
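To illustrate the kind of format error under discussion: CoNLL-U token lines carry exactly 10 tab-separated fields, comment lines start with `#`, and blank lines separate sentences. The sketch below checks only the field count and is in no way a substitute for validate.py; the function name `findColumnErrors` is made up for this example.

```javascript
// Minimal sketch: flag CoNLL-U lines that do not have exactly 10
// tab-separated fields. Comments (#...) and blank lines are skipped.
// `findColumnErrors` is a hypothetical helper, not part of any UD tool.
function findColumnErrors(text) {
  const errors = [];
  const lines = text.split(/\r?\n/);
  lines.forEach((line, i) => {
    if (line === '' || line.startsWith('#')) return;
    const columns = line.split('\t').length;
    if (columns !== 10) {
      errors.push({ line: i + 1, columns });
    }
  });
  return errors;
}
```

Running it over a file's contents returns a list of 1-based line numbers with the field count actually found, which is usually enough to locate a stray space or missing tab.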
myv_jr-ud-dev.conllu is basically the same as the other myv*.conllu files
@martinpopel well the reason i am using the dev branches is because i intend this to be a tool for developers. i'm working on handling parse errors better internally, but i figured i might as well push my discoveries upstream when they're obviously just spelling/formatting errors
i appreciate the feedback though :) i'll check out those other tools
swl_sslc-ud-*.conllu all pass now, must have fixed whatever the issue was
same for yue_hk-ud-test.conllu
cs_pdt-ud-train-l.conllu eventually terminates (after 105 seconds), but it takes more than node's default memory capacity (512mb)
luckily, we can circumvent this issue by just running with a flag: node --max-old-space-size=XXX (where XXX is given in megabytes)
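One way to confirm the flag took effect is to print the heap limit from inside the process. This is a small sketch using Node's built-in `v8` module; the helper name `heapLimitMb` is made up here, and the reported number is typically a bit above the value passed to the flag because it covers more than the old space.

```javascript
// Sketch: report the V8 heap size limit for the current process, in MB.
// `heapLimitMb` is a hypothetical helper name.
const v8 = require('v8');

function heapLimitMb() {
  const bytes = v8.getHeapStatistics().heap_size_limit;
  return Math.round(bytes / (1024 * 1024));
}

console.log(`heap limit: ${heapLimitMb()} MB`);
```

For example, saving this as check-heap.js (name assumed) and running `node --max-old-space-size=4096 check-heap.js` should print a limit in the neighborhood of 4096 MB.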
ja_bccwj-ud-train.conllu terminates in 98 seconds
ru_syntagrus-ud-train.conllu terminates in 140 seconds
using the script added in f29ad85, I downloaded all of the treebanks (312 in total, names and sizes listed here) from the universal dependencies home page and tried to parse them with the notatrix tool (this is the same pipeline used for /upload and for parsing the textarea).
on first try, 297 passed, 3 timed out (i.e. took longer than 60 seconds), and 12 failed for one reason or another. once i figure out why they failed, i'll add to this issue
timing out:
cs_pdt-ud-train-l.conllu (77mb)
ja_bccwj-ud-train.conllu (85mb)
ru_syntagrus-ud-train.conllu (77mb)
failing:
bm_crb-ud-dev.conllu
fr_gsd-ud-train.conllu
fro_srcmf-ud-dev.conllu
hy_armtdp-ud-train.conllu
ja_gsd-ud-new_train.conllu
myv-ud-dev.conllu
myv_ChetvergovJevgenij_Velenj-vajgeljtj_1992_UD-dev-2011.conllu
myv_jr-ud-dev.conllu
swl_sslc-ud-dev.conllu
swl_sslc-ud-test.conllu
swl_sslc-ud-train.conllu
yue_hk-ud-test.conllu
so maybe not so surprisingly, the timeout failures are the biggest corpora in the data set. this is where it would be nice to have the sqlite3 interface
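A rough sketch of how pass / timeout / fail tallies like the ones above could be produced. The `parse` argument is a hypothetical stand-in for the parser under test (it is not notatrix's actual API), and because a synchronous parse cannot be interrupted in-process, this version only measures elapsed time after the fact rather than aborting at 60 seconds.

```javascript
// Sketch: classify one parse attempt as passed, timedOut, or failed.
// `parse` is a hypothetical synchronous parser function supplied by the
// caller; `limitMs` defaults to the 60-second cutoff mentioned above.
function classifyParse(parse, text, limitMs = 60000) {
  const start = Date.now();
  try {
    parse(text);
  } catch (err) {
    // Any thrown error counts as a failure; keep the message for triage.
    return { status: 'failed', error: err.message, elapsedMs: Date.now() - start };
  }
  const elapsedMs = Date.now() - start;
  return { status: elapsedMs > limitMs ? 'timedOut' : 'passed', elapsedMs };
}

// Tally a list of results by status.
function summarize(results) {
  const counts = { passed: 0, timedOut: 0, failed: 0 };
  for (const r of results) counts[r.status] += 1;
  return counts;
}
```

To actually stop a runaway parse at the cutoff, each file would need to run in a separate process or worker that can be killed from outside.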