Open jacobwegner opened 1 year ago
(TL;DR: loading UD treebanks currently takes a lot of round-tripping and converting; provided we can map the syntax trees back to tokens in the internal data model, it would be great if we could just read from the CoNLL-U directly)
(And https://beyond-translation.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.parrish-eng1-trees:402?mode=syntax-trees isn't loading on top of that)
https://github.com/gregorycrane/Daphne is another repo to ingest
Most of the currently ingested treebanks are encoded from the Perseids Treebank template and conform to the AGDT v2 guidelines.
As part of helping @jchill-git bring in additional Arabic data, I'd like to revisit some work from 2021 that dealt with the UD / CoNLL-U format.
In 2021, I was experimenting with a pipeline that would:
I think it'd be great to have tighter integration with CoNLL-U / spaCy for loading treebanks. I hope I can spend some time on this before @jchill-git is at the point where he wants to load syntax trees into Beyond Translation.
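
For reference, a minimal sketch of what "reading from the CoNLL-U directly" could look like, using only the Python standard library. The names `Word`, `read_conllu`, and `SAMPLE` are hypothetical and not part of the existing ingestion code. The sketch assumes the standard ten tab-separated CoNLL-U columns and skips comment lines, multiword token ranges (IDs like `1-2`), and empty nodes (IDs like `1.1`). Mapping the resulting words back to tokens in the internal data model is not shown.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Word:
    """One syntactic word from a CoNLL-U sentence (hypothetical structure)."""
    id: int
    form: str
    lemma: str
    upos: str
    head: Optional[int]  # 0 means root; None if the column is "_"
    deprel: str


def read_conllu(text: str) -> Iterator[List[Word]]:
    """Yield one list of Word objects per sentence in a CoNLL-U string."""
    sentence: List[Word] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are ignored here.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        head = int(cols[6]) if cols[6] != "_" else None
        sentence.append(
            Word(int(cols[0]), cols[1], cols[2], cols[3], head, cols[7])
        )
    # Emit a final sentence that is not followed by a blank line.
    if sentence:
        yield sentence


# A small made-up sample, built from lists so the tabs are explicit.
SAMPLE = "\n".join(
    ["# text = The dog barks."]
    + [
        "\t".join(row)
        for row in [
            ["1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"],
            ["2", "dog", "dog", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
            ["3", "barks", "bark", "VERB", "_", "_", "0", "root", "_", "_"],
            ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
        ]
    ]
)

for words in read_conllu(SAMPLE):
    for w in words:
        print(w.id, w.form, w.head, w.deprel)
```

If something like this were adopted, the open question from the TL;DR remains: each `Word` would still need to be aligned with an existing token in the internal data model, for example by position within the sentence.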