Document and improve support for UD treebanks

jacobwegner commented 1 year ago

Most of the currently ingested treebanks are encoded from the Perseids Treebank template and conform to the AGDT v2 guidelines.

As part of helping @jchill-git bring in additional Arabic data, I'd like to revisit some work from 2021 that dealt with the UD / CoNLL-U format.

In 2021, I was experimenting with a pipeline that would:

- Read in a source text, process it using spaCy, and output CoNLL-U
- Read in the CoNLL-U and write out to the intermediate format based on AGDT2
  - Processing script: pipelines/syntax_trees/ud.py
  - Sample output: tlg0012.tlg001.parrish-eng1.json
- From the resulting output, reconstitute a "version" in Beyond Translation (tlg0012.tlg001.parrish-eng1-trees (TODO: find this script and link to it))
- Link the spaCy output to that version (syntax-trees/tlg0012.tlg001.parrish-eng1.json)
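
For reference, the "read in CoNLL-U and write out to an intermediate format" step could look roughly like the sketch below. This is not the code in pipelines/syntax_trees/ud.py. The function names, the sample sentence, and the shape of the output record (`sentence_id`, `words`, `head_id`, `relation`, and so on) are all assumptions made for illustration; the only thing taken from the CoNLL-U format itself is the ten tab-separated columns.

```python
from typing import Dict, Iterator, List, Optional

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of word rows per sentence (blank line = sentence break)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


def to_tree_record(sentence: List[Dict[str, str]], sentence_id: int) -> dict:
    """Build a HYPOTHETICAL intermediate record; not the real AGDT2-based schema."""
    def head_of(row: Dict[str, str]) -> Optional[int]:
        return int(row["head"]) if row["head"].isdigit() else None

    return {
        "sentence_id": sentence_id,
        "words": [
            {
                "id": int(row["id"]),
                "value": row["form"],
                "lemma": row["lemma"],
                "tag": row["upos"],
                "head_id": head_of(row),
                "relation": row["deprel"],
            }
            for row in sentence
        ],
    }


# Made-up sample input, not taken from any ingested treebank.
SAMPLE = (
    "# text = Sing, goddess\n"
    "1\tSing\tsing\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "3\tgoddess\tgoddess\tNOUN\t_\t_\t1\tvocative\t_\t_\n"
    "\n"
)

records = [
    to_tree_record(sentence, idx)
    for idx, sentence in enumerate(read_conllu(SAMPLE), start=1)
]
```

Keeping the reader separate from `to_tree_record` means the same parsing step could feed either the existing intermediate format or a more direct loader.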

I think it'd be great to have tighter integration with CoNLL-U / spaCy for loading treebanks. I hope I can spend some time on this before @jchill-git is at the point where he wants to load syntax trees into Beyond Translation.

jacobwegner commented 1 year ago

(TL;DR: there is a lot of round-tripping / converting that needs to happen to load UD treebanks; provided we can map the syntax trees back to tokens in the internal data model, it would be great if we could just read from the CoNLL-U directly)
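
To make the "map the syntax trees back to tokens" part concrete, here is a minimal sketch of one way the alignment might work. It assumes the internal tokens are whitespace-delimited strings (so punctuation stays attached, e.g. `Sing,`) while CoNLL-U splits punctuation into its own word. That assumption, the function name `align_words_to_tokens`, and the sample data are all hypothetical; the real data model may need something different.

```python
from typing import Dict, List


def align_words_to_tokens(forms: List[str], tokens: List[str]) -> Dict[int, int]:
    """Map each CoNLL-U word index to the index of the token that contains it.

    Greedy left-to-right alignment: consecutive word forms are concatenated
    until they spell out the current token exactly. Raises ValueError as soon
    as the two sequences diverge, so mismatches are surfaced rather than
    silently mis-linked.
    """
    mapping: Dict[int, int] = {}
    word_idx = 0
    for token_idx, token in enumerate(tokens):
        consumed = ""
        while consumed != token:
            if word_idx >= len(forms):
                raise ValueError(f"ran out of words while matching token {token!r}")
            candidate = consumed + forms[word_idx]
            if not token.startswith(candidate):
                raise ValueError(
                    f"word {forms[word_idx]!r} does not fit token {token!r}"
                )
            consumed = candidate
            mapping[word_idx] = token_idx
            word_idx += 1
    if word_idx != len(forms):
        raise ValueError("words left over after the last token")
    return mapping


# Made-up example: three CoNLL-U words against two whitespace tokens.
example = align_words_to_tokens(["Sing", ",", "goddess"], ["Sing,", "goddess"])
```

If an alignment like this holds for a whole text, the trees could be linked to existing tokens without first reconstituting a separate "-trees" version.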

jacobwegner commented 1 year ago

(And https://beyond-translation.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.parrish-eng1-trees:402?mode=syntax-trees isn't loading on top of that)

jacobwegner commented 6 months ago

https://github.com/gregorycrane/Daphne is another repo to ingest

scaife-viewer / beyond-translation-site

Document and improve support for UD treebanks #151