proycon opened 1 month ago
Re: web annotations: I can incorporate the lang information as an extra body.metadata.lang field in the Page annotations in the gt-untangle-globalise script: https://github.com/knaw-huc/globalise-tools/blob/main/scripts/gt-untangle-globalise.py
That looks good to me, yes; with multiple lang values if there are multiple languages on a page, I assume?
Yes, I'll make it a list:

"lang": ["nl", "de", "la"]
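For concreteness, a minimal sketch of what attaching such a list could look like; the add_lang helper and the stripped-down annotation dict are hypothetical stand-ins, not the actual gt-untangle-globalise code:

```python
# Hypothetical sketch: attach a "lang" list to the body.metadata of a
# Page web annotation, represented here as a plain nested dict.
def add_lang(annotation: dict, langs: list[str]) -> dict:
    """Set the detected language codes on the annotation's body metadata."""
    metadata = annotation.setdefault("body", {}).setdefault("metadata", {})
    metadata["lang"] = langs
    return annotation

page = {"type": "Annotation", "body": {"metadata": {"type": "Page"}}}
add_lang(page, ["nl", "de", "la"])
print(page["body"]["metadata"]["lang"])  # ['nl', 'de', 'la']
```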
I forgot to provide further feedback:
The pipeline output (pages.lang.tsv) is published in https://github.com/globalise-huygens/language-identification-data/tree/main/latin-script-pages . The manually corrected output should override whatever is in pages.lang.tsv and is still pending (@kintopp).
> Yes, I'll make it a list:
> "lang": ["nl", "de", "la"]

Right, you'll get iso-639-3 from the pipeline; I think @kintopp prefers that for the final web annotations too.
Another small update: the pipeline output (pages.lang.tsv) published in https://github.com/globalise-huygens/language-identification-data/tree/main/latin-script-pages now already includes all manually curated output as well. There's a column corrected with values 0 or 1 indicating whether or not the entry is the result of a manual correction. You can propagate that information to the web annotation output for provenance reasons. This means you only need that one file as input.
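A sketch of how that single file could be consumed; the column names (inv_nr, page_no, langs, corrected) follow this thread, but the real pages.lang.tsv may carry additional columns:

```python
# Sketch (assumed columns, not the pipeline's actual parser): read
# pages.lang.tsv into a lookup keyed on (inv_nr, page_no).
import csv

def read_pages_lang(path: str) -> dict:
    """Map (inv_nr, page_no) -> (language codes, manually-corrected flag)."""
    pages = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            key = (row["inv_nr"], row["page_no"])
            langs = [code for code in row["langs"].split(",") if code]
            pages[key] = (langs, row["corrected"] == "1")
    return pages
```

The corrected flag travels along with the languages so it can be written into the web annotation for provenance.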
> Yes, I'll make it a list:
> "lang": ["nl", "de", "la"]

> Right, you'll get iso-639-3 from the pipeline; I think @kintopp prefers that for the final web annotations too.

Yes, we're going to use three-letter language codes (ISO 639-3) wherever we can in Globalise in place of two-letter ones (ISO 639-1).
Does the Globalise team have a preference for how the lang information should be added to the Page annotation? For example, for the Page annotation https://annorepo.globalise.huygens.knaw.nl/w3c/globalise-2024-03-18/44f1b698-71fd-4bf7-8414-641655d3b60e which fields should be added to the body.metadata? Flat:

"lang": ["nld", "fra"],
"langCorrected": false

or nested:

"languageDetected": {
  "lang": ["nld", "fra"],
  "corrected": false
}
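To make the two candidate shapes concrete, here are hypothetical helpers producing each variant; neither shape is confirmed as the final Globalise convention:

```python
# Hypothetical helpers contrasting the two proposed body.metadata shapes.
def flat_metadata(langs: list[str], corrected: bool) -> dict:
    """Flat variant: two sibling keys directly on body.metadata."""
    return {"lang": langs, "langCorrected": corrected}

def nested_metadata(langs: list[str], corrected: bool) -> dict:
    """Nested variant: one languageDetected object grouping both values."""
    return {"languageDetected": {"lang": langs, "corrected": corrected}}
```

The nested variant keeps the detection result and its provenance flag together under one key, which may be easier to extend later; the flat variant is simpler to query.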
The output of the automatic language detection pipeline (knaw-huc/team-text-backlog#65) will be a TSV file with (amongst others!) the columns:

inv_nr  page_no  langs

langs is a comma-separated list of iso-639-3 language codes representing the languages that were found on the page.
This information needs to be translated to Web Annotations so people can search for languages in TextAnnoViz.
In addition to the automated output, there will be a similar TSV file with manually curated results; if an inv_nr and page_no combination occurs in there, it should override whatever is in the automated output's TSV.
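That override rule could be sketched as follows; merge_results and the dict-per-row representation are assumptions for illustration, not the pipeline's actual code:

```python
# Sketch of the override rule above: manually curated rows replace
# automated rows that share the same (inv_nr, page_no) key.
def merge_results(automated: list[dict], curated: list[dict]) -> list[dict]:
    """Merge TSV rows (dicts keyed by column name); curated entries win."""
    merged = {(row["inv_nr"], row["page_no"]): row for row in automated}
    for row in curated:  # later assignment overrides the automated entry
        merged[(row["inv_nr"], row["page_no"])] = row
    return list(merged.values())
```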
I'll post a comment with links to the actual TSV files once they are ready.
@brambg: You have the most expertise with the web annotation output and a feeling for how it integrates into AnnoRepo/Broccoli/TextAnnoViz. I could also give it a try, but perhaps this is something you can pick up?
@kintopp: Thus far this addresses only the page-level results; do we also want to propagate the paragraph or even line results (noisy!) to the rest of the system?