knaw-huc / globalise-tools

tools for globalise tasks
Apache License 2.0

Add script to construct web annotations from per-page language detection #8

Open · proycon opened this issue 1 month ago

proycon commented 1 month ago

The output of the automatic language detection pipeline (knaw-huc/team-text-backlog#65) will be a TSV file with the following columns (amongst others!):

inv_nr page_no langs

langs is a comma-separated list of iso-639-3 language codes representing the languages that were found on the page.
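For illustration, a few rows of that tab-separated file might look like this (the inventory and page numbers here are entirely hypothetical):

inv_nr  page_no  langs
1120    21       nld
1120    22       nld,fra
1120    23       lat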

This information needs to be translated to Web Annotations so people can search for languages in TextAnnoViz.

In addition to the automated output, there will be a similar TSV file with manually curated results; if an inv_nr and page_no combination occurs in there, it should override whatever is in the automated output's TSV.
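A minimal sketch of that override logic in Python (the curated file name pages.lang.curated.tsv is a placeholder, and the exact column set may differ):

import csv

def read_pages(path):
    # Key each row on (inv_nr, page_no) so later files can override earlier ones.
    with open(path, newline="", encoding="utf-8") as f:
        return {(row["inv_nr"], row["page_no"]): row
                for row in csv.DictReader(f, delimiter="\t")}

pages = read_pages("pages.lang.tsv")                # automated pipeline output
pages.update(read_pages("pages.lang.curated.tsv"))  # curated rows take precedence

for (inv_nr, page_no), row in pages.items():
    langs = row["langs"].split(",")  # iso-639-3 codes found on this page
    # ... emit a web annotation for this page here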

I'll post a comment with links to the actual TSV files once they are ready.

@brambg: You have the most expertise with the web annotation output and a feel for how it integrates into AnnoRepo/Broccoli/TextAnnoViz. I could also give it a try, but perhaps this is something you can pick up?

@kintopp: Thus far this addresses only the page-level results; do we also want to propagate the paragraph-level or even line-level results (noisy!) to the rest of the system?

brambg commented 1 month ago

Re: web annotations: I can incorporate the lang information as an extra body.metadata.lang field in the Page annotations in the gt-untangle-globalise script: https://github.com/knaw-huc/globalise-tools/blob/main/scripts/gt-untangle-globalise.py
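A sketch of where that field would sit in the annotation body (other metadata fields elided, value hypothetical):

"body": {
  "metadata": {
    ...,
    "lang": "nld"
  }
}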

proycon commented 1 month ago

That looks good to me, yeah; with multiple lang values if there are multiple languages, I assume?

brambg commented 1 month ago

Yes, I'll make it a list:

"lang": [ "nl", "de", "la"]

proycon commented 1 month ago

I forgot to provide further feedback:

The pipeline output (pages.lang.tsv) is published in https://github.com/globalise-huygens/language-identification-data/tree/main/latin-script-pages. The manually corrected output, which should override whatever is in pages.lang.tsv, is still pending (@kintopp).

proycon commented 1 month ago

> Yes, I'll make it a list:
>
> "lang": ["nl", "de", "la"]

Right, you'll get iso-639-3 from the pipeline; I think @kintopp prefers that for the final web annotations too.

proycon commented 2 weeks ago

Another small update: the pipeline output (pages.lang.tsv) published in https://github.com/globalise-huygens/language-identification-data/tree/main/latin-script-pages now includes all manually curated output as well. There is a corrected column with values 0 or 1 indicating whether the entry is the result of a manual correction. You can propagate that information to the web annotation output for provenance reasons. This means you only need that one file as input.
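A minimal reading sketch under that single-file setup (treating the corrected column as the strings "0"/"1" is an assumption about the encoding):

import csv

with open("pages.lang.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        langs = row["langs"].split(",")      # iso-639-3 codes
        corrected = row["corrected"] == "1"  # True if manually corrected
        # ... carry langs and corrected into the page's web annotation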

kintopp commented 2 weeks ago

>> Yes, I'll make it a list: "lang": ["nl", "de", "la"]
>
> Right, you'll get iso-639-3 from the pipeline; I think @kintopp prefers that for the final web annotations too.

Yes, we're going to use three-letter language codes (iso-639-3) wherever we can in Globalise in place of two-letter ones (iso-639-1).
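For the handful of codes seen in this thread, a hand-written conversion sketch (a library such as pycountry could do this more generally; the helper name is made up):

# iso-639-1 (two-letter) to iso-639-3 (three-letter) for the codes above
ISO_639_1_TO_3 = {
    "nl": "nld",  # Dutch
    "de": "deu",  # German
    "la": "lat",  # Latin
    "fr": "fra",  # French
}

def to_iso639_3(code: str) -> str:
    # Map known two-letter codes; pass anything else (e.g. three-letter codes) through.
    return ISO_639_1_TO_3.get(code, code)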

brambg commented 6 days ago

Does the globalise team have a preference for how the lang information should be added to the Page annotation?

For example, for the Page annotation https://annorepo.globalise.huygens.knaw.nl/w3c/globalise-2024-03-18/44f1b698-71fd-4bf7-8414-641655d3b60e, which fields should be added to body.metadata?

"lang": ["nld","fra"],
"langCorrected": false

or nested:

"languageDetected": {
  "lang": ["nld","fra"],
  "corrected": false
}