scaife-viewer / sv-mini-atlas

ATLAS implementation for the Scaife "SV Mini" prototype
https://scaife-viewer.org/
MIT License

For discussion only: Token data model #7

Closed · jacobwegner closed 3 years ago

jacobwegner commented 4 years ago

@jtauber wrote:

remember that, for the purposes of our ingestion, the works with token-level annotations are already tokenized

(Hopefully keeping this discussion distinct from "Spike to explore normalizing content (tokens as text parts at an exemplar level)"...) The token-level annotations can be roughly mapped to the whitespace-separated words in the lowest text part.

For what we've done in Digital Sirah, the "values" of a token include punctuation.

# split on whitespace

Token
  - value # includes punctuation, e.g. οὐλομένην,
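
A minimal sketch of that approach (plain Python, illustrative only): splitting on whitespace leaves punctuation attached to each token value.

    line = "οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε,"  # Iliad 1.2

    # splitting on whitespace keeps punctuation attached to each value
    tokens = [{"value": value} for value in line.split()]
    # [{'value': 'οὐλομένην,'}, {'value': 'ἣ'}, {'value': 'μυρί᾽'}, ...]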

For the current scaife.perseus.org implementation, the word_tokens in the response strip out whitespace and punctuation from their values:

https://scaife.perseus.org/library/passage/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7/json/

# strip whitespace and punctuation, address using offsets

Token
  - word_value # οὐλομένην
  - character-offset # 38
  - cts-string-index # 1
  - type # word
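
A rough sketch of how those fields could be derived (assumptions: the string index counts the nth occurrence of the word value within the passage, matching the οὐλομένην[1] subreference below; this is not the actual scaife.perseus.org code):

    import re

    def word_tokens(passage_text):
        # Illustrative sketch only: derive word_value, a character offset
        # into the passage, and a CTS-style string index (the nth
        # occurrence of that word value in the passage).
        occurrences = {}
        tokens = []
        for match in re.finditer(r"\w+", passage_text):
            word = match.group()
            occurrences[word] = occurrences.get(word, 0) + 1
            tokens.append({
                "word_value": word,
                "character-offset": match.start(),
                "cts-string-index": occurrences[word],
                "type": "word",
            })
        return tokens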

The offset and string index information allows us to highlight the token within the reader:

https://scaife.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7/?highlight=%CE%BF%E1%BD%90%CE%BB%CE%BF%CE%BC%CE%AD%CE%BD%CE%B7%CE%BD%5B1%5D
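
That highlight value is just the word_value plus its cts-string-index, URL-encoded; a minimal sketch of constructing it:

    from urllib.parse import quote

    word_value, cts_string_index = "οὐλομένην", 1
    highlight = quote(f"{word_value}[{cts_string_index}]", safe="")
    # "%CE%BF%E1%BD%90%CE%BB%CE%BF%CE%BC%CE%AD%CE%BD%CE%B7%CE%BD%5B1%5D"
    url = (
        "https://scaife.perseus.org/reader/"
        "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7/"
        f"?highlight={highlight}"
    )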

Looking at the source data for the Enchiridion, I see that it has both a text value and a word value:

Token
- text_value #  ἡμῖν,
- word_value #  ἡμῖν
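
In other words, word_value looks like text_value with surrounding whitespace and punctuation trimmed; a minimal sketch of that relationship (an assumption based on the two examples above, not the actual ingestion code):

    import unicodedata

    def word_value(text_value):
        # Illustrative only: trim whitespace, then leading/trailing
        # punctuation (Unicode category P*), to mirror the source data.
        value = text_value.strip()
        while value and unicodedata.category(value[0]).startswith("P"):
            value = value[1:]
        while value and unicodedata.category(value[-1]).startswith("P"):
            value = value[:-1]
        return value

    word_value(" ἡμῖν, ")  # "ἡμῖν"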

My thought for handling the following cards was that we would (at least initially) store that text_value, ending up with something like:

Token
  - TextPart (FK)
  - text
  - word
  - uuid
  - lemma
  - gloss
  - part_of_speech
  - tag
  - case
  - mood
  - named_entity

  (automatically generated)
  - idx (relative to version/exemplar)
  - position (relative to lowest text part)
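
As a concrete sketch, here is how that could look as a Django model (field types, blank choices, and the TextPart model are assumptions; only the field names come from the list above):

    import uuid

    from django.db import models


    class Token(models.Model):
        # Assumed field types; only the names are from the proposal above.
        text_part = models.ForeignKey(
            "TextPart", related_name="tokens", on_delete=models.CASCADE
        )
        text = models.CharField(max_length=255)   # e.g. ἡμῖν,
        word = models.CharField(max_length=255)   # e.g. ἡμῖν
        uuid = models.UUIDField(default=uuid.uuid4)
        lemma = models.CharField(max_length=255, blank=True)
        gloss = models.CharField(max_length=255, blank=True)
        part_of_speech = models.CharField(max_length=255, blank=True)
        tag = models.CharField(max_length=255, blank=True)
        case = models.CharField(max_length=255, blank=True)
        mood = models.CharField(max_length=255, blank=True)
        named_entity = models.CharField(max_length=255, blank=True)

        # automatically generated at ingestion time
        idx = models.IntegerField()       # relative to version/exemplar
        position = models.IntegerField()  # relative to lowest text part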

I hadn't considered whether we would want to retain the character-offset and cts-string-index information from above (that might inform "As a reader, I want the browser URL to update when I select tokens"). With Digital Sirah as an example, we had been passing around line idx values to refer to a particular line. I could see a world where we could do a similar thing with a "word token" exemplar of the Iliad, so that rather than relying on those cts-string-index values:

https://scaife.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7/?highlight=%CE%BF%E1%BD%90%CE%BB%CE%BF%CE%BC%CE%AD%CE%BD%CE%B7%CE%BD%5B1%5D

you could even reference a range of tokens by idx

(and even end up with a URN like urn:cts:greekLit:tlg0012.tlg001.perseus-grc2.word_tokens:1-6 or a URL like https://sv-mini.netlify.com/reader?urn=urn%3Acts%3AgreekLit%3Atlg0012.tlg001.perseus-grc2.word_tokens%3A1-6&highlight=6)
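
In that world, resolving a word_tokens:1-6 reference could reduce to a simple idx-range query (hypothetical, building on the model sketch above; it also assumes a urn field on TextPart):

    # hypothetical: resolve word_tokens:1-6 for the Iliad version by idx
    tokens = Token.objects.filter(
        text_part__urn__startswith="urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:",
        idx__range=(1, 6),
    ).order_by("idx")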

jacobwegner commented 4 years ago

(Eventually this might get modeled as something like django/deps)

jacobwegner commented 4 years ago

@jtauber and I chatted a bit; we're going to drop character-offset and cts-string-index initially and rely on the position / idx values.

We can also just roll with a text_value and word_value, and revisit things like the norm value or tokenizing punctuation and whitespace in the future.

jacobwegner commented 4 years ago

I've done just the most basic of implementations and deployed it as a review app:

Tokenizing the Iliad:

    Created 111864 tokens for version: urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:
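
A rough sketch of what that ingestion pass might look like (hypothetical; it assumes the model sketch above and plain whitespace tokenization):

    def tokenize_text_parts(version_urn, text_parts):
        # Hypothetical ingestion pass: one Token per whitespace-separated
        # value in each lowest-level text part.
        idx = 0
        tokens = []
        for text_part in text_parts:
            for position, value in enumerate(text_part.text_content.split(), 1):
                tokens.append(Token(
                    text_part=text_part,
                    text=value,                # keeps punctuation
                    word=value.strip(",.·;"),  # crude punctuation strip
                    idx=idx,                   # relative to the version
                    position=position,         # relative to the text part
                ))
                idx += 1
        Token.objects.bulk_create(tokens)
        print(f"Created {len(tokens)} tokens for version: {version_urn}")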

Sample queries are in README.md.

curl 'https://mini-stack-a-spike-toke-t7jtur.herokuapp.com/graphql/' \
  --data-binary '{"query":"{\n  textParts(urn_Startswith: \"urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2.\", first: 5) {\n    edges {\n      node {\n        ref\n        textContent\n        tokens {\n          edges {\n            node {\n              value\n              idx\n            }\n          }\n        }\n      }\n    }\n  }\n}","variables":null}' \
  -H 'Content-Type: application/json' \
  --compressed | jq
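
For readability, here is the GraphQL query embedded in that curl call, unescaped:

    {
      textParts(urn_Startswith: "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2.", first: 5) {
        edges {
          node {
            ref
            textContent
            tokens {
              edges {
                node {
                  value
                  idx
                }
              }
            }
          }
        }
      }
    }
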
jacobwegner commented 4 years ago

(review app is at https://mini-stack-a-spike-toke-t7jtur.herokuapp.com/graphql/)

jacobwegner commented 4 years ago

As an ATLAS consumer, I want to retrieve white-space separated tokens for a range of text parts

jacobwegner commented 4 years ago

@jhrr I'll get this to a proper non-WIP PR soon, but at a high level the sanity check is: we have a Token model that we can apply additional annotations to, and we can retrieve those tokens/annotations using any of our text part endpoints.

I've got several cards to groom in Trello regarding tokens as exemplars, speeding up ingestion, and additional modeling, but if this first pass seems sane, it gives us a good starting point for token-based functionality in the reader.

jhrr commented 4 years ago

@jacobwegner LGTM

jacobwegner commented 3 years ago

This branch has gone stale.