Closed jacobwegner closed 4 years ago
(Eventually this might get modeled as something like django/deps.)

@jtauber and I chatted a bit; we're going to drop `character-offset` and `cts-string-index` initially and rely on the `position` / `idx` values. We can also just roll with a `text_value` and `word_value`, and revisit things like the norm value or tokenizing punctuation and whitespace in the future.
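The plan above can be sketched as a minimal whitespace tokenizer. The field names (`text_value`, `word_value`, `position`, `idx`) come from this thread, but their exact semantics here are assumptions: `position` as the 1-based slot within a text part, `idx` as a running counter across the version, and `word_value` as the token with surrounding punctuation stripped.

```python
import re

def tokenize(text_part: str, start_idx: int = 0):
    """Split a text part on whitespace into token dicts (a sketch).

    text_value keeps the token as it appears (punctuation included);
    word_value strips leading/trailing non-word characters;
    position is 1-based within the text part; idx is a running
    counter starting at start_idx. Field semantics are assumptions.
    """
    tokens = []
    for position, text_value in enumerate(text_part.split(), start=1):
        # \W matches Unicode non-word characters, so Greek letters survive
        word_value = re.sub(r"^\W+|\W+$", "", text_value)
        tokens.append({
            "text_value": text_value,
            "word_value": word_value,
            "position": position,
            "idx": start_idx + position - 1,
        })
    return tokens
```

Dropping the character offsets means a token is addressable only by `position`/`idx`, which is the trade-off discussed below.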
I've done just the most basic of implementations and deployed it as a review app:

```
--[Tokenizing the Iliad]-- Created 111864 tokens for version: urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:
```
Sample queries are in README.md.
```shell
curl 'https://mini-stack-a-spike-toke-t7jtur.herokuapp.com/graphql/' \
  --data-binary '{"query":"{\n textParts(urn_Startswith: \"urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2.\", first: 5) {\n edges {\n node {\n ref\n textContent\n tokens {\n edges {\n node {\n value\n idx\n }\n }\n }\n }\n }\n }\n}","variables":null}' \
  -H 'Content-Type: application/json' \
  --compressed | jq
```
(review app is at https://mini-stack-a-spike-toke-t7jtur.herokuapp.com/graphql/)
@jhrr I'll get this into a proper non-WIP PR soon, but at a high level the sanity check is: we have a Token model that we can apply additional annotations to, and we can retrieve those tokens/annotations using any of our text part endpoints.
I've got several cards to groom in Trello regarding tokens as exemplars and thinking through speeding up ingestion or additional modeling, but if this first pass seems sane, it gives us a good starting point for token-based functionality in the reader.
@jacobwegner LGTM
This branch has gone stale
@jtauber wrote:
(Hopefully keeping this discussion distinct from "Spike to explore normalizing content (tokens as text parts at an exemplar level)"...) The token-level annotations can be roughly mapped to whitespace separation of the words in the lowest text part.
For what we've done in Digital Sirah, the "values" of a token include punctuation.
For the current scaife.perseus.org implementation, the `word_tokens` in the response strip out whitespace and punctuation from their values: https://scaife.perseus.org/library/passage/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7/json/
The offset and string index information allows us to highlight the token within the reader:
https://scaife.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7/?highlight=%CE%BF%E1%BD%90%CE%BB%CE%BF%CE%BC%CE%AD%CE%BD%CE%B7%CE%BD%5B1%5D
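Judging from the example URL, the highlight value appears to be the word followed by a bracketed occurrence index (`οὐλομένην[1]`), percent-encoded. A small sketch of building such a parameter, assuming the index is the 1-based count of that word's occurrences up to the highlighted position (an inference from the URL, not a documented format):

```python
from urllib.parse import quote

def highlight_param(words, target_position):
    """Build a scaife-style highlight value, word[n], where n is the
    1-based occurrence count of the word up to target_position.
    The syntax is inferred from the example reader URL."""
    word = words[target_position]
    occurrence = words[: target_position + 1].count(word)
    # quote() percent-encodes the brackets and any non-ASCII characters
    return quote(f"{word}[{occurrence}]")
```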
Looking at the source data for Enchiridion, I see that it has both a `text` value and a `word` value. My thought for handling the following cards:
was that we would (at least initially) store that `text_value`, ending up with something like:
I hadn't considered whether we would want to retain the `character-offset` and `cts-string-index` information from above (that might inform "As a reader, I want the browser URL to update when I select tokens"). With Digital Sirah as an example, we had been passing around line `idx` values to refer to a particular line. I could see a world where we could do a similar thing with a "word token" exemplar of the Iliad, so that rather than relying on those `cts-string-index` values: https://scaife.perseus.org/reader/urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7/?highlight=%CE%BF%E1%BD%90%CE%BB%CE%BF%CE%BC%CE%AD%CE%BD%CE%B7%CE%BD%5B1%5D
you could reference a range of tokens by `idx` (and even end up with a URN like `urn:cts:greekLit:tlg0012.tlg001.perseus-grc2.word_tokens:1-6` or a URL like https://sv-mini.netlify.com/reader?urn=urn%3Acts%3AgreekLit%3Atlg0012.tlg001.perseus-grc2.word_tokens:%3A1-6?highlight=6)
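The idx-range idea above could be sketched as a small URN builder. To be clear, this `word_tokens` exemplar pattern is a proposal from this thread, not a standardized CTS form, and the function name here is hypothetical:

```python
def word_token_range_urn(version_urn: str, start_idx: int, end_idx: int) -> str:
    """Build a hypothetical word-token exemplar URN for an idx range,
    following the pattern proposed in this thread (not standard CTS)."""
    # Drop a trailing passage separator before appending the exemplar part
    base = version_urn.rstrip(":")
    return f"{base}.word_tokens:{start_idx}-{end_idx}"
```

This would make a token range a first-class citable reference, the same way line `idx` values were passed around in Digital Sirah.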