Open jacobwegner opened 1 year ago
@jchill-git I've been working on this today and will hopefully circle back tomorrow. Ping me here or on Slack if there is anything else I can help with as far as getting the texts / alignments in.
If you have been able to get your "version" into the database, this might be a helpful code snippet for getting out the tokens:
from pathlib import Path

from scaife_viewer.atlas.parallel_tokenizers import tokenize_text_parts

outdir = Path(".")  # the current directory, e.g. backend/
version_urn = "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:"  # replace with your version URN
# writes out to urnctsgreeklittlg0012tlg001perseus-grc2.csv
outf = tokenize_text_parts(outdir, version_urn)
Snippet for that CSV file:
https://gist.github.com/jacobwegner/3a96e1763b7bc22d827680db1351a377
This would give you a CSV that could be loaded into a dataframe, with the ve_ref value already calculated for each token.
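If a dataframe library isn't handy, the CSV can also be read with the standard library. This is only a minimal sketch: the function names are hypothetical, and it assumes the file has a header row with a column literally named ve_ref (check the gist above for the actual columns).

```python
import csv
from pathlib import Path


def load_token_rows(csv_path):
    """Read the tokens CSV into a list of dicts, one dict per token row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def ve_refs(rows):
    """Collect the ve_ref value of each row (assumes a 've_ref' column exists)."""
    return [row["ve_ref"] for row in rows]
```

Calling load_token_rows on the path returned by tokenize_text_parts and then ve_refs on the result should list the reference for each token, assuming that column name is right.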
There is also tokenize_textparts_and_insert, which could then be used to tokenize the content that was partially ingested.
I'll keep working on the backend branch and provide updates on my progress on this issue.
My commit in https://github.com/scaife-viewer/backend/commit/f4b4ecf9153cf6d4cd5badeb724593258038bd91 wasn't working for the Arabic content in the Codespace today; need to take a closer look.
The other thing I'd like to capture here, and add a hook for / add to the documentation, is how the SV_ATLAS_DATA_DIR setting works.
If we made it an environment variable, that would allow folks to use a subset of the data in a data-wip directory or something like that. E.g.:
data-wip/
├─ library/
│ ├─ <textgroup>/
│ │ ├─ metadata.json # textgroup metadata
│ │ ├─ <work>/
│ │ │ ├─ metadata.json # work and version metadata
│ │ │ ├─ <version>.txt # version content
export SV_ATLAS_DATA_DIR=data-wip
./manage.py prepare_atlas_db --force
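For the export to have any effect, the settings module would need to read the variable. A rough sketch of how that might look; the helper name, the "data" default, and resolving relative paths against the project root are all assumptions for illustration, not what the backend does today:

```python
import os
from pathlib import Path

# Assumed default, matching the data/ directory mentioned in this issue.
DEFAULT_DATA_DIR = "data"


def resolve_atlas_data_dir(project_root, environ=None):
    """Pick the ATLAS data directory from SV_ATLAS_DATA_DIR, else the default.

    Relative values are resolved against project_root (an assumption).
    """
    environ = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default directory.
    raw = environ.get("SV_ATLAS_DATA_DIR") or DEFAULT_DATA_DIR
    candidate = Path(raw).expanduser()
    if not candidate.is_absolute():
        candidate = Path(project_root) / candidate
    return candidate
```

In a Django settings file this could be wired up as something like SV_ATLAS_DATA_DIR = resolve_atlas_data_dir(PROJECT_ROOT), where PROJECT_ROOT stands in for whatever base path the settings already define.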
Files could be worked on from within data-wip (and even tracked in Git).
Once a file was ready for promotion to data/, it would be moved and updated in Git.
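The promotion step could be scripted along these lines. This is just a sketch: promote is a hypothetical helper, it assumes data-wip/ and data/ share the same internal layout, it does not guard against overwriting a file that already exists in data/, and the Git side (git add / commit) is left out.

```python
import shutil
from pathlib import Path


def promote(relative_path, wip_root="data-wip", data_root="data"):
    """Move one file from the work-in-progress tree into the main data tree.

    relative_path is the path below the root, e.g.
    library/<textgroup>/<work>/<version>.txt
    """
    source = Path(wip_root) / relative_path
    target = Path(data_root) / relative_path
    if not source.is_file():
        raise FileNotFoundError(source)
    # Create the textgroup / work directories in data/ if they are missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target
```

Keeping the relative path identical on both sides means the file lands in the same textgroup / work position it was previewed in.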
The current ingestion process is idempotent; it assumes that we're always building up data from scratch, because that's what we do when we deploy the site.
During local development, I have a few shortcuts that I use to give a tighter "feedback loop" when working on a particular annotation.
I'd like to have this to support @jchill-git, @gregorycrane and others who may be doing a lot more content previewing / editing than I have been in the past. It will also help us to be better at incremental updates to content when content moves out of this "code" repo and into content repos like https://github.com/PerseusDL/canonical-greekLit or https://github.com/scaife-viewer/ogl-pdl-annotations.