Provide breadcrumbs / code hooks for "partial re-ingestion" of content

scaife-viewer / beyond-translation-site

Site used to iterate on translation alignments within the Scaife Viewer ecosystem


Provide breadcrumbs / code hooks for "partial re-ingestion" of content #152

Open jacobwegner opened 1 year ago

jacobwegner commented 1 year ago

The current ingestion process is idempotent; it assumes that we're always building up data from scratch, because that's what we do when we deploy the site.

During local development, I have a few shortcuts that I use to give a tighter "feedback loop" when working on a particular annotation.

I'd like to make these shortcuts available to support @jchill-git, @gregorycrane and others who may be doing a lot more content previewing / editing than I have in the past. It will also help us get better at incremental updates once content moves out of this "code" repo and into content repos like https://github.com/PerseusDL/canonical-greekLit or https://github.com/scaife-viewer/ogl-pdl-annotations.

jacobwegner commented 1 year ago

@jchill-git I've been working on this today and will hopefully circle back tomorrow. Ping me here or on Slack if there is anything else I can help with as far as getting the texts / alignments in.

If you have been able to get your "version" into the database, this might be a helpful code snippet for getting out the tokens:

from pathlib import Path
from scaife_viewer.atlas.parallel_tokenizers import tokenize_text_parts

outdir = Path(".")  # the current directory, e.g. backend/

version_urn = "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:"  # replace with your version URN

outf = tokenize_text_parts(outdir, version_urn)  # writes out to urnctsgreeklittlg0012tlg001perseus-grc2.csv

Snippet for that CSV file:

https://gist.github.com/jacobwegner/3a96e1763b7bc22d827680db1351a377

This would give you a CSV, with the ve_ref value already calculated for each token, that could be loaded into a dataframe.
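If a dataframe is more than you need, here is a minimal sketch of reading that CSV with the standard library and looking up tokens by ve_ref. I'm assuming the file has a header row with a ve_ref column; the helper names (load_tokens, find_by_ve_ref) are hypothetical, and any other columns are passed through untouched.

```python
import csv


def load_tokens(csv_file):
    """Read every token row from an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))


def find_by_ve_ref(rows, ve_ref, ref_column="ve_ref"):
    """Return the rows whose ve_ref column matches the given value."""
    # Assumption: the column is named "ve_ref"; pass ref_column if it differs.
    return [row for row in rows if row.get(ref_column) == ve_ref]
```

To try it against the file written by the snippet above, open urnctsgreeklittlg0012tlg001perseus-grc2.csv with encoding="utf-8" and newline="", pass the file object to load_tokens, and then call find_by_ve_ref with a ve_ref value copied from the CSV.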

jacobwegner commented 1 year ago

https://github.com/scaife-viewer/backend/commit/f4b4ecf9153cf6d4cd5badeb724593258038bd91 provides an initial implementation of partial ingestion / re-ingestion, specifically the test_partial_ingestion function.
https://github.com/scaife-viewer/backend/commit/e324cef87605988c21a83b2cb03fabc4f0a4430f provides tokenize_textparts_and_insert, which could then be used to tokenize the content that was partially ingested.

I'll keep working on the backend branch and provide updates on my progress on this issue.

jacobwegner commented 1 year ago

My commit in https://github.com/scaife-viewer/backend/commit/f4b4ecf9153cf6d4cd5badeb724593258038bd91 wasn't working for the Arabic content in the Codespace today; need to take a closer look.

The other thing I'd like to capture here, and either add a hook for or document, is how the SV_ATLAS_DATA_DIR setting works.

If we made it an environment variable, that would allow folks to use a subset of the data in a data-wip directory or something like that. E.g.

data-wip/
├─ library/
│  ├─ <textgroup>/
│  │  ├─ metadata.json  # textgroup metadata
│  │  ├─ <work>/
│  │  │  ├─ metadata.json  # work and version metadata
│  │  │  ├─ <version>.txt  # version content

export SV_ATLAS_DATA_DIR=data-wip

./manage.py prepare_atlas_db --force

Files could be worked on from within data-wip (and even tracked in Git).

Once the file was ready for promotion to data/, it would be moved and updated in Git.
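A rough sketch of how the settings side of this might look, assuming the setting currently points at a data directory under the project root. The helper name resolve_atlas_data_dir, the "data" default, and PROJECT_ROOT are assumptions for illustration, not what the site does today.

```python
import os
from pathlib import Path


def resolve_atlas_data_dir(project_root, environ=None, default_dirname="data"):
    """Pick the ATLAS data directory from the environment, else the default."""
    environ = os.environ if environ is None else environ
    # An unset or empty SV_ATLAS_DATA_DIR falls back to the default directory.
    configured = environ.get("SV_ATLAS_DATA_DIR") or default_dirname
    path = Path(configured)
    # Relative values such as "data-wip" are resolved against the project root.
    if not path.is_absolute():
        path = Path(project_root) / path
    return path
```

In a settings module this could be wired up as SV_ATLAS_DATA_DIR = resolve_atlas_data_dir(PROJECT_ROOT), where PROJECT_ROOT stands in for whatever the settings file already uses as its base path. With that in place, the export above would point prepare_atlas_db at data-wip without any code changes.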