scaife-viewer / beyond-translation-site

Site used to iterate on translation alignments within the Scaife Viewer ecosystem
3 stars 4 forks source link

Improve onboarding for developers #104

Open jacobwegner opened 1 year ago

jacobwegner commented 1 year ago

@pletcher raised an issue, in that we don't have documentation yet on working with a local copy of the ATLAS package within the docker-compose environment.

I've been doing most of my development on my local machine using a virtualenv and then using docker-compose to browse locally or for deployments.

I wanted to capture a few notes here and will turn that into a documentation update.

Working with a local checkout of the scaife-viewer-atlas package within docker-compose

1) Clone https://github.com/scaife-viewer/backend into the backend/ directory:

cd backend
git clone git@github.com:scaife-viewer/backend.git

Since this directory is mounted into the container, its contents can be edited locally or from within the container.

2) Install the local copy within a running Docker container

# Assumes docker-compose up

docker-compose exec -it atlas sh

# Uninstall the existing package
$ pip uninstall scaife-viewer-atlas
Found existing installation: scaife-viewer-atlas 0.1a15
Uninstalling scaife-viewer-atlas-0.1a15:
  Would remove:
    /opt/scaife-stack/lib/python3.9/site-packages/scaife_viewer/*
    /opt/scaife-stack/lib/python3.9/site-packages/scaife_viewer_atlas-0.1a15.dist-info/*
Proceed (Y/n)? y
  Successfully uninstalled scaife-viewer-atlas-0.1a15

# Install the locally editable copy:
pip install -e backend/atlas

or

3) Edit requirements.txt to install the local copy the next time the container is built

diff --git a/backend/requirements.txt b/backend/requirements.txt
index c67b05f..03b3a51 100644
--- a/backend/requirements.txt
+++ b/backend/requirements.txt
@@ -11,7 +11,7 @@ jsonlines==2.0.0
 more_itertools==8.12.0
 pandas==1.3.1
 requests>=2.0.6
-scaife-viewer-atlas @https://github.com/scaife-viewer/backend/archive/623c79e414f7d11852195dbe8f8e0ecad2aba355.zip#subdirectory=atlas
+-e /opt/scaife-stack/src/backend/atlas
 whitenoise==4.1.4

After changes have been verified within scaife-stack-atlas, we should publish a new version of the package and update requirements.txt to use the remote version so that work can be reviewed by other developers / deployed on a review app instance.

pletcher commented 1 year ago

Thanks for this @jacobwegner!

I just wanted to document a few issues that I've run into while getting set up -- I'm patching some of these as I go, but we might want to do a sweep once the ROI feature is ready to go.

  1. TEXT_ANNOTATION_KIND_SCHOLIA and TEXT_ANNOTATION_KIND_SYNTAX_TREE appear to have been moved from models.py to constants.py. I changed the import statement in importers/text_annotations.py to reflect the new location.
  2. apply_token_annotations() is being called with reset=True in ingestion_pipeline.py. Removing the keyword argument (which isn't being used by apply_token_annotations() anymore) appears to fix the issue.
  3. Named entity lookup in importers/named_entities.py#_apply_entities appears to be failing. I've added a try-catch with some debugging output. UPDATE 2: It looks like the data/ directory that's included with the repo is outdated; deleting that and re-running python manage.py stage_atlas_data creates data that looks like what the script expects.
  4. I can't find the GrammaticalEntry, GrammaticalEntryCollection, and TextAnnotationCollection models that are imported in scaife_stack_atlas/temp.py -- I've commented those imports out, along with the functions that use them (and the calls to those functions in the settings.py pipeline)
  5. TextAnnotation seems no longer to belong to a collection as expected in temp.py
  6. There is no "ve_ref" column in the CSV that's read by extract_ref_and_token_position() -- the first row looks like:
    {
    "uuid":"t1.1_0",
    "value":"μῆνιν",
    "word_value":"μῆνιν",
    "lemma":"μῆνις",
    "gloss":"",
    "part_of_speech":"noun",
    "tag":"n-s---fa-",
    "case":"accusative",
    "mood":""
    }

    Judging by the regex at the top of the file, it looks like "ve_ref" maybe used to point to what's now the uuid column -- but the format has changed from what the regex expects, so the assert match statement fails.

I updated the regex and changed the lookup, and that mostly seems to be working. The problem now occurs when we get to update_version_tokens(): not all of the tokens exist in the CSV, so the lookup fails there.

As an aside, I'm also working on getting this working outside of Docker, as I worry that that extra layer is adding complexity for this initial setup phase.

I'll continue to update here as I encounter/fix the above issues. Some of the problems here might be user error on my part -- please let me know if that seems like the case!

jacobwegner commented 1 year ago

@pletcher: I'm very sorry about all this...I think I've caused some additional confusion by not having better documentation (in the existing README.md file or in this issue we're using to track updates!)

I'd be happy to sync up on a call or Slack to walk through some of these issues. temp.py is a gigantic hack that I want to revisit so we have proper "abstract" implementation of pipelines in scaife-viewer-atlas and project-level overrides in places like beyond-translation-site.

As far as all of the errors you're seeing, it's likely because you've checked out scaife-viewer-atlas as of main, but beyond-translation-site is using the branch atlas/grammar). I really want to get proper release processes for scaife-viewer-atlas so we can avoid issues like this.

To resolve your issues, I'd suggest stashing / copying the actual changes you needed to make to the GraphQL schema schema.py, and then creating a branch off of atlas/grammar:

# assuming git clone git@github.com:scaife-viewer/backend.git
git stash
git checkout atlas/grammar
git checkout -b atlas/<your-feature-name>
git stash pop
# continue on to changes to `scaife_viewer/atlas/schema.py`
# commit changes to atlas/<your-feature-name>
# push changes to GtiHub

After you've pushed any changes to GitHub for ATLAS, as mentioned above, we can update the pinned requirement in requirements.txt:

scaife-viewer-atlas @https://github.com/<your-org>/<your-fork-repo>/archive/<your-commit-sha>.zip#subdirectory=atlas

Again, I'm very sorry for the inconvenience, and let me know if a call or screenshare would help.

pletcher commented 1 year ago

@jacobwegner Please don't apologize! I just wanted to keep track of what I was running into.

As far as all of the errors you're seeing, it's likely because you've checked out scaife-viewer-atlas as of main, but beyond-translation-site is using the branch atlas/grammar). I really want to get proper release processes for scaife-viewer-atlas so we can avoid issues like this.

Ah, okay -- I was starting to wonder if that was the case. I'll checkout the correct branch and see if I can get set up that way. I don't think I'll need to do a stash pop yet since the changes that I've made so far have been focused on getting the data loaded.

Thanks for this again -- I'm happy to keep updating here, if that's okay?

jacobwegner commented 1 year ago

@pletcher: Updates here are fine...especially because I'm aware of where things aren't always tidy / clear for other developers to jump in, I'm happy to be available synchronously too so you're not paying a productivity tax due to my lack of clarity :-)

pletcher commented 1 year ago

Thanks! I think that'll be good to do soon as well, maybe late this week or early next? I'll send you an email :)

pletcher commented 1 year ago

Another update (2023-01-10): Switching to the atlas/grammar branch in the backend repo fixed almost everything. 🎉 It seems like the data are now just committed to the repo -- deleting data/ and running python manage.py stage_atlas_data led to errors in the ingestion process, but it went mostly fine (see below) with the committed data.

I had to bump a couple of dependencies to the latest minor versions (pandas to 1.5.2 and lxml to 4.9.2) because of being on an M1 Mac.

And in temp.py#process_file(), I added a check/early return in case data is a list: I think temp.py#get_paths() is pulling in the "manifest" file at data/annotations/text-alignments/alignments-tlg0012.tlg001.json, which errors out at the data["versions"] lookup.

jacobwegner commented 1 year ago

I had to bump a couple of dependencies to the latest minor versions (pandas to 1.5.2 and lxml to 4.9.2) because of being on an M1 Mac.

Totally fine; feel free to commit those as part of the PR

Re: stage_atlas_data....another README.md change I can make. This repository is an iteration of a few early repositories that gradually evolved to https://github.com/scaife-viewer/scaife-stack.

We decided not to ship data in the scaife-stack repository, but I've found it was easier to make the entire repository buildable with just git checkout.

It looks like we updated this in the top-level README.md, but not in backend/README.md.

jacobwegner commented 1 year ago

(Specifically https://github.com/scaife-viewer/beyond-translation-site/blob/main/README.md#loading-data-into-atlas)