ipno-llead / processing

Processing repo for the Innocence Project New Orleans' Louisiana Law Enforcement Accountability Database

bug/canoncicals #476

Closed ayyubibrahimi closed 1 year ago

pckhoi commented 1 year ago

So it looks like it is trying to queue PDFs for OCR, which isn't supposed to happen. All OCR queueing should be done locally. Can you dvc pull and make again? I think some of the metadata outputs are outdated.

ayyubibrahimi commented 1 year ago

Running make returned an error.

md5sum fuse/cross_agency.py > .deba/md5/fuse/cross_agency.py.md5
key not found
key not found
make: *** [wrgl.mk:4: pull_person] Error 1
pckhoi commented 1 year ago

So I tried make locally and I ran into this error:

running ner/post_officer_history_reports.py
    Traceback (most recent call last):
      File "/Users/khoipham/projects/PPACT/ner/post_officer_history_reports.py", line 54, in <module>
        trained_model = spacy.load(
      File "/Users/khoipham/.virtualenvs/base/lib/python3.9/site-packages/spacy/__init__.py", line 51, in load
        return util.load_model(
      File "/Users/khoipham/.virtualenvs/base/lib/python3.9/site-packages/spacy/util.py", line 427, in load_model
        raise IOError(Errors.E050.format(name=name))
    OSError: [E050] Can't find model 'data/ner/post/post_officer_history/model/post_officer_history_3.model'. It doesn't seem to be a Python package or a valid path to a data directory.
make: *** [.deba/deps/ner.d:13: data/ner/advocate_post_officer_history_reports.csv] Error 1

Perhaps this file wasn't uploaded?
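For reference, I believe the failing call looks roughly like the sketch below (the exact code in ner/post_officer_history_reports.py may differ). spacy.load raises E050 whenever its argument is neither an installed package nor an existing directory, which is what happens when a DVC-tracked model directory hasn't been pulled or pushed:

from pathlib import Path

import spacy

MODEL_PATH = Path("data/ner/post/post_officer_history/model/post_officer_history_3.model")

# If the DVC-tracked model hasn't been pulled (or was never pushed), the
# directory simply won't exist locally and spacy.load will raise E050.
if not MODEL_PATH.exists():
    raise FileNotFoundError(f"{MODEL_PATH} is missing; try dvc pull first")

trained_model = spacy.load(MODEL_PATH)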

ayyubibrahimi commented 1 year ago

Hm. I just attempted to dvc push the file post_officer_history_3.model again, but it seems that everything is up to date. Will look into it further.

ayyubibrahimi commented 1 year ago

You should be able to run make now without an error.

pckhoi commented 1 year ago

Saw this during make:

make: Circular data/fuse/personnel_pre_post.csv <- data/match/post_officer_history.csv dependency dropped.
make: Circular data/fuse/allegation.csv <- data/fuse/allegation.csv dependency dropped.
make: Circular data/fuse/event_pre_post.csv <- data/fuse/event_pre_post.csv dependency dropped.
make: Circular data/fuse/event_pre_post.csv <- data/fuse/personnel_pre_post.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/use_of_force.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/match/post_officer_history.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/allegation.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/event_pre_post.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/personnel_pre_post.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/use_of_force.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/match/post_officer_history.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/allegation.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/event_pre_post.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/personnel_pre_post.csv dependency dropped.
make: Circular data/fuse/use_of_force.csv <- data/fuse/use_of_force.csv dependency dropped.

I think the fuse/post_officer_history.py script is problematic. I saw it both read and write fuse/use_of_force.csv. Perhaps this logic should live in another stage altogether. Basically, we want all the dependencies to form a directed acyclic graph (DAG); see the sketch below. Drawing out the dependencies in a graphing application like draw.io could help.
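To illustrate, here is a hypothetical sketch (the input paths are real, but the transformation, join key, and output name are made up for illustration):

import pandas as pd

post = pd.read_csv("data/match/post_officer_history.csv")
uof = pd.read_csv("data/fuse/use_of_force.csv")

# Enrich use-of-force rows with POST history columns (illustrative join key).
enriched = uof.merge(post, on="uid", how="left")

# Circular: overwriting an input makes the same file both a prerequisite and a
# target, which is why Make drops the dependency.
# enriched.to_csv("data/fuse/use_of_force.csv", index=False)

# Acyclic alternative: write to a new downstream artifact (hypothetical name)
# and have later stages depend on that file instead.
enriched.to_csv("data/fuse/use_of_force_post_enriched.csv", index=False)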

pckhoi commented 1 year ago

Even if it's not causing problems right now, it might not produce the results that you think it does.

pckhoi commented 1 year ago

Looks like fuse/all.py also has a circular dependency. It is running 3 times for me now. Perhaps a tool like this could help: https://github.com/lindenb/makefile2graph

ayyubibrahimi commented 1 year ago

Thanks for the reminder. We discussed the fuse/post_officer_history.py script a while ago as a temporary solution. Agreed that it's time to find a permanent solution. I'll begin making those changes.

ayyubibrahimi commented 1 year ago

Apologies if you ran into another error. make should now run without error on your end.

pckhoi commented 1 year ago

Do you need me to look at the error over the weekend?

ayyubibrahimi commented 1 year ago

That would be great. The new process-data error is:

/home/runner/.local/bin//gsutil -m rsync -i -J -r gs://k8s-ocr-jobqueue-results/ocr/ data/ocr_results/
CommandException: arg (data/ocr_results/) does not name a directory, bucket, or bucket subdir.
If there is an object with the same path, please add a trailing
slash to specify the directory.
make: *** [Makefile:33: ocr_results] Error 1
Error: Process completed with exit code 2.

And on my local when I run make I now receive the following error:

md5sum fuse/cross_agency.py > .deba/md5/fuse/cross_agency.py.md5
key not found
key not found
make: *** [wrgl.mk:4: pull_person] Error 1

I'll continue to attempt to debug these errors. Thanks!

I plan to address the circular dependencies by adding a new stage after this PR is merged.

pckhoi commented 1 year ago

I ran everything fine locally. Can you run wrgl pull --all and show me the output?

pckhoi commented 1 year ago

Looks like you're still training ner/post/post_officer_history/model/post_officer_history.model. Please finish training and push that file to DVC.

pckhoi commented 1 year ago

Also, please review my commits. deba.data should be used whenever you refer to any file inside the data folder; otherwise the job won't run correctly.
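For example (a sketch; I'm assuming deba.data resolves a path relative to the data folder, which is how the other scripts use it):

import pandas as pd

import deba

# Hard-coding the relative path only works when the script happens to run from
# the repository root:
# df = pd.read_csv("data/fuse/use_of_force.csv")

# Going through deba.data lets deba resolve the location and track the
# dependency, so the job runs correctly:
df = pd.read_csv(deba.data("fuse/use_of_force.csv"))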

ayyubibrahimi commented 1 year ago

Thanks Khoi. Below is the error returned when I run wrgl pull --all

error fetching objects: error poping haves: GetCommit 2b6c52f53a18e0f0b232b63fddefa475 error: key not found

pckhoi commented 1 year ago

Just run it again and again until it succeeds. I reckon there's a bug, but for now it's not so serious that you cannot proceed.