Enhancement/post ohrs - Githubissues

ayyubibrahimi commented 1 year ago

Hi @pckhoi new error at the process-data stage where the runner is being cancelled. Please advise. Thanks!

https://github.com/ipno-llead/processing/actions/runs/3734909415

pckhoi commented 1 year ago

Looks like it was OOM-killed. Just rerun it I guess. Did you put any massive data transformation in this PR?

pckhoi commented 1 year ago

Have you rerun the OCR scripts locally and synced the new changes to DVC? If the job needs to run OCR, that would explain it.

pckhoi commented 1 year ago

I think it might make more sense to decide whether to go forward with Doctr first (and finish or close #408) before working on this PR.

ayyubibrahimi commented 1 year ago

Agreed. Will put this on hold.

ayyubibrahimi commented 1 year ago

Thanks for the insight re this likely being an OCR issue. PRs that do not involve OCR are working fine. I say we move forward with Doctr.

pckhoi commented 1 year ago

@ayyubibrahimi can you show me the full output when it was failing?

ayyubibrahimi commented 1 year ago

Can you be more specific? I'm not certain that it has failed, only that the job's status for the new OCR file read queued, then processing. That said, I'm currently setting up my new computer. I can retry it and provide the full output from a new attempt once it is set up, if that would be helpful.

pckhoi commented 1 year ago

oh, so you mean the output? Which file had this status?

ayyubibrahimi commented 1 year ago

On 1/26/2023 the ocr_status of data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv was queued. The status since 1/27/2023 has been processing. The same has occurred for data/ocr/post_officer_history_reports_9_30_2022_pdfs.csv

github-actions[bot] commented 1 year ago

Review data changes at tx/d3ae4aa5-3e51-4d73-8752-a855dd99f1c9

When this PR is merged, this transaction will be applied.

pckhoi commented 1 year ago

Right now I'm having an issue with dvc pull. I got this error:

ERROR: unexpected error - 'NoneType' object has no attribute 'get_extra_info'

Did you get the same thing?

ayyubibrahimi commented 1 year ago

Hmm. No, I didn't get the same error. I just cloned the repo and successfully ran dvc pull on the new computer. That said, something weird did happen. The raw_post_officer_history_reports_1_26_2023.dvc file was pulled, but the folder containing the data, data/raw_post_officer_history_reports_1_26_2023, was not. I recently pushed the data and I've had a poor internet connection the past few days, which might be responsible for this specific issue.

Let me get finished setting up on this computer, re-run the ocr, and see if the ocr step continues to not work on my end. Having an issue installing k8s-ocr-jobqueue. Opening an issue now.

github-actions[bot] commented 1 year ago

Review data changes at tx/885ae066-d963-4a62-b9c4-83bd84c02c28

When this PR is merged, this transaction will be applied.

ayyubibrahimi commented 1 year ago

@pckhoi Can you run dvc pull again and check whether the OCR scripts are actually processing locally, as they claim to be?

pckhoi commented 1 year ago

@ayyubibrahimi I'm still seeing the same error: AttributeError: 'NoneType' object has no attribute 'get_extra_info'. Can you confirm that you're not seeing this error on your end?

ayyubibrahimi commented 1 year ago

Confirmed that I am not seeing that error on my end. I have also successfully pushed and pulled new data with dvc without error.

pckhoi commented 1 year ago

Turns out it was my VPN that prevented dvc pull. Looks like I'm pulling again now. Will continue to investigate the issue you raised.

pckhoi commented 1 year ago

@ayyubibrahimi so did you push the latest code? I can't seem to find that file from my output.

$ make data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv
.deba/deps/fuse.d:49: warning: overriding recipe for target 'data/fuse/post_carencro_pd.csv'
.deba/deps/fuse.d:43: warning: ignoring old recipe for target 'data/fuse/post_carencro_pd.csv'
.deba/deps/fuse.d:49: warning: overriding group membership for target 'data/fuse/post_carencro_pd.csv'
make: *** No rule to make target 'data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv'.  Stop.

ayyubibrahimi commented 1 year ago

Apologies. Just pushed the latest code @pckhoi. Looks like there is an error at the wrgl pull stage that I don't follow.

pckhoi commented 1 year ago

So I still cannot find that file. Are you sure the name is data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv?

pckhoi commented 1 year ago

I inspected the outputs of ocr/post_officer_history_reports.py with this neat snippet:

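import pandas as pd
import deba

# OCR output CSVs produced by ocr/post_officer_history_reports.py
names = [
    "ocr/post_officer_history_reports_pdfs.csv",
    "ocr/post_officer_history_reports_9_16_2022_pdfs.csv",
    "ocr/post_officer_history_reports_2023_pdfs.csv",
    "ocr/post_officer_history_reports_advocate_pdfs.csv",
]
for name in names:
    print(name)
    # load the file from the data directory
    _df = pd.read_csv(deba.data(name))
    # list the distinct OCR statuses present in this file
    print(_df.ocr_status.unique())
    print()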

And I found that only data/ocr/post_officer_history_reports_2023_pdfs.csv has nothing but processing rows. I requeued this file and it seems to be processing just fine. Sometimes the OCR job runs out of memory and gets killed. If you ever see a file that takes longer than 1 day, feel free to requeue it like this and check again in a few hours:

OCR_REQUEUE=true make data/ocr/post_officer_history_reports_2023_pdfs.csv
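
To make the "nothing but processing rows" check repeatable, here is a minimal sketch that uses only the standard library. It assumes each OCR output CSV has an ocr_status column (as in the snippet above) and that the outputs sit in a directory such as data/ocr. The function names ocr_status_counts, looks_stuck, and report are hypothetical and not part of deba or the OCR job queue.

```python
import csv
from collections import Counter
from pathlib import Path


def ocr_status_counts(csv_path):
    """Count how many rows of an OCR output CSV carry each ocr_status value."""
    with open(csv_path, newline="") as f:
        return Counter(row["ocr_status"] for row in csv.DictReader(f))


def looks_stuck(counts):
    """True when the file has rows and every one of them is still 'processing'."""
    return bool(counts) and set(counts) == {"processing"}


def report(ocr_dir):
    """Return one summary line per *_pdfs.csv file found in ocr_dir.

    Assumption: the directory layout (e.g. data/ocr) and the *_pdfs.csv
    naming follow the file names mentioned in this thread.
    """
    lines = []
    for path in sorted(Path(ocr_dir).glob("*_pdfs.csv")):
        counts = ocr_status_counts(path)
        note = " (only 'processing' rows, consider requeueing)" if looks_stuck(counts) else ""
        lines.append(f"{path.name}: {dict(counts)}{note}")
    return lines
```

Calling print("\n".join(report("data/ocr"))) from the repo root would list each file with its status counts. A file flagged here is only a candidate for OCR_REQUEUE=true; it may simply still be running, so the one-day rule of thumb above still applies.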

Another issue that I'm seeing is that the files Batch_1_IPNO.pdf, Batch_2_IPNO.pdf, Batch_3_IPNO.pdf, and Batch_4_IPNO.pdf are all rotated 90 degrees and Doctr cannot handle those files. I think we need to write a script to mass rotate those files. If you want to try your hand, I suggest using the library PyPDF2: https://gist.github.com/jb0hn/760447d7737555793efe48fb4192802c

If you do decide to write a script, put it under the scripts folder.
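
As a starting point for that script, here is a rough sketch of a batch rotation. It assumes PyPDF2 3.x (PdfReader, PdfWriter, and PageObject.rotate, which rotates clockwise in multiples of 90 degrees). The input and output directories, the _rotated suffix, and the default angle are all assumptions: the thread does not say which direction the scans are turned, so 270 may be the right value instead of 90.

```python
from pathlib import Path

# File names come from the comment above; where they live is an assumption.
BATCH_FILES = [
    "Batch_1_IPNO.pdf",
    "Batch_2_IPNO.pdf",
    "Batch_3_IPNO.pdf",
    "Batch_4_IPNO.pdf",
]


def rotated_path(src, out_dir):
    """Return the output path for the rotated copy of src (hypothetical naming)."""
    src = Path(src)
    return Path(out_dir) / f"{src.stem}_rotated{src.suffix}"


def rotate_pdf(src, dst, angle=90):
    """Write a copy of src to dst with every page rotated clockwise by angle."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    # Imported here so the path helper can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(str(src))
    writer = PdfWriter()
    for page in reader.pages:
        page.rotate(angle)  # rotates the page object in place
        writer.add_page(page)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    with open(dst, "wb") as f:
        writer.write(f)


def rotate_all(in_dir, out_dir, names=BATCH_FILES, angle=90):
    """Rotate each named PDF in in_dir and return the list of written paths."""
    outputs = []
    for name in names:
        dst = rotated_path(name, out_dir)
        rotate_pdf(Path(in_dir) / name, dst, angle)
        outputs.append(dst)
    return outputs
```

Writing rotated copies to a separate directory, rather than overwriting the originals, keeps the raw files tracked by DVC unchanged and makes it easy to rerun with a different angle if the first guess is wrong.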

pckhoi commented 1 year ago

Also, please split the script ocr/post_officer_history_reports.py into one script per output. Otherwise, when you want to requeue one output, it will requeue everything in that script. Not a big deal, just a little less efficient.

ayyubibrahimi commented 1 year ago

Apologies for the confusion. I renamed data/ocr/post_officer_history_reports_1_26_2023_pdfs.cs to data/ocr/post_officer_history_reports_2023_pdfs.csv.

Thanks for the advice re re-queuing. Sounds like the issue.

I'll split out the ocr scripts into one script per output and write a script to deal with the PDFs that need to be rotated. Thanks Khoi!

ayyubibrahimi commented 1 year ago

Confirmed that the script has successfully processed.

ayyubibrahimi commented 1 year ago

@pckhoi Looks like the issue is that my credentials aren't working atm. See below. Can you please re-queue ocr/post_officer_history_reports_2023_rotated.py while I fix? Thanks.

pckhoi commented 1 year ago

I requeued the script. Should be finished in a few hours.

github-actions[bot] commented 1 year ago

Review data changes at tx/98d03dc4-ecbb-4c55-b944-e124815a2b67

When this PR is merged, this transaction will be applied.

github-actions[bot] commented 1 year ago

Review data changes at tx/a3fd1ae2-5852-4570-b5f7-f0e987fc87e3

When this PR is merged, this transaction will be applied.

github-actions[bot] commented 1 year ago

Review data changes at tx/9b10aefb-2887-45b8-be7e-c142849e120f

When this PR is merged, this transaction will be applied.

github-actions[bot] commented 1 year ago

Transaction tx/9b10aefb-2887-45b8-be7e-c142849e120f applied.

ipno-llead / processing

Enhancement/post ohrs #443