Closed ayyubibrahimi closed 1 year ago
Looks like it was OOM-killed. Just rerun it I guess. Did you put any massive data transformation in this PR?
Have you rerun the OCR scripts locally and synced the new changes to DVC? If the job needs to run OCR, that would explain it.
I think it might make more sense to decide whether to go forward with Doctr first (and finish or close #408) before working on this PR.
Agreed. Will put this on hold.
Thanks for the insight re this likely being an OCR issue. PRs that do not involve OCR are working fine. I say we move forward with Doctr.
@ayyubibrahimi can you show me the full output when it was failing?
Can you be more specific? I'm not certain that it has failed. Only that the job's status for the new ocr file read queued then processing.
That said, I'm currently setting up my new computer. I can re-try it/provide the full output from a new attempt after it is set up if that would be helpful.
oh, so you mean the output? Which file had this status?
On 1/26/2023 the ocr_status of data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv was queued. The status since 1/27/2023 has been processing. The same has occurred for data/ocr/post_officer_history_reports_9_30_2022_pdfs.csv.
Review data changes at tx/d3ae4aa5-3e51-4d73-8752-a855dd99f1c9
When this PR is merged, this transaction will be applied.
Right now I'm having an issue with dvc pull. I got this error:
ERROR: unexpected error - 'NoneType' object has no attribute 'get_extra_info'
Did you get the same thing?
Hmm. No, I didn't get the same. Just cloned the repo and successfully ran dvc pull on the new computer. That said, something weird did happen. The raw_post_officer_history_reports_1_26_2023.dvc file was pulled, but the folder containing the data was not, namely the folder named data/raw_post_officer_history_reports_1_26_2023. I recently pushed the data and I've had a poor internet connection the past few days, which might be responsible for this specific issue.
Let me get finished setting up on this computer, re-run the ocr, and see if the ocr step continues to not work on my end. Having an issue installing k8s-ocr-jobqueue. Opening an issue now.
Review data changes at tx/885ae066-d963-4a62-b9c4-83bd84c02c28
When this PR is merged, this transaction will be applied.
@pckhoi Can you run dvc pull again? / check if the ocr scripts are indeed now processing as they claim to be on local.
@ayyubibrahimi I'm still seeing the same error: AttributeError: 'NoneType' object has no attribute 'get_extra_info'. Can you confirm that you're not seeing this error on your end?
Confirmed that I am not seeing that error on my end. I have also successfully pushed and pulled new data with dvc without error.
Turns out it was my VPN that prevented dvc pull. Looks like I'm pulling again now. Will continue to investigate the issue you raised.
@ayyubibrahimi so did you push the latest code? I can't seem to find that file from my output.
$ make data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv
.deba/deps/fuse.d:49: warning: overriding recipe for target 'data/fuse/post_carencro_pd.csv'
.deba/deps/fuse.d:43: warning: ignoring old recipe for target 'data/fuse/post_carencro_pd.csv'
.deba/deps/fuse.d:49: warning: overriding group membership for target 'data/fuse/post_carencro_pd.csv'
make: *** No rule to make target 'data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv'. Stop.
Apologies. Just pushed the latest code @pckhoi. Looks like there is an error at the wrgl pull stage that I don't follow.
So I still cannot find that file. Are you sure the name is data/ocr/post_officer_history_reports_1_26_2023_pdfs.csv?
I inspected the outputs of ocr/post_officer_history_reports.py with this neat snippet:
import pandas as pd
import deba

# OCR output CSVs written by ocr/post_officer_history_reports.py
names = [
    "ocr/post_officer_history_reports_pdfs.csv",
    "ocr/post_officer_history_reports_9_16_2022_pdfs.csv",
    "ocr/post_officer_history_reports_2023_pdfs.csv",
    "ocr/post_officer_history_reports_advocate_pdfs.csv",
]
for name in names:
    print(name)
    _df = pd.read_csv(deba.data(name))
    # Print the distinct ocr_status values found in this file
    print(_df.ocr_status.unique())
    print()
And I found out that only data/ocr/post_officer_history_reports_2023_pdfs.csv has nothing but processing rows. I requeued this file and it seems to be processing just fine. Sometimes the OCR job runs out of memory and gets killed. If you ever see a file that takes longer than 1 day, feel free to just requeue it like this and check again in a few hours:
OCR_REQUEUE=true make data/ocr/post_officer_history_reports_2023_pdfs.csv
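To make that check repeatable, here is a rough sketch that scans OCR output CSVs and prints the requeue command for any file whose rows are all still processing. It uses only the standard library. The function names and the data/ocr glob pattern are my own assumptions, not part of the repo, and it only knows about the ocr_status column and the processing value mentioned in this thread.

```python
import csv
from pathlib import Path


def status_counts(csv_path):
    """Count rows per ocr_status value in one OCR output CSV."""
    counts = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            status = row.get("ocr_status", "")
            counts[status] = counts.get(status, 0) + 1
    return counts


def is_stuck(counts):
    """True when the file has rows and every row is still 'processing'."""
    return bool(counts) and set(counts) == {"processing"}


def requeue_command(csv_path):
    """Build the requeue command quoted in this thread for a given target."""
    return f"OCR_REQUEUE=true make {csv_path}"


if __name__ == "__main__":
    # Hypothetical location and pattern: adjust to where the OCR CSVs live.
    for path in sorted(Path("data/ocr").glob("*_pdfs.csv")):
        if is_stuck(status_counts(path)):
            print(requeue_command(path.as_posix()))
```

This only reports candidates. Whether a file has really been stuck for more than a day still needs a manual look, since the CSV itself carries no timestamps that I know of.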
Another issue that I'm seeing is that the files Batch_1_IPNO.pdf, Batch_2_IPNO.pdf, Batch_3_IPNO.pdf, and Batch_4_IPNO.pdf are all rotated 90 degrees and Doctr cannot handle those files. I think we need to write a script to mass rotate those files. If you want to try your hand, I suggest using the library PyPDF2: https://gist.github.com/jb0hn/760447d7737555793efe48fb4192802c
If you do decide to write a script, put it under the scripts folder.
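As a starting point, a mass-rotation script could look roughly like the sketch below. It assumes PyPDF2 3.x (PdfReader, PdfWriter, page.rotate); older releases use different names. The file name scripts/rotate_pdfs.py, the command-line arguments, and the default of 90 degrees clockwise are all placeholders, since the thread does not say which direction the pages are turned.

```python
import sys
from pathlib import Path


def normalize_rotation(degrees):
    """Map any multiple of 90 to 0, 90, 180 or 270; reject anything else."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return degrees % 360


def rotate_pdf(src, dst, degrees):
    """Write a copy of src to dst with every page rotated clockwise."""
    # Imported here so normalize_rotation works without PyPDF2 installed.
    # Assumes the PyPDF2 3.x API.
    from PyPDF2 import PdfReader, PdfWriter

    angle = normalize_rotation(degrees)
    reader = PdfReader(str(src))
    writer = PdfWriter()
    for page in reader.pages:
        page.rotate(angle)  # clockwise, in multiples of 90
        writer.add_page(page)
    with open(dst, "wb") as f:
        writer.write(f)


# Hypothetical usage: python scripts/rotate_pdfs.py <in_dir> <out_dir> [degrees]
if __name__ == "__main__" and len(sys.argv) >= 3:
    in_dir, out_dir = Path(sys.argv[1]), Path(sys.argv[2])
    degrees = int(sys.argv[3]) if len(sys.argv) > 3 else 90
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("Batch_*_IPNO.pdf")):
        rotate_pdf(src, out_dir / src.name, degrees)
```

Writing the rotated copies to a separate output directory keeps the original PDFs untouched, so a wrong guess about the direction can be fixed by rerunning with a different angle.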
Also please split the script ocr/post_officer_history_reports.py into one script per output. Otherwise, when you want to requeue, it will requeue everything in that script. Not a big deal, just a little less efficient.
Apologies for the confusion. I renamed data/ocr/post_officer_history_reports_1_26_2023_pdfs.cs to data/ocr/post_officer_history_reports_2023_pdfs.csv.
Thanks for the advice re re-queuing. Sounds like the issue.
I'll split out the ocr scripts into one script per output and write a script to deal with the PDFs that need to be rotated. Thanks Khoi!
Confirmed that the script has successfully processed.
@pckhoi Looks like the issue is that my credentials aren't working atm. See below. Can you please re-queue ocr/post_officer_history_reports_2023_rotated.py while I fix? Thanks.
I requeued the script. Should be finished in a few hours.
Review data changes at tx/98d03dc4-ecbb-4c55-b944-e124815a2b67
When this PR is merged, this transaction will be applied.
Review data changes at tx/a3fd1ae2-5852-4570-b5f7-f0e987fc87e3
When this PR is merged, this transaction will be applied.
Review data changes at tx/9b10aefb-2887-45b8-be7e-c142849e120f
When this PR is merged, this transaction will be applied.
Transaction tx/9b10aefb-2887-45b8-be7e-c142849e120f applied.
Hi @pckhoi, new error at the process-data stage where the runner is being cancelled. Please advise. Thanks! https://github.com/ipno-llead/processing/actions/runs/3734909415