pckhoi closed this pull request 1 year ago
Review data changes at tx/a4467f78-e0c5-4740-8362-e50158b20480
When this PR is merged, this transaction will be applied.
@ayyubibrahimi looks like the ner output has changed since OCR is now produced by Doctr. Can you please go through the scripts in the ner stage, inspect their output, and make sure they are not outputting anything strange? 🙏
The error from the process-data job looks to me like the output of the ner stage has changed greatly. To update any output from the ner stage, invoke the script like this:
OCR_ENSURE_COMPLETE=true make -B data/ner/post_officer_history_reports_9_16_2022.csv
OCR_ENSURE_COMPLETE=true makes the script error out if there are PDF files that have not been completely processed. -B remakes the file and all of its dependencies.
Hey @pckhoi, you're right that the output of the ocr and ner stages is strange. The spaCy model is now essentially useless (attached is an example of the data the model was trained on). That said, I think it would be possible to produce a new model that works well because of the amount of training data we have, but it would definitely be difficult with less training data, since the output isn't as uniform as before.
@ayyubibrahimi I think Doctr should definitely produce better OCR results than Tesseract. Lemme finish the vscode extension so you can check.
If Doctr is better, then how much effort do you think it would take to retrain the spaCy model? Should we pace it out by keeping the Tesseract OCR results for now?
I agree with you. I think that the format of the post officer history report is the issue. I say we move forward with Doctr. I'll begin working on the new model.
Awesome! Feel free to commit to this branch.
Cool. Should be good in the next couple of few days.
@pckhoi I misspoke. Doctr won't work for the post officer history reports because it introduces ambiguity regarding the left_reason. See the attached examples. In the Tesseract example, it's clear from where and when Bobbit resigned.
Doctr_OCR_POST_Officer_History_Report.txt Tessertact_OCR_POST_Officer_History_Report.txt
Just reviewed the OCR output for other data. It seems that the issue isn't limited to the post officer history reports, as the same formatting issue is present in the LSP reports. The issue, as I see it, is that Doctr processes successive strings as headers, and then subsequent successive strings as the body text of those headers, despite the original formatting which is usually one long string of text.
As an example, see how to, from, and subject appear on page 1 of the PDF vs. how they are processed in Doctr (and Tesseract). See previous discipline on page 7 for another example. This formatting introduces ambiguity, similar to the post officer history reports, where it is difficult to determine what event a date or action corresponds to.
Doctr_beasley_william_investigative_report.txt Tesseract_beasley_william_investigative_report.txt beasley_william_investigative_report.pdf
Doctr outputs blocks of text. Each block consists of lines, and each line consists of words. Blocks, lines, and words are all accompanied by geometry information (coordinates and dimensions). I originally concatenated the words while discarding all geometry information. Perhaps taking the geometry of each word and line into account while outputting text would yield much better results. I could begin working on a better heuristic to concatenate text. What do you think?
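For reference, Doctr's exported result is a nested structure of blocks → lines → words, where each word carries a `value` and a relative-coordinate geometry. Here's a minimal sketch of the naive concatenation described above; the `sample_page` dict is a hand-made stand-in for one page of a real `Document.export()` result, not actual Doctr output:

```python
# Flatten a Doctr-style page export into plain text, discarding geometry.
# Structure mimics Doctr's Document.export(): blocks -> lines -> words,
# each word with a "value" and a ((x0, y0), (x1, y1)) relative geometry.

def page_to_text(page: dict) -> str:
    lines_out = []
    for block in page["blocks"]:
        for line in block["lines"]:
            # Naive approach: join word values, ignore all geometry.
            lines_out.append(" ".join(w["value"] for w in line["words"]))
    return "\n".join(lines_out)

sample_page = {
    "blocks": [
        {"lines": [
            {"words": [
                {"value": "JOHN", "geometry": ((0.10, 0.05), (0.20, 0.08))},
                {"value": "BEL", "geometry": ((0.21, 0.05), (0.26, 0.08))},
                {"value": "EDWARDS", "geometry": ((0.27, 0.05), (0.40, 0.08))},
            ]},
        ]},
        {"lines": [
            {"words": [
                {"value": "GOVERNOR", "geometry": ((0.10, 0.09), (0.25, 0.12))},
            ]},
        ]},
    ]
}

print(page_to_text(sample_page))  # JOHN BEL EDWARDS\nGOVERNOR
```

This is exactly the information the naive concatenation throws away: the geometry on each word is available but unused.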
Sounds perfect.
@pckhoi I assume you've put in more than 10 hours recently. Happy to look into these changes if that's cool with you.
Do you mean changing the line concatenation heuristic? Sure. It will be a good computer science exercise for you haha.
@pckhoi Not much luck. I know that I need to concatenate lines whose geometric values indicate they sit on the same line of the page, but I'm unable to translate that into code. Please go for it when you have the time. Looking forward to reviewing.
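The "same line on the page" condition can be reduced to a vertical-overlap test on two bounding boxes. A hypothetical sketch (not from the repo), assuming Doctr-style relative coordinates `((x0, y0), (x1, y1))` with y growing downward; the `min_overlap` threshold is an illustrative choice:

```python
def same_visual_line(box_a, box_b, min_overlap=0.5):
    """True if two boxes overlap vertically enough to sit on one line.

    Boxes are ((x0, y0), (x1, y1)) with y growing downward, as in
    Doctr's relative-coordinate geometry. Two boxes count as the same
    line if their y-overlap covers at least `min_overlap` of the
    shorter box's height.
    """
    (_, a_top), (_, a_bottom) = box_a
    (_, b_top), (_, b_bottom) = box_b
    overlap = min(a_bottom, b_bottom) - max(a_top, b_top)
    shorter = min(a_bottom - a_top, b_bottom - b_top)
    return shorter > 0 and overlap / shorter >= min_overlap

# Two words at roughly the same height -> same line.
print(same_visual_line(((0.1, 0.10), (0.2, 0.13)),
                       ((0.3, 0.11), (0.4, 0.14))))  # True

# A word much lower on the page -> different line.
print(same_visual_line(((0.1, 0.10), (0.2, 0.13)),
                       ((0.3, 0.30), (0.4, 0.33))))  # False
```

Measuring overlap against the shorter box (rather than an absolute distance) keeps the test insensitive to font-size differences between the two words.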
Sure, I will contribute a solution toward the end of December if you don't mind.
No rush at all. Thanks.
@ayyubibrahimi I have put in a bottom-up layout engine that performs alright. For this page:
browning, dustin_notice_of_reprimand_page_1.pdf
This is the text extracted:
[
[
[
"JOHN BEL EDWARDS",
"GOVERNOR"
],
[
"KEVIN W. REEVES, COLONEL",
"DEPUTY SECRETARY"
]
],
[
[
"State of Louisiana",
"Department ofPublic Safety and Corrections",
"Public Safety Services"
]
],
[
[
"June 13,2019",
"4750/0501/R19-22437",
"HQ-2-1706-19"
]
],
[
[
"SERGEANT DUSTIN BROWNING",
"LSP- POLICE LOGISTICAL SERVICES"
],
[
"IA#19-017, OLA#063698",
"VIA PERSONAL DELIVERY"
]
],
[
[
"RE; LETTER OF REPRIMAND"
]
],
[
[
"Dear Sgt. Browning:"
]
],
[
[
"In accordance with State Police Commission Rules 12.1, 12.2, and the authority",
"delegated to me by Colonel Kevin Reeves, you are hereby formally reprimanded for the",
"following reasons:"
]
],
[
[
"During the course of an Internal Affairs (\"IA\") investigation into another matter, IA",
"investigators learned you failed to complete the required quarterly reconciliation reports of your",
"Investigative Expense (\"IE\") account while you were a Sergeant in the Region I Special",
"Victim's Unit. During a re\u00e7orded interview' with IA investigators on May 6, 2019, you admitted",
"that you and Lt. Chad Gremillion had discussed completing quarterly reconciliation reports but",
"had never actually done SO. You stated you did not know you had to complete quarterly",
"reconciliation reports because you believed it was Lt. Gremillion's responsibility to do so."
]
],
[
[
"As a Sergeant assigned to the Special Victim's Unit within the Special Investigations",
"Division of State Police, your duties included receiving, maintaining, and dispensing, IE funds,",
"and preparing quarterly reconciliation reports of your IE account. The Bureau of Investigation",
"Policy No. 4.10- Investigative Expense Money states in pertinent part:"
]
],
[
[
"Accountability of Investigative Expense Funds",
"The custodian shall reconcile the investigative expense fund quarterly.",
"In the reconciliation, each officer shall account for all funds disbursed",
"tol him.",
"A quarterly reconciliation report shall be prepared by each field",
"supervisor maintaining a portion of the investigative expense fund.",
"The reconciliation report shall be verified by the investigative expense",
"fund custodian utilizing reconciliation form."
]
],
[
[
"DPSSP4117"
],
[
"Acopy oft this interview is maintained by Internal Affairs and is available for your review upon your request.",
"COURTESY . LOYALTY . SERVICE",
"\"An Equal Opportunily Employer\"",
"P.O. BOX66614, BATON ROUGE, LOUISIANA 70896"
]
]
]
Rather than concatenating all the text, I organized lines into blocks and then paragraphs. The top level is the list of paragraphs, ordered from top to bottom. Each paragraph in turn is a list of blocks, ordered from left to right. Two blocks are placed in the same paragraph if their y geometries overlap.
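That grouping can be sketched roughly as follows. This is not the actual implementation: the block representation `(y_top, y_bottom, x_left, lines)` and the function name are assumptions for illustration, but the logic matches the description above (paragraphs top to bottom, blocks left to right, same paragraph iff y-ranges overlap):

```python
def group_into_paragraphs(blocks):
    """Group text blocks into paragraphs by overlapping y-ranges.

    Each block is (y_top, y_bottom, x_left, lines). The output mirrors
    the structure shown above: a list of paragraphs (top to bottom),
    each a list of blocks (left to right), each block a list of lines.
    """
    blocks = sorted(blocks, key=lambda b: b[0])  # top to bottom
    paragraphs = []
    for top, bottom, left, lines in blocks:
        current = paragraphs[-1] if paragraphs else None
        # Same paragraph if this block's y-range overlaps the last one's.
        if current and top < current["bottom"] and bottom > current["top"]:
            current["blocks"].append((left, lines))
            current["top"] = min(current["top"], top)
            current["bottom"] = max(current["bottom"], bottom)
        else:
            paragraphs.append({"top": top, "bottom": bottom,
                               "blocks": [(left, lines)]})
    # Order blocks within each paragraph left to right, drop geometry.
    return [[lines for _, lines in sorted(p["blocks"])]
            for p in paragraphs]

demo = [
    (0.05, 0.10, 0.1, ["JOHN BEL EDWARDS", "GOVERNOR"]),
    (0.06, 0.11, 0.6, ["KEVIN W. REEVES, COLONEL", "DEPUTY SECRETARY"]),
    (0.20, 0.25, 0.3, ["State of Louisiana"]),
]
print(group_into_paragraphs(demo))
```

On the demo input, the first two blocks overlap vertically and land in one paragraph side by side (like the governor/deputy-secretary letterhead above), while the third starts a new paragraph.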
So I think it performs alright. It does have edge cases it really struggles with, such as when there are lots of redactions. Please feel free to go through and inspect the results for yourself.
The script ocr/louisiana_state_pd_letters_2019.py is now integrated with the spot-check extension, so make sure to install it. Also, ocr scripts no longer download OCR results by themselves, so make sure to run make ocr_results beforehand.
If this layout solution isn't good enough, then it's time to explore object detection (this solution with detectron2 looks really promising).
Thanks @pckhoi! Will review tomorrow and circle back with any comments.
@pckhoi ocr/louisiana_state_pd_letters_2019.py looks great. So do all the other scripts in the ocr stage, except for ocr/post_officer_history_reports. This is unfortunate because I expect all future document types to be similar in structure to ocr/louisiana_state_pd_letters_2019.py; post_officer_history_reports are therefore an anomaly. I have an idea for producing new training data for the model in ner/post_officer_history_reports that I hope will solve the issue, so this seems like a great place to be. Thanks again.
No problem. If needed, we could always fall back to concatenating everything the way Tesseract does, as an alternative strategy for anything that performs really badly.
That works for me.
@ayyubibrahimi I have updated the GitHub runner to include the Doctr requirements. Please prioritize updating the ner scripts (error shown in the process-data job).
Review data changes at tx/200b3a26-be0e-4dd9-b1b4-c2e7b1b31ac3
When this PR is merged, this transaction will be applied.
Looks good @pckhoi
Awesome! Please merge this branch as soon as possible (you're the main maintainer now).
Transaction tx/200b3a26-be0e-4dd9-b1b4-c2e7b1b31ac3 applied.
This PR should completely integrate the OCR job queue into the current pipeline. Note that from now on, we need to run each OCR script at least twice: the first time to enqueue PDF files for processing, and a second time to fetch the results. Perhaps I should make the OCR scripts fail if there are unprocessed PDFs.