Feat/enable ocr jobqueue

pckhoi commented 1 year ago

This PR should completely integrate the OCR job queue into the current pipeline. Note that from now on, each OCR script needs to be run at least twice: the first time to enqueue pdf files for processing, and a second time to fetch the results. Perhaps I need to make the OCR scripts fail if there are unprocessed pdfs.
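To make the two-pass flow concrete, here is a minimal sketch. It assumes a toy in-memory queue; `FakeOcrQueue` and `run_ocr_script` are hypothetical names for illustration and are not the code in this PR.

```python
from typing import Dict, List, Optional


class FakeOcrQueue:
    """Toy stand-in for the real job queue (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        # pdf name -> OCR text, or None while the job is still pending
        self.results: Dict[str, Optional[str]] = {}

    def enqueue(self, pdf: str) -> None:
        self.results.setdefault(pdf, None)

    def process_all(self) -> None:
        # Simulates the OCR workers finishing between two script runs.
        for pdf in self.results:
            self.results[pdf] = f"text of {pdf}"


def run_ocr_script(queue: FakeOcrQueue, pdfs: List[str]) -> Dict[str, str]:
    """Enqueue unseen pdfs and return text only for those already processed."""
    done: Dict[str, str] = {}
    for pdf in pdfs:
        if pdf not in queue.results:
            queue.enqueue(pdf)
        elif queue.results[pdf] is not None:
            done[pdf] = queue.results[pdf]
    return done


queue = FakeOcrQueue()
pdfs = ["a.pdf", "b.pdf"]
first = run_ocr_script(queue, pdfs)   # first run: files are only enqueued
queue.process_all()                   # workers do the OCR in the meantime
second = run_ocr_script(queue, pdfs)  # second run: results can be fetched
```

The first run returns nothing because every file has only just been enqueued, which is why a single run is not enough.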

github-actions[bot] commented 1 year ago

Review data changes at tx/a4467f78-e0c5-4740-8362-e50158b20480

When this PR is merged, this transaction will be applied.

pckhoi commented 1 year ago

@ayyubibrahimi looks like the ner output has changed since ocr is now produced by doctr. Can you please go through the scripts in the ner stage, inspect their output and make sure they are not outputting anything strange? 🙏

The error from the process-data job looks to me like the output of the ner stage has changed greatly. To update any output from the ner stage, invoke the script like this:

OCR_ENSURE_COMPLETE=true make -B data/ner/post_officer_history_reports_9_16_2022.csv

OCR_ENSURE_COMPLETE=true will make the script error out if there are pdf files not completely processed. -B remakes the file and all of its dependencies.
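As a rough sketch of what such a check could look like inside a script (the function and exception names below are hypothetical, and only the OCR_ENSURE_COMPLETE variable comes from this PR):

```python
import os
from typing import Dict, Mapping, Optional


class IncompleteOcrError(RuntimeError):
    """Raised when some pdf files have not been processed yet."""


def collect_results(
    results: Dict[str, Optional[str]],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return finished OCR texts; error out on pending files if the flag is set."""
    env = os.environ if env is None else env
    pending = sorted(pdf for pdf, text in results.items() if text is None)
    # Assumption: the flag is considered set only when its value is "true".
    if pending and env.get("OCR_ENSURE_COMPLETE", "").lower() == "true":
        raise IncompleteOcrError(
            f"{len(pending)} pdf file(s) not processed yet: {pending}"
        )
    return {pdf: text for pdf, text in results.items() if text is not None}
```

Without the flag the script simply returns whatever is ready, which matches the enqueue-then-fetch behaviour described at the top of this PR.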

ayyubibrahimi commented 1 year ago

Hey @pckhoi you're right that the output of the ocr and ner stages is strange. The spacy model is now essentially useless (attached is an example of the data that the model has been trained on). That said, I think that it would be possible to produce a new model that works well because of the amount of training data that we have, but it would definitely be difficult if we had less training data because the output isn't as uniform as before.

example_ocr_post_officer_history_report.csv

pckhoi commented 1 year ago

@ayyubibrahimi I think Doctr should definitely produce better OCR results than Tesseract. Lemme finish the vscode extension so you can check.

If Doctr is better then how much effort do you think it would take to retrain the Spacy model? Should we pace it out by keeping the Tesseract OCR results for now?

ayyubibrahimi commented 1 year ago

I agree with you. I think that the format of the post officer history report is the issue. I say we move forward with Doctr. I'll begin working on the new model.

pckhoi commented 1 year ago

Awesome! Feel free to commit to this branch.

ayyubibrahimi commented 1 year ago

Cool. Should be good in the next couple of days.

ayyubibrahimi commented 1 year ago

@pckhoi I misspoke. Doctr won't work for the post officer history reports because it introduces ambiguity re the left_reason. See the attached examples. In the Tesseract example, it's clear from where and when Bobbit resigned.

Doctr_OCR_POST_Officer_History_Report.txt Tessertact_OCR_POST_Officer_History_Report.txt

ayyubibrahimi commented 1 year ago

Just reviewed the OCR output for other data. It seems that the issue isn't limited to the post officer history reports, as the same formatting issue is present in the LSP reports. The issue, as I see it, is that Doctr processes successive strings as headers, and then subsequent successive strings as the body text of those headers, despite the original formatting which is usually one long string of text.

As an example, see how to, from, and subject on page 1 of the PDF vs how it is processed in Doctr (and Tesseract). See previous discipline on page 7 for another example. This formatting introduces ambiguity, similar to the post officer history reports, where it is difficult to determine what event a date or action corresponds to.

Doctr_beasley_william_investigative_report.txt Tesseract_beasley_william_investigative_report.txt beasley_william_investigative_report.pdf

pckhoi commented 1 year ago

Doctr outputs blocks of text. Each block consists of lines. Each line consists of words. Blocks, lines, and words are all accompanied by geometry (coordinates and dimensions) information. I originally concatenated words while discarding all geometry information. Perhaps taking into account the geometry of each word and line while outputting text would yield much better results. I could begin working on a better heuristic to concatenate text. What do you think?
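To illustrate why discarding geometry hurts, here is a small sketch on hand-written data shaped like the block / line / word hierarchy. The field names ("value", "geometry", "lines", "words"), the coordinates, and the names in the text are all assumptions for illustration, not real OCR output.

```python
def word(value, xmin, ymin, xmax, ymax):
    # Geometry is ((xmin, ymin), (xmax, ymax)) in relative page coordinates.
    return {"value": value, "geometry": ((xmin, ymin), (xmax, ymax))}


def bbox(children):
    """Smallest box enclosing all child geometries."""
    return (
        (min(c["geometry"][0][0] for c in children),
         min(c["geometry"][0][1] for c in children)),
        (max(c["geometry"][1][0] for c in children),
         max(c["geometry"][1][1] for c in children)),
    )


def make_line(words):
    return {"geometry": bbox(words), "words": words}


def make_block(lines):
    return {"geometry": bbox(lines), "lines": lines}


# Two blocks side by side: a column of labels and a column of values.
blocks = [
    make_block([
        make_line([word("TO:", 0.05, 0.10, 0.10, 0.12)]),
        make_line([word("FROM:", 0.05, 0.14, 0.12, 0.16)]),
    ]),
    make_block([
        make_line([word("Capt.", 0.20, 0.10, 0.27, 0.12),
                   word("Smith", 0.28, 0.10, 0.36, 0.12)]),
        make_line([word("Lt.", 0.20, 0.14, 0.24, 0.16),
                   word("Jones", 0.25, 0.14, 0.33, 0.16)]),
    ]),
]


def concat_ignoring_geometry(blocks):
    """Join every word in block order, throwing the geometry away."""
    return " ".join(
        w["value"] for b in blocks for ln in b["lines"] for w in ln["words"]
    )


flat = concat_ignoring_geometry(blocks)
```

Here `flat` comes out with both labels first and both values after, so it is no longer obvious which value belongs to which label. That is the same kind of ambiguity reported above.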

ayyubibrahimi commented 1 year ago

Sounds perfect.

ayyubibrahimi commented 1 year ago

@pckhoi I assume that you've put in more than the 10 hours recently. Happy to look into these changes if it's cool with you.

pckhoi commented 1 year ago

Do you mean changing the line concatenation heuristic? Sure. It will be a good computer science exercise for you haha.

ayyubibrahimi commented 1 year ago

@pckhoi Little luck. I know that I need to concatenate lines whose geometric values indicate that they sit on the same line of the page, but I'm unable to translate that into code. Please go for it when you have the time. Look forward to reviewing.

pckhoi commented 1 year ago

Sure, I will contribute a solution toward the end of December if you don't mind.

ayyubibrahimi commented 1 year ago

No rush at all. Thanks.

pckhoi commented 1 year ago

@ayyubibrahimi I have put in a bottom-up layout engine that performs alright. For this page:

browning, dustin_notice_of_reprimand_page_1.pdf

This is the text extracted:

[
  [
    [
      "JOHN BEL EDWARDS",
      "GOVERNOR"
    ],
    [
      "KEVIN W. REEVES, COLONEL",
      "DEPUTY SECRETARY"
    ]
  ],
  [
    [
      "State of Louisiana",
      "Department ofPublic Safety and Corrections",
      "Public Safety Services"
    ]
  ],
  [
    [
      "June 13,2019",
      "4750/0501/R19-22437",
      "HQ-2-1706-19"
    ]
  ],
  [
    [
      "SERGEANT DUSTIN BROWNING",
      "LSP- POLICE LOGISTICAL SERVICES"
    ],
    [
      "IA#19-017, OLA#063698",
      "VIA PERSONAL DELIVERY"
    ]
  ],
  [
    [
      "RE; LETTER OF REPRIMAND"
    ]
  ],
  [
    [
      "Dear Sgt. Browning:"
    ]
  ],
  [
    [
      "In accordance with State Police Commission Rules 12.1, 12.2, and the authority",
      "delegated to me by Colonel Kevin Reeves, you are hereby formally reprimanded for the",
      "following reasons:"
    ]
  ],
  [
    [
      "During the course of an Internal Affairs (\"IA\") investigation into another matter, IA",
      "investigators learned you failed to complete the required quarterly reconciliation reports of your",
      "Investigative Expense (\"IE\") account while you were a Sergeant in the Region I Special",
      "Victim's Unit. During a re\u00e7orded interview' with IA investigators on May 6, 2019, you admitted",
      "that you and Lt. Chad Gremillion had discussed completing quarterly reconciliation reports but",
      "had never actually done SO. You stated you did not know you had to complete quarterly",
      "reconciliation reports because you believed it was Lt. Gremillion's responsibility to do so."
    ]
  ],
  [
    [
      "As a Sergeant assigned to the Special Victim's Unit within the Special Investigations",
      "Division of State Police, your duties included receiving, maintaining, and dispensing, IE funds,",
      "and preparing quarterly reconciliation reports of your IE account. The Bureau of Investigation",
      "Policy No. 4.10- Investigative Expense Money states in pertinent part:"
    ]
  ],
  [
    [
      "Accountability of Investigative Expense Funds",
      "The custodian shall reconcile the investigative expense fund quarterly.",
      "In the reconciliation, each officer shall account for all funds disbursed",
      "tol him.",
      "A quarterly reconciliation report shall be prepared by each field",
      "supervisor maintaining a portion of the investigative expense fund.",
      "The reconciliation report shall be verified by the investigative expense",
      "fund custodian utilizing reconciliation form."
    ]
  ],
  [
    [
      "DPSSP4117"
    ],
    [
      "Acopy oft this interview is maintained by Internal Affairs and is available for your review upon your request.",
      "COURTESY . LOYALTY . SERVICE",
      "\"An Equal Opportunily Employer\"",
      "P.O. BOX66614, BATON ROUGE, LOUISIANA 70896"
    ]
  ]
]

Rather than concatenating all the text, I organized lines into blocks and then blocks into paragraphs. The top level is the list of paragraphs ordered from top to bottom. Each paragraph in turn is a list of blocks ordered from left to right. Two blocks are placed in the same paragraph if their y ranges overlap.
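The grouping rule can be sketched roughly as follows. This is a simplified illustration and not the actual layout engine: the tuple layout and the coordinates are made up, and each block is assumed to already carry its lines of text.

```python
from typing import List, Tuple

# (xmin, ymin, ymax, lines): a simplified block with made-up relative coordinates
Block = Tuple[float, float, float, List[str]]


def group_into_paragraphs(blocks: List[Block]) -> List[List[List[str]]]:
    """Group blocks whose y ranges overlap, top to bottom, then left to right."""
    paragraphs = []  # each entry: [largest ymax seen so far, member blocks]
    for block in sorted(blocks, key=lambda b: b[1]):  # top to bottom by ymin
        _, ymin, ymax, _ = block
        if paragraphs and ymin < paragraphs[-1][0]:
            # Starts before the current paragraph ends: y ranges overlap.
            paragraphs[-1][0] = max(paragraphs[-1][0], ymax)
            paragraphs[-1][1].append(block)
        else:
            paragraphs.append([ymax, [block]])
    return [
        [b[3] for b in sorted(members, key=lambda b: b[0])]  # left to right
        for _, members in paragraphs
    ]


letterhead = [
    (0.70, 0.05, 0.09, ["KEVIN W. REEVES, COLONEL", "DEPUTY SECRETARY"]),
    (0.30, 0.10, 0.16, ["State of Louisiana", "Public Safety Services"]),
    (0.05, 0.05, 0.09, ["JOHN BEL EDWARDS", "GOVERNOR"]),
]
paragraphs = group_into_paragraphs(letterhead)
```

With these coordinates the two header blocks share a y range and end up in the first paragraph, left block first, while the centred block below them forms its own paragraph.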

So I think it performs alright. It does have edge cases that it really struggles with such as when there are lots of redactions. Please feel free to go through and inspect the result for yourself.

The script ocr/louisiana_state_pd_letters_2019.py is now integrated with spot-check extension so make sure to install it.

Also, the ocr scripts no longer download ocr results by themselves, so make sure to run make ocr_results beforehand.

If this layout solution isn't good enough then it's time to explore object detection (this solution with detectron2 looks really promising).

ayyubibrahimi commented 1 year ago

Thanks @pckhoi! Will review tomorrow and circle back with any comments.

ayyubibrahimi commented 1 year ago

@pckhoi ocr/louisiana_state_pd_letters_2019.py looks great. So do all other scripts in the ocr stage, except for the ocr/post_officer_history_reports. This is unfortunate because I expect all future document types to be similar in structure to ocr/louisiana_state_pd_letters_2019.py. post_officer_history_reports are therefore an anomaly. I have an idea re producing new training data for the model in ner/post_officer_history_reports that I hope will solve the issue, so this to me seems like a great place to be. Thanks again.

pckhoi commented 1 year ago

No problem. If needed, we could always try to concatenate everything like Tesseract as an alternative strategy for anything that performs really badly.
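A fallback along those lines could be as small as the sketch below, assuming the nested paragraphs / blocks / lines list shown earlier in this thread. The `flatten` helper is a hypothetical name and not something in the repository.

```python
from typing import List


def flatten(paragraphs: List[List[List[str]]]) -> str:
    """Collapse paragraphs -> blocks -> lines into plain text.

    Lines are joined with a newline and paragraphs with a blank line.
    """
    return "\n\n".join(
        "\n".join(line for block in paragraph for line in block)
        for paragraph in paragraphs
    )


sample = [
    [["JOHN BEL EDWARDS", "GOVERNOR"],
     ["KEVIN W. REEVES, COLONEL", "DEPUTY SECRETARY"]],
    [["Dear Sgt. Browning:"]],
]
text = flatten(sample)
```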

ayyubibrahimi commented 1 year ago

That works for me.

pckhoi commented 1 year ago

@ayyubibrahimi I have updated the Github runner to include Doctr requirements. Please prioritize updating the ner scripts (error shown in the process-data job).

github-actions[bot] commented 1 year ago

Review data changes at tx/200b3a26-be0e-4dd9-b1b4-c2e7b1b31ac3

When this PR is merged, this transaction will be applied.

ayyubibrahimi commented 1 year ago

Looks good @pckhoi

pckhoi commented 1 year ago

Awesome! Please merge this branch as soon as possible (you're the main maintainer now).

github-actions[bot] commented 1 year ago

Transaction tx/200b3a26-be0e-4dd9-b1b4-c2e7b1b31ac3 applied.

ipno-llead / processing

Feat/enable ocr jobqueue #408