jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

PDF Plumber not extracting tables correctly (text is parsed line by line) #618

Closed: arthurthlee closed this issue 2 years ago

arthurthlee commented 2 years ago

Hi, I'm running into an issue with pdfplumber.

I am attempting to parse this document: https://healthalliance.org/Cms/Media?uri=https%3A%2F%2Fhealthalliance.org%2Fmedia%2Fresources%2Fmed-preauth-drugs.pdf


I am attempting to extract the "Products Affected" bullet points, along with each section and its text, such as "Indications -> All FDA-approved Indications."

I'm using the following table settings: horizontal_strategy: text, vertical_strategy: text, min_words_vertical: 2, and keep_blank_chars: 2.
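For reference, here is a minimal sketch of how these settings would typically be passed to pdfplumber. The helper names `extract_table_with_settings` and `extract_from_file`, the file path, and the page index are placeholders for illustration. The setting names and values are copied from the description above; `keep_blank_chars` is documented as a boolean flag, so the reported value of 2 is presumably treated as true. Whether a given pdfplumber version accepts every key has not been checked here.

```python
# Settings as reported in this issue, written as the dict that
# pdfplumber's table-extraction methods take.
TABLE_SETTINGS = {
    "horizontal_strategy": "text",
    "vertical_strategy": "text",
    "min_words_vertical": 2,
    "keep_blank_chars": 2,  # copied as reported; documented as a boolean flag
}


def extract_table_with_settings(page, settings=TABLE_SETTINGS):
    """Run table extraction on one page object with the given settings."""
    return page.extract_table(table_settings=settings)


def extract_from_file(pdf_path, page_index=0):
    """Open a PDF (hypothetical path) and extract a table from one page."""
    # Imported here so the settings above can be inspected
    # without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return extract_table_with_settings(pdf.pages[page_index])
```

Calling `extract_from_file("med-preauth-drugs.pdf")` on a local copy of the document would be one way to reproduce the behaviour described here; the local file name is an assumption.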

However, the result with these settings looks something like this:

AAT DEFICIENCY

Products Affected: •Prolastin-c

•Aralast Np INJ 1000MG, 500MG,•Zemaira

800MG

•Glassia

PA Criteria: Criteria Details

Indications: All FDA-approved Indications.

Off-Label Uses: N/A

Exclusion: N/A

Criteria

Required: DOC OF HIGH-RISK PHENOTYPE (E.G. PIZZ,PIZ(NULL),

Medical: PI(NULL)(NULL), PLASMA AAT LEVEL BELOW 11 MICROMOL/L

Information: (CORRESPONDING TO 80M EQUAL TO 35% AND LESS TO COMPLY WITH PROTOCG/D THA OLL) FEV1 GREATER THAN OR N 80% OF PREDICTED ABILITY FOR ADMINISTRATION

It looks like pdfplumber is disregarding both the structure of the "Products Affected" section and the lines in the PA Criteria/Criteria Details section, and is returning the text one line at a time.
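One way to check this impression might be pdfplumber's visual debugging, which draws the rows and columns the table finder infers on top of a rendering of the page. The sketch below assumes a `page` object obtained from `pdfplumber.open(...)`, the settings dict described earlier, and an output file name chosen for illustration; the helper name `save_tablefinder_debug_image` is hypothetical.

```python
def save_tablefinder_debug_image(page, settings, out_path, resolution=150):
    """Render the page, overlay the detected table structure, and save it."""
    # Render the page to an image at the requested resolution.
    image = page.to_image(resolution=resolution)
    # Draw the lines, intersections, and cells found with these settings.
    image.debug_tablefinder(settings)
    # Write the annotated image to disk (out_path is a placeholder).
    image.save(out_path)
    return out_path
```

If the saved image shows one cell per text line rather than per table row, that would be consistent with the output shown above.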

Is there something I'm doing wrong?

Thanks