jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

extract_words sometimes extracts characters from multiple lines and forms them as words #400

Closed sreeni5493 closed 2 years ago

sreeni5493 commented 3 years ago

Artwork_cokeA.pdf

In this PDF, for the text block "50 CALORIES PER CAN", extract_words returns the following output: 50CALORIESPER as a single word, then PER as another word, and CAN as another word.

page = page.crop((0, 0, page.width, page.height))
words = page.extract_words(use_text_flow=True)

The issue is that if I add extra_attrs such as 'top' or 'bottom' to separate these words, vertical text (text rotated by 90 degrees) starts splitting.

The same issue occurs in the large paragraph: "not recommended for individuals under 18 years of age, pregnant or nursing women, or for those sensitive to caffeine. daily caffeine consumption should be limited to approximately 400 mg per day from all sources. this product has 80 mg per package. too much caffeine may cause nervousness, irritability, sleeplessness and occasionally, rapid heartbeat."

Here "caffeine.daily" is extracted as a single word, even though the two words are very far apart.

Is there a fix for this? Also, why do upright characters from visually different lines get combined into words?

Additionally, can the extract_words function return all the characters that make up each word?

In the merge_chars function in utils.py, the word dict would need a "chars" entry added:

word = {
    "text": "".join(map(itemgetter("text"), ordered_chars)),
    "x0": x0,
    "x1": x1,
    "top": top,
    "bottom": bottom,
    "upright": upright,
    "direction": direction,
    "chars": ordered_chars,
}

Getting the chars would be a huge help. When characters need to be highlighted, their parameters are useful. One such use case: with curved text, you may want to combine the characters into words and still display them cleanly at their locations in the PDF, which requires each character's location after the words have been extracted.
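The request above can be sketched outside the library. This is a minimal illustration, not pdfplumber's merge_chars: the helper name merge_chars_keep_members and the coordinates are hypothetical, and the "upright" and "direction" fields from the snippet above are left out for brevity.

```python
from operator import itemgetter


def merge_chars_keep_members(ordered_chars):
    """Hypothetical helper: merge ordered char dicts into one word dict
    that also carries the original chars under a "chars" key."""
    return {
        "text": "".join(map(itemgetter("text"), ordered_chars)),
        "x0": min(map(itemgetter("x0"), ordered_chars)),
        "x1": max(map(itemgetter("x1"), ordered_chars)),
        "top": min(map(itemgetter("top"), ordered_chars)),
        "bottom": max(map(itemgetter("bottom"), ordered_chars)),
        "chars": list(ordered_chars),
    }


# Made-up coordinates, for illustration only.
chars = [
    {"text": "C", "x0": 10.0, "x1": 17.0, "top": 20.0, "bottom": 30.0},
    {"text": "A", "x0": 17.5, "x1": 24.0, "top": 20.0, "bottom": 30.0},
    {"text": "N", "x0": 24.5, "x1": 31.0, "top": 20.0, "bottom": 30.0},
]
word = merge_chars_keep_members(chars)
print(word["text"], [c["x0"] for c in word["chars"]])
```

With the chars attached, each character's box stays available for highlighting after the word has been formed.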

samkit-jain commented 3 years ago

Hi @sreeni5493, if you draw the characters' bounding boxes, you'll notice that the word 50 actually stretches to the bottom and intersects with the text below. Because of this overlap, the text is extracted as 50CALORIESPER CAN and not 50 CALORIES PER CAN.

You can use the following code to visualise the bounding boxes:

im = page.to_image(resolution=200)
im.draw_rects(page.chars)
im.save("image.png", format="PNG")
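The overlap described above can also be checked numerically. The following is a small sketch that assumes pdfplumber-style "top" and "bottom" keys (with top smaller than bottom); the function name vertical_overlap and the box values are hypothetical and were not measured from the PDF in this issue.

```python
def vertical_overlap(a, b):
    """Return how many points two boxes overlap on the y-axis (0.0 if none).

    Assumes pdfplumber-style keys, where "top" is smaller than "bottom".
    """
    return max(0.0, min(a["bottom"], b["bottom"]) - max(a["top"], b["top"]))


# Hypothetical boxes: a tall digit whose box reaches into the line below.
big_digit = {"text": "0", "top": 100.0, "bottom": 160.0}
small_cap = {"text": "C", "top": 150.0, "bottom": 162.0}
far_below = {"text": "x", "top": 200.0, "bottom": 210.0}

print(vertical_overlap(big_digit, small_cap))  # 10.0
print(vertical_overlap(big_digit, far_below))  # 0.0
```

A non-zero result for characters that look like they sit on different lines is the situation in which they can end up merged into one word.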

To get the text separated, you can pass size (together with top, as in the example below) as extra attributes to the .extract_words(...) method.

Before:

" ".join([word["text"] for word in page.extract_words(use_text_flow=True)])
50CALORIESPER CAN 11.5 FL OZ (340 mL) energy beverage sparkling 2013-G106please recycle ingredients: carbonated reverse osmosis water, cane sugar, less than 0.5% of: citric acid, natural flavors, vitamin C (ascorbic acid), fruit and vegetable juice (color), green coffee bean extract, stevia leaf extract, magnesium lactate, vitamin B5 (calcium pantothenate), potassium phosphate, calcium lactate, vitamin B6 (pyridoxine hydrochloride), vitamin B12. not recommended for individuals under 18 years of age, pregnant or nursing women, or for those sensitive to caffeine.daily caffeine consumption should be limited to approximately 400 mg per day from all sources. this product has 80 mg per package. too much caffeine may cause nervousness, irritability, sleeplessness and occasionally, rapid heartbeat. caffeine content: 80 mg caffeine from green coffee bean extract/11.5 FL OZ made for glacéau, new york, ny 10016 1-877-GLACEAU © 2014 glaceau. glaceau, vitaminwater, and the label design design are trademarks of glaceau. Total Fat 0g 0% Total Carbohydrate 13g 4% Protein 0g Serving Size 1 Can Amount Per Serving Calories 50 % Daily Value* Sugars 13g Nutrition Facts *Percent Daily Values are based on a 2,000 calorie diet. Sodium 0mg 0% Vitamin C 40% (cid:127) Vitamin B5 60% Vitamin B6 60% (cid:127) Vitamin B12 60% Not a significant source of calories from fat, saturated fat, trans fat, cholesterol, dietary fiber, vitamin A, vitamin C, calcium, and iron. “man! i wish i had less energy.” said no one, ever. no added preservatives & no sodium see nutrition facts for more details natural sweeteners & natural flavors lightly carbonated *natural energy boost from green coffee bean extract excellent source of vitamins b5, b6, b12 & c strawberrylime flavored + other natural flavors natural energy boost* 11.5 FL OZ FPO7-86162-00435-2 Substrate 2014-0175NA2812AluminumNAVitaminWater Energy TBD FinishedArtDS0102/10/201440299StrawberryLime 12oz LtCarb Notes/Comments: CCATS Num Supplier Grid Prod. 
Grid SubstratePrinterPromo Name Linescreen SupplierProd Artist Cycle Status Prod Date Job Number File Name THIS PROOF IS FOR COPY, CONTENT AND LAYOUT ONLY. NOT TO BE USED FOR COLOR APPROVAL. www.finishedart.com 404.355.7902 PRODUCTIONP Black - 1299071 Green 375 - 1152898 Blue 305 - 1286768 Bright White - 1215947

After:

" ".join([word["text"] for word in page.extract_words(use_text_flow=True, extra_attrs=["size", "top"])])
50 CALORIES PER CAN 11.5 FL OZ ( 340 mL ) energy beverage sparkling 2013-G106 please recycle ingredients: carbonated reverse osmosis water, cane sugar, less than 0.5% of: citric acid, natural flavors, vitamin C (ascorbic acid), fruit and vegetable juice (color), green coffee bean extract, stevia leaf extract, magnesium lactate, vitamin B5 (calcium pantothenate), potassium phosphate, calcium lactate, vitamin B6 (pyridoxine hydrochloride), vitamin B12. not recommended for individuals under 18 years of age, pregnant or nursing women, or for those sensitive to caffeine. daily caffeine consumption should be limited to approximately 400 mg per day from all sources. this product has 80 mg per package. too much caffeine may cause nervousness, irritability, sleeplessness and occasionally, rapid heartbeat. caffeine content: 80 mg caffeine from green coffee bean extract/11.5 FL OZ m a d e f o r g l a c é a u , n e w y o r k , n y 1 0 0 1 6 1 - 8 7 7 - G L A C E A U © 2014 glaceau. glaceau, vitaminwater, and the label design design are trademarks of glaceau. Total Fat 0g 0 % Total Carbohydrate 13g 4 % Protein 0g Serving Size 1 Can Amount Per Serving Calories 50 % Daily Value* Sugars 13g Nutrition Facts *Percent Daily Values are based on a 2,000 calorie diet. Sodium 0mg 0 % Vitamin C 40% (cid:127) Vitamin B5 60% Vitamin B6 60% (cid:127) Vitamin B12 60% Not a significant source of calories from fat, saturated fat, trans fat, cholesterol, dietary fiber, vitamin A, vitamin C, calcium, and iron. “man! i wish i had less energy.” said no one, ever. no added preservatives & no sodium see nutrition facts for more details natural sweeteners & natural flavors lightly carbonated *natural energy boost from green coffee bean extract excellent source of vitamins b5, b6, b12 & c strawberry lime flavored + other natural flavors natural energy boost * 1 1 . 
5 F L O Z F P O 7 - 8 6 1 6 2 - 0 0 4 3 5 - 2 Substrate 2014-0175 NA 2812 Aluminum NA VitaminWater Energy TBD FinishedArt DS 01 02/10/2014 40299 StrawberryLime 12oz LtCarb Notes/Comments: CCATS Num Supplier Grid Prod. Grid Substrate Printer Promo Name Linescreen Supplier Prod Artist Cycle Status Prod Date Job Number File Name THIS PROOF IS FOR COPY, CONTENT AND LAYOUT ONLY. NOT TO BE USED FOR COLOR APPROVAL. www.finishedart.com 404.355.7902 P R O D U C T I O N P Black - 1299071 Green 375 - 1152898 Blue 305 - 1286768 Bright White - 1215947

Of course, this comes at the cost that the vertical text is now split character by character.

samkit-jain commented 3 years ago

With

" ".join([word["text"] for word in page.extract_words(use_text_flow=True, extra_attrs=["x1"])])

you'll get the text as

5 0 C A L O R I E S P E R C A N 1 1 . 5 F L O Z ( 3 4 0 m L ) e n e r g y b e v e r a g e s p a r k l i n g 2 0 1 3 - G 1 0 6 p l e a s e r e c y c l e i n g r e d i e n t s : c a r b o n a t e d r e v e r s e o s m o s i s w a t e r , c a n e s u g a r , l e s s t h a n 0 . 5 % o f : c i t r i c a c i d , n a t u r a l fl a v o r s , v i t a m i n C ( a s c o r b i c a c i d ) , f r u i t a n d v e g e t a b l e j u i c e ( c o l o r ) , g r e e n c o f f e e b e a n e x t r a c t , s t e v i a l e a f e x t r a c t , m a g n e s i u m l a c t a t e , v i t a m i n B 5 ( c a l c i u m p a n t o t h e n a t e ) , p o t a s s i u m p h o s p h a t e , c a l c i u m l a c t a t e , v i t a m i n B 6 ( p y r i d o x i n e h y d r o c h l o r i d e ) , v i t a m i n B 1 2 . n o t r e c o m m e n d e d f o r i n d i v i d u a l s u n d e r 1 8 y e a r s o f a g e , p r e g n a n t o r n u r s i n g w o m e n , o r f o r t h o s e s e n s i t i v e t o c a f f e i n e . d a i l y c a f f e i n e c o n s u m p t i o n s h o u l d b e l i m i t e d t o a p p r o x i m a t e l y 4 0 0 m g p e r d a y f r o m a l l s o u r c e s . t h i s p r o d u c t h a s 8 0 m g p e r p a c k a g e . t o o m u c h c a f f e i n e m a y c a u s e n e r v o u s n e s s , i r r i t a b i l i t y , s l e e p l e s s n e s s a n d o c c a s i o n a l l y , r a p i d h e a r t b e a t . c a f f e i n e c o n t e n t : 8 0 m g c a f f e i n e f r o m g r e e n c o f f e e b e a n e x t r a c t / 1 1 . 5 F L O Z made for glacéau, new york, ny 10016 1-877-GLACEAU © 2 0 1 4 g l a c e a u . g l a c e a u , v i t a m i n w a t e r , a n d t h e l a b e l d e s i g n d e s i g n a r e t r a d e m a r k s o f g l a c e a u . 
T o t a l F a t 0 g 0 % T o t a l C a r b o h y d r a t e 1 3 g 4 % P r o t e i n 0 g S e r v i n g S i z e 1 C a n A m o u n t P e r S e r v i n g C a l o r i e s 5 0 % D a i l y V a l u e * S u g a r s 1 3 g N u t r i t i o n F a c t s * P e r c e n t D a i l y V a l u e s a r e b a s e d o n a 2 , 0 0 0 c a l o r i e d i e t . S o d i u m 0 m g 0 % V i t a m i n C 4 0 % (cid:127) V i t a m i n B 5 6 0 % V i t a m i n B 6 6 0 % (cid:127) V i t a m i n B 1 2 6 0 % N o t a s i g n i fi c a n t s o u r c e o f c a l o r i e s f r o m f a t , s a t u r a t e d f a t , t r a n s f a t , c h o l e s t e r o l , d i e t a r y fi b e r , v i t a m i n A , v i t a m i n C , c a l c i u m , a n d i r o n . “ m a n ! i w i s h i h a d l e s s e n e r g y . ” s a i d n o o n e , e v e r . n o a d d e d p r e s e r v a t i v e s & n o s o d i u m s e e n u t r i t i o n f a c t s f o r m o r e d e t a i l s n a t u r a l s w e e t e n e r s & n a t u r a l fl a v o r s l i g h t l y c a r b o n a t e d * n a t u r a l e n e r g y b o o s t f r o m g r e e n c o f f e e b e a n e x t r a c t e x c e l l e n t s o u r c e o f v i t a m i n s b 5 , b 6 , b 1 2 & c s t r a w b e r r y l i m e fl a v o r e d + o t h e r n a t u r a l fl a v o r s n a t u r a l e n e r g y b o o s t * 11.5 FL OZ FPO 7-86162-00435-2 S u b s t r a t e 2 0 1 4 - 0 1 7 5 N A 2 8 1 2 A l u m i n u m N A V i t a m i n W a t e r E n e r g y T B D F i n i s h e d A r t D S 0 1 0 2 / 1 0 / 2 0 1 4 4 0 2 9 9 S t r a w b e r r y L i m e 1 2 o z L t C a r b N o t e s / C o m m e n t s : C C A T S N u m S u p p l i e r G r i d P r o d . G r i d S u b s t r a t e P r i n t e r P r o m o N a m e L i n e s c r e e n S u p p l i e r P r o d A r t i s t C y c l e S t a t u s P r o d D a t e J o b N u m b e r F i l e N a m e T H I S P R O O F I S F O R C O P Y , C O N T E N T A N D L A Y O U T O N L Y . N O T T O B E U S E D F O R C O L O R A P P R O V A L . w w w . fi n i s h e d a r t . c o m 4 0 4 . 3 5 5 . 
7 9 0 2 P R O D U C T I O N P B l a c k - 1 2 9 9 0 7 1 G r e e n 3 7 5 - 1 1 5 2 8 9 8 B l u e 3 0 5 - 1 2 8 6 7 6 8 B r i g h t W h i t e - 1 2 1 5 9 4 7

Now you have two strings: one in which the horizontal text is split into proper words but the vertical text is split by character, and one in which the vertical text is intact but the horizontal text is split by character. Combining them will yield the proper result.

A simpler alternative would be to pass a custom attribute in extra_attrs that uses the height as the size when the character is horizontal and the width when it is vertical. But I don't think that is possible at the moment.
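The custom attribute described above might look like the following sketch. It only computes the value on plain char dicts; the function name orientation_aware_size and the sample values are hypothetical, and it assumes each char carries "upright", "width" and "height" keys. How such a value would be fed into extract_words is not shown, since the thread says that is not possible at the moment.

```python
def orientation_aware_size(char):
    """Hypothetical attribute: height for upright chars, width otherwise."""
    return char["height"] if char["upright"] else char["width"]


# Made-up char dicts, for illustration only.
horizontal = {"text": "5", "upright": True, "width": 20.0, "height": 36.0}
vertical = {"text": "P", "upright": False, "width": 9.0, "height": 6.0}

print(orientation_aware_size(horizontal))  # 36.0
print(orientation_aware_size(vertical))    # 9.0
```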

sreeni5493 commented 3 years ago

Hi,

Isn't it a better algorithm to use the distance between the bottom of a character and the bottom of the subsequent character? For example:

In the above example, '0' (from 50) and 'C' (from CALORIES) have bottoms that are much farther apart than those of '0' (from 50) and '5' (from 50).

So for upright characters, we should ideally take the distance between the two "bottom" values and apply a threshold. For non-upright characters, we should compare "x0" with "x0" or "x1" with "x1".
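The suggestion above can be sketched for the upright case as follows. This is only an illustration of the proposed rule, not pdfplumber code; the function name split_on_bottom_jump, the threshold value and the coordinates are all hypothetical.

```python
def split_on_bottom_jump(chars, threshold):
    """Start a new group whenever the "bottom" of consecutive chars
    differs by more than `threshold` points."""
    groups = []
    for char in chars:
        if groups and abs(char["bottom"] - groups[-1][-1]["bottom"]) <= threshold:
            groups[-1].append(char)
        else:
            groups.append([char])
    return groups


# Made-up bottoms: a tall "50" followed by smaller text on another baseline.
chars = [
    {"text": "5", "bottom": 160.0},
    {"text": "0", "bottom": 160.0},
    {"text": "C", "bottom": 120.0},
    {"text": "A", "bottom": 120.0},
]
groups = split_on_bottom_jump(chars, threshold=3.0)
print(["".join(c["text"] for c in g) for g in groups])  # ['50', 'CA']
```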

jsvine commented 2 years ago

Going through this repo's issues, and following up on this one, though I realize it's a bit late.

> Isn't it a better algorithm to use distance between bottom of a character and bottom of subsequent character?

This is an interesting suggestion, but I think the current implementation fits most use cases better. One fairly common counter-example: text with subscripts or superscripts, where you do want those letters to be considered part of the same line.

Unfortunately, there will always be edge cases one way or another, but the distance between characters (rather than, say, the distance between bottoms) seems to be more consistent in PDFs. This is especially true on the x-axis, where the widths of characters within words can vary substantially (note, e.g., the difference in widths between . and W).

In this particular case (though also generally for vertically-overlapping characters), you can also pass a negative y_tolerance to tell pdfplumber to separate the overlapping letters:

words = page.extract_words(use_text_flow=True, y_tolerance=-5)
print(" ".join([word["text"] for word in words]))

... produces:

50 CALORIES PER CAN 11.5 FL OZ (340 mL) energy beverage sparkling 2013-G106 please recycle ingredients: carbonated reverse osmosis water, cane sugar, less than 0.5% of: citric acid, natural flavors, vitamin C (ascorbic acid), fruit and vegetable juice (color), green coffee bean extract, stevia leaf extract, magnesium lactate, vitamin B5 (calcium pantothenate), potassium phosphate, calcium lactate, vitamin B6 (pyridoxine hydrochloride), vitamin B12. not recommended for individuals under 18 years of age, pregnant or nursing women, or for those sensitive to caffeine. daily caffeine consumption should be limited to approximately 400 mg per day from all sources. this product has 80 mg per package. too much caffeine may cause nervousness, irritability, sleeplessness and occasionally, rapid heartbeat. caffeine content: 80 mg caffeine from green coffee bean extract/11.5 FL OZ made for glacéau, new york, ny 10016 1-877-GLACEAU © 2014 glaceau. glaceau, vitaminwater, and the label design design are trademarks of glaceau. Total Fat 0g 0% Total Carbohydrate 13g 4% Protein 0g Serving Size 1 Can Amount Per Serving Calories 50 % Daily Value* Sugars 13g Nutrition Facts *Percent Daily Values are based on a 2,000 calorie diet. Sodium 0mg 0% Vitamin C 40% (cid:127) Vitamin B5 60% Vitamin B6 60% (cid:127) Vitamin B12 60% Not a significant source of calories from fat, saturated fat, trans fat, cholesterol, dietary fiber, vitamin A, vitamin C, calcium, and iron. “man! i wish i had less energy.” said no one, ever. 
no added preservatives & no sodium see nutrition facts for more details natural sweeteners & natural flavors lightly carbonated *natural energy boost from green coffee bean extract excellent source of vitamins b5, b6, b12 & c strawberry lime flavored + other natural flavors natural energy boost* 11.5 FL OZ FPO7-86162-00435-2 Substrate 2014-0175 NA 2812 Aluminum NA VitaminWater Energy TBD FinishedArt DS 01 02/10/2014 40299 StrawberryLime 12oz LtCarb Notes/Comments: CCATS Num Supplier Grid Prod. Grid Substrate Printer Promo Name Linescreen Supplier Prod Artist Cycle Status Prod Date Job Number File Name THIS PROOF IS FOR COPY, CONTENT AND LAYOUT ONLY. NOT TO BE USED FOR COLOR APPROVAL. www.finishedart.com 404.355.7902 P R O D U C T I O N P Black - 1299071 Green 375 - 1152898 Blue 305 - 1286768 Bright White - 1215947
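To see why a negative y_tolerance can separate overlapping characters, here is a simplified sketch. It assumes a line-break test of the form "the next character's top lies more than y_tolerance below the current word's bottom"; pdfplumber's actual comparison may differ, and the function name starts_new_line and the numbers are hypothetical.

```python
def starts_new_line(word_bottom, next_top, y_tolerance):
    """Assumed rule: the next char starts a new line when its top lies
    more than `y_tolerance` points below the current word's bottom."""
    return next_top > word_bottom + y_tolerance


# Hypothetical numbers: the next char's top is 2 points above the
# current word's bottom, so the two boxes overlap by 2 points.
word_bottom, next_top = 160.0, 158.0

print(starts_new_line(word_bottom, next_top, y_tolerance=3))   # False
print(starts_new_line(word_bottom, next_top, y_tolerance=-5))  # True
```

Under this assumed rule, a positive tolerance treats the overlapping boxes as one line, while a tolerance of -5 starts a new line for any character that overlaps by less than 5 points.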