jennis0 / burdoc

Advanced PDF parsing for python
MIT License

Repeated outlines and also repeated text from above the outline. #7

Closed MrUnknown789556 closed 1 year ago

MrUnknown789556 commented 1 year ago

Every outline heading is systematically repeated once, along with part of the text that precedes it. This happens for all outline headings throughout the entire text.

Three examples from the same PDF are given here: one for the "Abstract" heading, one for the "Introduction" heading, and one for the "2.1. Materials" heading:

K. Rahmani aMechanical Engineering Department, Bu-Ali Sina University, Hamedan, Iran; bDepartment of Mechanical Engineering, Najafabad Branch a ABSTRACT ABSTRACT In this paper, fabrication and characterisation of Mg\u2013SiC nanocomposite are investigated.

In this paper, fabrication and characterisation of Mg 1. Introduction

  1. Introduction Pure magnesium (Mg) and its alloys are widely used in different applications in aerospace, automotive, sports, biomedical, and electronics industries.

Recently, Atrian et al. [ Dynamic powder compaction is employed in this work to fabricate the specimens. Nanocomposites with Mg powder as the matrix and SiC nano particles as the reinforcement are manufactured using a modified SHPB. Then, the role of reinforcing phase (SiC vol.-%) on density, compressive flow stress, micro-hardness, and microstructural evolution of the specimens are investigated. Dynamic powder compaction is employed in this 2. Experimental procedure

  1. Experimental procedure 2.1. Materials 2.1. Materials Commercial Mg powder (size range of 60\u2013150 \u03bcm, Commercial Mg powder (size range of 60 99.5% purity, irregular morphology, Merck, Germany)


jennis0 commented 1 year ago

Apologies, I've not been able to reproduce this. What command-line options are you using?

MrUnknown789556 commented 1 year ago

Maybe it is me who doesn't quite understand what the generated text (JSON format) should contain.

I converted the PDF file into a JSON file from the command line ("burdoc Nov.pdf"). Then I searched this JSON file for specific text strings. I found the same text string multiple times, even though it appears only once in the PDF file.

What I don't understand is (as examples):

1) "Abstract"

It appears 4 times in the JSON file, but only once in the PDF file.

In the JSON file, it is listed once under "block_text" and twice under "text" (besides under "toc", of course).

2) Outline "2.1. Materials"

In the JSON file, it is listed both under "block_text" and under "text" (besides under "toc", of course):

"block_text": "2.1. Materials", "items": [{"spans": [{"text": "2.1. Materials", "font":

I also converted the JSON file to plain text online: https://onlinejsontools.com/convert-json-to-text. In the converted text, the word "Abstract" also appears 4 times.

If I convert the PDF file to HTML, I see no duplicates at all. Maybe all text from the PDF file is meant to be duplicated in the JSON file, in both "block_text" and "text"?

Attachments: output-onlinejsontools.txt, Nov.zip, Nov.pdf

jennis0 commented 1 year ago

Ah yes. Essentially, the "block_text" field is simply a concatenation of the text of each line within that block. Personally, I found it useful for simplifying processing when distinguishing between lines in a block wasn't necessary, but maybe it's something I should rethink.
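As a rough illustration of why a naive search sees each heading more than once, the sketch below uses a hand-written block shaped like the fragment quoted earlier in this thread. Only the "block_text", "items", "spans" and "text" keys come from that fragment; the rest of the layout, the helper names, and the use of a single space as the join separator are assumptions, not burdoc's documented behaviour.

```python
# Hypothetical block, modelled on the fragment quoted in this thread.
block = {
    "block_text": "2.1. Materials",
    "items": [{"spans": [{"text": "2.1. Materials"}]}],
}


def span_texts(block):
    """Return the text of every span in every line ("item") of a block."""
    return [span["text"] for item in block["items"] for span in item["spans"]]


def count_occurrences(node, needle):
    """Count string values equal to `needle` anywhere in a parsed JSON tree."""
    if isinstance(node, str):
        return 1 if node == needle else 0
    if isinstance(node, dict):
        return sum(count_occurrences(value, needle) for value in node.values())
    if isinstance(node, list):
        return sum(count_occurrences(value, needle) for value in node)
    return 0


# The block-level field repeats what the spans already contain
# (joined with a space here, which is an assumption).
print(" ".join(span_texts(block)) == block["block_text"])

# A search over every string value therefore finds the heading twice.
print(count_occurrences(block, "2.1. Materials"))
```

A document-level table of contents would add a further copy of each heading on top of these two.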

MrUnknown789556 commented 1 year ago

If the text is meant to be doubled in the JSON file (in "block_text" and in "text"), maybe make it an option whether to duplicate it or not?

I look at a large text like this: it consists of 1) single lines (one line with an empty line above and below it) and 2) sections (multiple consecutive lines forming a single block of text with an empty line above and below the block). I think this is useful in many cases. Still, I think having the word "Abstract" appear four times should not be necessary. At least, it is very confusing when I have to look for an abstract programmatically in the JSON file.

Best regards,
Frank
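Until such an option exists, one possible workaround is to read each block only once when searching the JSON programmatically. The sketch below walks a parsed JSON tree, keeps the "block_text" value of each block, and skips both the per-line spans beneath it and the "toc" entry. The key names are taken from this thread; the overall nesting of burdoc's output, the "content" key in the sample, and the function name are assumptions, which is why the walk is generic rather than tied to a fixed path.

```python
def collect_block_text(node, skip_keys=("toc",)):
    """Collect every "block_text" string in a parsed JSON tree, once per block."""
    found = []
    if isinstance(node, dict):
        if "block_text" in node:
            # Take the block-level text and do not descend into "items",
            # so the same words are not collected again from the spans.
            found.append(node["block_text"])
            return found
        for key, value in node.items():
            if key not in skip_keys:
                found.extend(collect_block_text(value, skip_keys))
    elif isinstance(node, list):
        for value in node:
            found.extend(collect_block_text(value, skip_keys))
    return found


# Hypothetical document; in practice this would come from json.load()
# on the file produced by "burdoc Nov.pdf".
sample = {
    "toc": [{"text": "2.1. Materials"}],
    "content": [
        {
            "block_text": "2.1. Materials",
            "items": [{"spans": [{"text": "2.1. Materials"}]}],
        },
    ],
}

texts = collect_block_text(sample)
print(texts)
```

Because the walk stops at the first "block_text" it meets, any block nested inside another block would be missed; that trade-off keeps the sketch short.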