mgufrone / pdf-to-html

PDF to HTML PHP Class using Poppler-Utils
MIT License

HTML Output Class meaning #53

Open Axel-KIRK opened 5 years ago

Axel-KIRK commented 5 years ago

Hi, first of all, many thanks for this code, it works very well!

I have the HTML produced by parsing a PDF, and it contains many classes such as ft03, ft04, ft02, ft01 (which seem to be used for the content) and ft08, ft09, ... (which seem to be used for something else). But reading through the full HTML, I can't find any real logic. For example, from one page to the next, plain unstyled text is ft03 on one page, ft04 on the next page, and ft02 on the page after that.


I want to extract and sort the text content of each PDF according to its own hierarchy, which is why I want to analyse these classes.
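To make the goal concrete, here is a minimal sketch of one way such an analysis could look. It is not part of this library. It assumes the generated HTML contains CSS rules of the form `.ft03{font-size:11px;font-family:Times;...}` and that the text sits in `<p class="ftNN">` elements; both should be checked against the real output. The names `parse_font_classes`, `TextCollector` and `texts_with_style` are hypothetical helpers made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Assumed rule shape: .ft03{font-size:11px;font-family:Times;color:#000000;}
CSS_RULE = re.compile(r"\.(ft\d+)\s*\{([^}]*)\}")


def parse_font_classes(html):
    """Map each ftNN class name to a dict of its CSS declarations."""
    fonts = {}
    for name, body in CSS_RULE.findall(html):
        decls = {}
        for decl in body.split(";"):
            if ":" in decl:
                prop, value = decl.split(":", 1)
                decls[prop.strip()] = value.strip()
        fonts[name] = decls
    return fonts


class TextCollector(HTMLParser):
    """Collect (class, text) pairs from <p class="..."> elements."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # class of the <p> being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = dict(attrs).get("class")
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.items.append((self._current, text))
            self._current = None


def texts_with_style(html):
    """Return (font-size, font-family, text) for each classed paragraph."""
    fonts = parse_font_classes(html)
    collector = TextCollector()
    collector.feed(html)
    result = []
    for cls, text in collector.items:
        style = fonts.get(cls, {})
        result.append((style.get("font-size"), style.get("font-family"), text))
    return result
```

With the class names resolved to a font size and family, the text could then be grouped by size to approximate headings versus body text. If the same class name is defined differently on different pages, each page's HTML would need to be processed separately.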

Does anyone have any ideas? Thanks in advance,