Closed: tanchangsheng closed this 1 day ago
This is not a bug: the header identification algorithm determines the most frequent font size and treats it as the body text size. Everything smaller is also treated as body text. Of the font sizes larger than body text, at most the 6 largest are treated as headers h1 to h6. Any font size larger than body text but smaller than the font size of h6 is treated like h6. We all know that this algorithm is only an approximation of any document's true structure. Use your own logic if you cannot agree with this approach.
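For anyone who wants to write their own logic, the rule described above can be sketched roughly as follows. This is a minimal illustration and not the library's actual code: the function names `build_header_levels` and `header_level` are hypothetical, and it assumes the font sizes have already been extracted into a plain list with one entry per text span (counting per span rather than per character is also an assumption).

```python
from collections import Counter


def build_header_levels(font_sizes, max_levels=6):
    """Return (body_size, levels) for a list of font sizes.

    body_size is the most frequent font size; levels maps each of the
    largest `max_levels` sizes above body_size to a header level 1..max_levels.
    Hypothetical helper, not part of any library API.
    """
    counts = Counter(font_sizes)
    body_size = counts.most_common(1)[0][0]
    # Candidate header sizes: strictly larger than body text, largest first.
    larger = sorted((s for s in counts if s > body_size), reverse=True)
    levels = {size: lvl for lvl, size in enumerate(larger[:max_levels], start=1)}
    return body_size, levels


def header_level(size, body_size, levels, max_levels=6):
    """Return 0 for body text, otherwise a header level in 1..max_levels."""
    if size <= body_size:
        return 0  # body size and everything smaller is body text
    # Level = number of assigned header sizes larger than this one, plus one.
    # Sizes below the smallest assigned header size are capped at max_levels.
    rank = sum(1 for s in levels if s > size) + 1
    return min(rank, max_levels)


if __name__ == "__main__":
    sizes = [10] * 50 + [9] * 5 + [24, 18, 18, 14, 14, 14, 13, 12.5, 12, 11.5, 11]
    body, levels = build_header_levels(sizes)
    for s in (24, 18, 12, 11, 10, 9):
        print(s, "->", header_level(s, body, levels))
```

With a rule like this, any text set in a size larger than the most frequent one gets a header level, so a document whose dominant font size is not its real body size can end up with ordinary paragraphs tagged as headers.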
Thanks for the clarification!
Normal body text words have been parsed as headers.
file: example.pdf