jzillmann / pdf-to-markdown

A PDF to Markdown converter
https://pdf2md.morethan.io
MIT License
1.2k stars 195 forks

DetectTOC: only go for lines containing '...' words #11

Open LoneRifle opened 5 years ago

LoneRifle commented 5 years ago

DetectTOC currently runs on every line, stripping trailing numbers from the last word of the line as long as that word is not made up entirely of full stops. This implies that a TOC line is one that contains a word consisting only of full stops, so DetectTOC should only operate on such lines.

This change removes the unwanted behaviour where DetectTOC strips trailing numbers that we actually want to keep, e.g.:

Case Number : ABC 12/1234

This PR is a backport of opendocsg/pdf2md#33
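
The guard described above can be illustrated with a minimal sketch. This is not the actual DetectTOC code: `isDotLeader` and `parseTocLine` are hypothetical names, and the sketch assumes a line is available as an array of word strings. It only shows the idea that a line is treated as a TOC entry when it contains a word made up solely of full stops.

```javascript
// Hypothetical sketch of the proposed guard; NOT the real DetectTOC code.
// Assumption: each line is given as an array of word strings.

// True when a word consists only of full stops, e.g. "....." (a dot leader).
function isDotLeader(word) {
  return /^\.+$/.test(word);
}

// Returns { title, pageNumber } for a TOC-like line, or null otherwise.
function parseTocLine(words) {
  // Proposed guard: lines without a dot-leader word are left untouched.
  if (!words.some(isDotLeader)) {
    return null;
  }

  // Ignore the dot leaders and look at the last remaining word.
  const rest = words.filter((word) => !isDotLeader(word));
  if (rest.length === 0) {
    return null;
  }

  // Split trailing digits (the page number) off the last word.
  const match = /^(.*?)(\d+)$/.exec(rest[rest.length - 1]);
  if (!match) {
    return null;
  }
  const [, prefix, digits] = match;

  // Keep whatever preceded the digits as part of the title.
  const titleWords = rest.slice(0, -1);
  if (prefix) {
    titleWords.push(prefix);
  }
  return { title: titleWords.join(" "), pageNumber: Number(digits) };
}

// A TOC-like line is parsed; the case-number line is not touched.
console.log(parseTocLine(["Introduction", ".....", "5"]));
console.log(parseTocLine(["Case", "Number", ":", "ABC", "12/1234"]));
```

With this guard, the line `Case Number : ABC 12/1234` contains no dot-leader word, so it is skipped and its trailing number is preserved.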