jzillmann / pdf-to-markdown

A PDF to Markdown converter
https://pdf2md.morethan.io
MIT License
1.2k stars, 196 forks

When a PDF breaks to a new line, Markdown will also break to a new line. #79

Closed. Ryokki closed this issue 3 weeks ago

Ryokki commented 3 weeks ago

When a PDF breaks to a new line, Markdown will also break to a new line. Is there any configuration to change this behavior? I want the text to be continuous.

darkcheftar commented 3 weeks ago

Markdown is plain text, so removing the unwanted line breaks in a post-processing step should not be a problem.
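
As a rough illustration of that idea, here is a minimal sketch that treats blank lines as paragraph boundaries and joins everything in between. The function name `join_wrapped_lines` is made up for this example and is not part of pdf-to-markdown. It also makes no attempt to preserve lists, headings, or code blocks, so it only suits plain running text.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Collapse the line breaks inside each blank-line-separated paragraph."""
    # One or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r'\n\s*\n', text)
    # str.split() with no arguments splits on any run of whitespace, so
    # re-joining with single spaces also removes the embedded newlines.
    return '\n\n'.join(' '.join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    print(join_wrapped_lines("This sentence was\nwrapped by the PDF.\n\nNext paragraph."))
```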

Ryokki commented 3 weeks ago

Thank you so much for providing such a great tool! @darkcheftar, you are right, it's not a tricky problem. Still, I would greatly appreciate it if this feature could be built in. Here is a small script that does it:

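import re
import sys

def fix_markdown_linebreaks(input_file, output_file):
    """Join hard-wrapped lines in a Markdown file and write the result."""
    with open(input_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Replace a single newline with a space unless the next line is blank,
    # indented, or starts with '#' (heading) or '-' (list item or rule).
    # Limitation: text directly after a heading line is joined onto the heading.
    fixed_content = re.sub(r'([^\n])\n(?![\n\s#-])', r'\1 ', content)

    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(fixed_content)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        # Report incorrect invocation on stderr and exit with a non-zero status.
        print("usage: python script.py <input_file> <output_file>", file=sys.stderr)
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]
    fix_markdown_linebreaks(input_file, output_file)
    print('done!')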
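
One caveat with the single regex above: a line of text that directly follows a heading is merged into the heading, and lines inside fenced code blocks are joined as well. A possible line-by-line alternative is sketched below. The name `unwrap_paragraphs` and the two patterns are assumptions made for this example, not anything provided by pdf-to-markdown, and the sketch deliberately ignores tables, block quotes, and indented code blocks.

```python
import re

# ATX headings ("# Title") and rules ("---", "***", "===") stay on their own line.
HEADING = re.compile(r'\s*#{1,6}(\s|$)')
RULE = re.compile(r'\s*([-*_=])\1{2,}\s*$')
# Bullet ("- x", "* x", "+ x") or numbered ("1. x", "1) x") list items.
LIST_ITEM = re.compile(r'\s*([-*+]|\d+[.)])\s')
FENCE = re.compile(r'\s*(```|~~~)')


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped paragraph lines while keeping block structure."""
    out = []            # finished output lines
    in_fence = False    # True while inside a fenced code block
    can_append = False  # True when the last output line is open running text
    for line in text.split('\n'):
        if FENCE.match(line):
            in_fence = not in_fence
            out.append(line)
            can_append = False
        elif (in_fence or not line.strip()
              or HEADING.match(line) or RULE.match(line)):
            # Code, blank lines, headings, and rules are copied verbatim.
            out.append(line)
            can_append = False
        elif LIST_ITEM.match(line):
            # A list item starts a new line but may have wrapped continuations.
            out.append(line)
            can_append = True
        elif can_append:
            # Continuation of the previous paragraph or list item.
            out[-1] = out[-1].rstrip() + ' ' + line.strip()
        else:
            out.append(line)
            can_append = True
    return '\n'.join(out)
```

In a script like the one above, `unwrap_paragraphs(content)` could stand in for the `re.sub` call; the file reading and writing would stay the same.
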

darkcheftar commented 3 weeks ago

Hey @Ryokki, the thanks should go to @jzillmann; he is the actual owner of the tool. I just love to help, and hopefully I was helpful.