Enhanced handling of line breaks in pdf

I use the following function to clean up line breaks in PDF text after it has been converted to Markdown. I hope it can be considered for the next release, thanks!

import re

def remove_pdf_newlines(text):
    # Convert Windows-style newlines to Unix style
    text = text.replace('\r\n', '\n')
    # Replace a line break with a space when the previous line does not end
    # with '.', '!' or '?' and the next line starts with a letter. '\n' is
    # excluded in the lookbehind so blank-line paragraph breaks are not merged.
    text = re.sub(r'(?<![.!?\n])\n(?=[a-zA-Z])', ' ', text)
    # Collapse runs of blank lines to a single blank line between paragraphs
    text = re.sub(r'\n\s*\n', '\n\n', text)
    # Remove trailing whitespace characters from lines
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text.strip()
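
To show the intended behavior, here is a minimal, self-contained sketch that runs the function on a small sample. The sample string is made up for illustration, and the function is repeated so the snippet runs on its own. It assumes `\n` is excluded in the lookbehind, so that the blank line before a paragraph is not merged away.

```python
import re

def remove_pdf_newlines(text):
    text = text.replace('\r\n', '\n')
    # Assumption: '\n' in the lookbehind keeps blank-line paragraph breaks
    text = re.sub(r'(?<![.!?\n])\n(?=[a-zA-Z])', ' ', text)
    text = re.sub(r'\n\s*\n', '\n\n', text)
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text.strip()

# Hypothetical sample: hard-wrapped lines, one Windows newline,
# trailing spaces, and an over-long gap between two paragraphs
sample = (
    "PDF text is often\r\nwrapped in the middle\nof a sentence.  \n\n\n"
    "A new paragraph\nstarts here."
)

cleaned = remove_pdf_newlines(sample)
print(cleaned)
```

The wrapped lines inside each paragraph are joined with spaces, while the two paragraphs stay separated by a single blank line. Lines that start with a digit, a bullet, or a Markdown marker such as `#` are left alone, because the lookahead only matches ASCII letters.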

pymupdf / RAG