pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Enhanced handling of line breaks in pdf #127

Open zkn365 opened 3 weeks ago

zkn365 commented 3 weeks ago

I use the following function to improve the handling of line breaks after a PDF has been converted to Markdown. I hope it can be considered for the next release, thanks!

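import re

def remove_pdf_newlines(text):
    # Convert Windows-style newlines to Unix style
    text = text.replace('\r\n', '\n')
    # Merge a line with the next one when it does not end with a period,
    # question mark, or exclamation point and the next line starts with a letter.
    # '\n' is also excluded in the lookbehind so that the second newline of a
    # blank line (a paragraph break) is not turned into a space.
    text = re.sub(r'(?<![.!?\n])\n(?=[a-zA-Z])', ' ', text)
    # Collapse any run of blank lines into a single paragraph break
    text = re.sub(r'\n\s*\n', '\n\n', text)
    # Remove trailing spaces and tabs from each line
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text.strip()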
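
To make the intended behavior concrete, here is a minimal, self-contained sketch of the same idea run on a small sample. The name `merge_soft_line_breaks` and the sample string are made up for illustration; this is not part of PyMuPDF or pymupdf4llm, and it assumes that a newline following another newline marks a paragraph break that should be kept.

```python
import re

def merge_soft_line_breaks(text: str) -> str:
    """Hypothetical variant of remove_pdf_newlines, for illustration only."""
    # Normalize Windows-style newlines.
    text = text.replace("\r\n", "\n")
    # Join a line with the next one unless the break follows sentence-ending
    # punctuation or another newline (i.e. a blank line between paragraphs).
    text = re.sub(r"(?<![.!?\n])\n(?=[a-zA-Z])", " ", text)
    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n\s*\n", "\n\n", text)
    # Strip trailing spaces and tabs from each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text.strip()

# Made-up sample: two wrapped paragraphs separated by extra blank lines.
sample = (
    "This sentence was wrapped\r\nby the PDF layout.\r\n\r\n\r\n"
    "A second paragraph\r\nfollows here.  "
)
cleaned = merge_soft_line_breaks(sample)
print(cleaned)
```

One limitation worth noting: a Markdown heading such as `# Title` followed directly by a line of body text would be merged into the heading, because the heading does not end with punctuation. Whether that matters probably depends on how the converter separates headings from body text.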