The `Docx` parser doesn't properly parse sentences that are separated by a newline (`\n`)

nleroy917 / textractor

A simple text extractor for various files. Includes core functionality for extracting text from files, a command-line interface, restful API, and python bindings.


The `Docx` parser doesn't properly parse sentences that are separated by a newline (`\n`) #3

Closed nleroy917 closed 4 weeks ago

nleroy917 commented 1 month ago

I think the .docx parser isn't correctly extracting text when it's separated by a newline. For example, text on two separate lines currently comes out joined, with nothing in between:

`This is some text hereHere is some new text`

Ideally, the two lines would be separated by a space (or the newline would be preserved).

nleroy917 commented 1 month ago

One idea is to push a blank space after each `docx_rs::RunChild::Text` the code encounters:

match child {
    docx_rs::RunChild::Text(text) => {
        document_text.push_str(&text.text);
        document_text.push(' '); // separate this text from whatever follows
    },
    // Skip non-text children; `todo!()` here would panic on the first one.
    _ => {},
}
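To try the idea outside the parser, here is a minimal sketch that does not depend on `docx_rs`. The `RunChild` enum below is a hypothetical stand-in with only two variants, and `collect_text` is an illustrative name, not a function from textractor. It also trims the separator left after the last text child, which the snippet above does not do.

```rust
// Hypothetical stand-in for `docx_rs::RunChild`, reduced to two variants so
// the sketch compiles without the crate.
enum RunChild {
    Text(String),
    Tab,
}

/// Collects the text of each run child, appending `sep` after every text child.
fn collect_text(children: &[RunChild], sep: char) -> String {
    let mut document_text = String::new();
    for child in children {
        match child {
            RunChild::Text(text) => {
                document_text.push_str(text);
                document_text.push(sep);
            }
            // Skip anything that is not text rather than panicking.
            _ => {}
        }
    }
    // Drop the separator left after the final text child.
    if document_text.ends_with(sep) {
        document_text.pop();
    }
    document_text
}
```

Calling `collect_text` with `' '` as the separator joins the two example sentences with a single space; passing `'\n'` instead keeps them on separate lines.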

nleroy917 commented 1 month ago

Seems to help:

I wonder if it makes sense to push a newline instead of a space?

nleroy917 commented 1 month ago

I'll revert to a `\n` to try to retain the formatting of the original document.
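A possible refinement, sketched under the assumption that each line in the source document is its own paragraph: emit the `\n` once per paragraph rather than once per text run, since a single paragraph can be split into several runs (for example where the formatting changes). The `join_paragraphs` function and the `Vec<String>`-per-paragraph shape are hypothetical simplifications, not the `docx_rs` types.

```rust
/// Joins paragraphs with '\n'. Each inner `Vec` holds the text runs of one
/// paragraph (a hypothetical, simplified view of the parsed document); runs
/// inside a paragraph are concatenated without any separator.
fn join_paragraphs(paragraphs: &[Vec<String>]) -> String {
    paragraphs
        .iter()
        .map(|runs| runs.concat())
        .collect::<Vec<String>>()
        .join("\n")
}
```

With this shape, a paragraph split into the runs `"This is "` and `"some text here"` stays on one line, and the next paragraph starts on a new one.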