Closed zymbuzz closed 3 months ago
Thank you for your interest and appreciation! No issue is irrelevant 😉.
A few words on the specifics for some of the many supported document types:
There are documents with a fixed page layout like PDF or XPS. And then there are others with a variable layout which we call "reflowable".
Examples of reflowable types are e-books (EPUB, MOBI), but also HTML, TEXT and Office documents. For these types, PyMuPDF has to start with default assumptions about the page size (400 x 600) because the files contain no recognizable page breaks or page widths.
However, PyMuPDF can change these default assumptions after the document has been opened, via its method Document.layout().
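As a rough illustration of the re-layout step, here is a minimal sketch. The helper name relayout and the chosen width, height and font size are made up for this example. It only relies on Document.is_reflowable, Document.page_count and Document.layout(), and it assumes that a taller page results in fewer pages, which is worth checking against your own file.

```python
def relayout(doc, width=600, height=2000, fontsize=11):
    """Re-run the layout of a reflowable document.

    `doc` is expected to behave like a PyMuPDF Document, i.e. to offer
    `is_reflowable`, `page_count` and `layout()`.
    Returns the page counts before and after as a tuple.
    """
    before = doc.page_count
    if doc.is_reflowable:
        # Fixed-layout formats such as PDF or XPS are left untouched.
        doc.layout(width=width, height=height, fontsize=fontsize)
    return before, doc.page_count


def demo(path):
    # Imported lazily so the helper above can be used without PyMuPDF installed.
    import pymupdf  # older releases use `import fitz` instead

    doc = pymupdf.open(path)  # e.g. a .txt or .epub file
    print(relayout(doc))
```

Calling demo("myselectdoc.txt") would print the two page counts. Fewer pages should mean fewer page separators in the Markdown output, although that effect is not verified here.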
We are planning to offer additional arguments for the method to_markdown, like page_width, together with the option to assume that the whole reflowable document consists of just one page with that width. Maybe page_height=infinite, with some suitable expression.
With this, you would never again see those "-----" page separators.
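For versions without such an option, one possible stopgap is to remove the separators from the generated Markdown afterwards. The sketch below assumes that a page separator is a line consisting of exactly five hyphens. The function name strip_page_separators is hypothetical and not part of any library. Note that it would also remove a genuine five-hyphen horizontal rule from the text.

```python
import re

# Assumption: a page separator is a line made of exactly five hyphens,
# optionally followed by trailing blanks and one or more newlines.
PAGE_SEPARATOR = re.compile(r"^-----[ \t]*\n+", flags=re.MULTILINE)


def strip_page_separators(markdown: str) -> str:
    """Return the Markdown text with all page separator lines removed."""
    return PAGE_SEPARATOR.sub("", markdown)
```

The result of to_markdown can be passed straight into this function. Longer hyphen runs and hyphens inside a line of text are left alone.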
Solved in version 0.0.6.
Hello, and thanks a lot for this encouraging project.
I want to bring to your attention an issue that arises when importing text files. myselectdoc.txt
When applying the to_markdown function, the text is chunked into pieces with the separator "-----". Is it possible to disable chunking?
When I specify the input margins=[0, 0, 0, 0], some text is no longer lost, but the chunking still happens.
The issue may be irrelevant to your project goal, but it might be interesting because you allow for non-pdf files.