Closed by Willianwg 11 months ago
Same issue here. Has anyone solved it?
This is a very common problem when parsing PDF documents. In a PDF, a single sentence is often split into many separate text items, so simply joining the items does not reliably reproduce the original text.
To work around it, create a custom document loader based on PDFLoader and change the following part.
from:
const text = content.items.map(item => (item as TextItem).str).join('\n')
to:
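// y-coordinate of the previous text item; undefined until the first item is seen
let lastY = undefined
const textItems = []
for (const item of content.items) {
  // skip marked-content items, which have no `str`
  if ('str' in item) {
    // transform[5] is the item's y-coordinate; same y means same line
    if (lastY === undefined || lastY === item.transform[5]) {
      textItems.push(item.str)
    } else {
      textItems.push(`\n${item.str}`)
    }
    lastY = item.transform[5]
  }
}
const text = textItems.join('')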
The method above mimics the original layout of the text by adding a newline character each time the y-coordinate value changes. Although this method isn't perfect, it can provide fairly appropriate results in general cases.
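The same idea can be sketched as a standalone function, which makes it easier to try different separators. This is only an illustration: `joinByLine`, `TextLike`, and `MarkedLike` are hypothetical names, not part of PDFLoader or pdf.js, and the interfaces model only the fields the snippet above relies on (`str` and `transform[5]`). Unlike the snippet above, the separator here is applied only between items on the same line.

```typescript
// Assumed minimal shapes; real pdf.js items carry more fields.
interface TextLike {
  str: string;
  transform: number[]; // transform[5] is treated as the y-coordinate
}
interface MarkedLike {
  type: string; // marked-content items have no `str`
}
type Item = TextLike | MarkedLike;

// Hypothetical helper: groups items into lines by y-coordinate, joins the
// items of each line with `separator`, and joins the lines with "\n".
function joinByLine(items: Item[], separator = ""): string {
  const lines: string[] = [];
  let current: string[] = [];
  let lastY: number | undefined;
  for (const item of items) {
    if (!("str" in item)) continue; // skip marked-content items
    const y = item.transform[5];
    // A changed y-coordinate starts a new line.
    if (lastY !== undefined && y !== lastY) {
      lines.push(current.join(separator));
      current = [];
    }
    current.push(item.str);
    lastY = y;
  }
  lines.push(current.join(separator));
  return lines.join("\n");
}

// Made-up sample: "per" and "son" share a line, "next" is on the line below.
const sample: Item[] = [
  { str: "per", transform: [1, 0, 0, 1, 10, 700] },
  { str: "son", transform: [1, 0, 0, 1, 25, 700] },
  { type: "beginMarkedContent" },
  { str: "next", transform: [1, 0, 0, 1, 10, 680] },
];
console.log(joinByLine(sample));      // items on a line joined with ""
console.log(joinByLine(sample, " ")); // items on a line joined with " "
```

Comparing y-coordinates for exact equality assumes that items on one line report identical values; a PDF with slightly shifted baselines would need a tolerance.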
before:
after:
This is awesome - .join(' ') seems better for a PDF I tried, and an extra space is probably better than no extra space. Sorry for losing track of this one.
Sorry, I don't get it. In this problem the '\n' is returned in the middle of words. For example, the word "person" would be output as "per\n" + "son\n" or something like that.
I had a PDF that returned no spaces between words with .join("") - see the added test.
Not sure which is more common in the wild, or if there's a way to get it working better for both.
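One possible direction for handling both kinds of PDF, sketched here as an untested idea rather than anything PDFLoader does, is to decide between "" and " " from the horizontal gap between neighbouring items on the same line. The sketch assumes each item exposes `str`, a `transform` whose entries 4 and 5 are the x- and y-coordinates, and a `width` in the same units. `joinWithGaps`, `PositionedText`, and the default `gapThreshold` are hypothetical and would need tuning against real documents.

```typescript
// Assumed minimal shape of a positioned text item.
interface PositionedText {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]
  width: number;       // assumed to be in the same units as x
}

// Hypothetical heuristic: newline when y changes, a space when the gap
// to the previous item on the same line exceeds `gapThreshold`.
function joinWithGaps(items: PositionedText[], gapThreshold = 1): string {
  let text = "";
  let prev: PositionedText | undefined;
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    if (prev !== undefined) {
      const prevEnd = prev.transform[4] + prev.width;
      if (prev.transform[5] !== y) {
        text += "\n"; // different line
      } else if (x - prevEnd > gapThreshold) {
        text += " "; // visible gap on the same line
      }
    }
    text += item.str;
    prev = item;
  }
  return text;
}

// Made-up sample: "per" and "son" touch, "walks" follows after a gap,
// and "away" sits on the next line.
const gapSample: PositionedText[] = [
  { str: "per", transform: [1, 0, 0, 1, 10, 700], width: 15 },
  { str: "son", transform: [1, 0, 0, 1, 25, 700], width: 15 },
  { str: "walks", transform: [1, 0, 0, 1, 45, 700], width: 25 },
  { str: "away", transform: [1, 0, 0, 1, 10, 680], width: 20 },
];
console.log(joinWithGaps(gapSample));
```

A fixed threshold is the weakest part of this sketch, since a reasonable gap depends on font size; scaling it by the item height may work better, but that is a guess.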
When I try to load a large PDF using PDFLoader, the documents are returned like this:
If I run only pdf-parse, it returns:
It looks like pdf-parse returns the whole content with no spaces between the words, and the loader creates the documents by adding these '\n' characters. Any idea how to solve this?