langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

PDF loader returning content including '\n' between words #1703

Closed Willianwg closed 11 months ago

Willianwg commented 1 year ago

When I try to load a large PDF using `PDFLoader`, the documents come back like this:

```js
Document {
  pageContent: 'CURSO\n' +
    'CI\n' +
    'Ê\n' +
    'NCIAS\n' +
    'BIOL\n' +
    'Ó\n' +
    'GICAS\n' +
    '(Licenciatura)\n' +
    'DURA\n' +
    'ÇÃ\n' +
    'O\n' +
    'Dura\n' +
    'çã\n' +
    'o:\n' +
    'M\n' +
    'í\n' +
    'nima\n' +
    'de\n' +
    '03\n' +
    'anos\n' +
    ...,
    'e\n' +
    'Curso:\n' +
    'Portaria\n' +
    'MEC\n' +
    'n.\n' +
    'º\n' +
    '314\n' +
    'de\n' +
    '02/08/2011.\n' +
    'PERFIL',
  metadata: [Object]
},
```

If I run `pdf-parse` on its own, it returns:

CURSOCIÊNCIASBIOLÓGICAS(Licenciatura)
DURAÇÃODuração:Mínimade03anoseMáximade06anos.
SITUAÇÃO
Criação:Resoluçãon.º025/91-CUNI-UFRR.
RenovaçãodeReconhecimentodeCurso:PortariaMECn.º286de21/12/2012.
PERFIL
PROFISSIONAL
OobjetivodoCursodeLicenciaturaemCiênciasBiológicaségarantiraofuturoLicenciadoumaformação
profissionalsólidaeampla,baseadanumaintegraçãodasdiversasáreasdaBiologia,comascompetências,
habilidadeseposturasquepermitamaolicenciadoaquiformadoplenaatuaçãonoensino,alémdepesquisa
eextensãoemtodasasáreasdaBiologia.Aduraçãomínimadocursoserádeoitosemestrescomumtotal
de3.500horas.
CURSOCIÊNCIASCONTÁBEIS(Bacharelado)
DURAÇÃODuração:Mínimade04anoseMáximade06anos.
SITUAÇÃO
Criação:Resoluçãon.º025/91-CUNI-UFRR.
RenovaçãodeReconhecimentodeCurso:PortariaMECn.º706de18/12/2013.
PERFIL

It looks like `pdf-parse` returns the whole content with no space between the words, and the loader then joins the items with `'\n'`, which splits words apart. Any idea how to solve this?

ElodieComte commented 1 year ago

Same issue here. Has anyone solved it?

NoUnique commented 1 year ago

This is a very common problem when parsing PDF documents. In a PDF, each line of text is split into many small items, which makes it hard to avoid these artifacts by simply joining the items.

Modify the following part of `PDFLoader` to create and use a custom document loader.

from:

```ts
const text = content.items.map((item) => (item as TextItem).str).join("\n");
```

to:

```ts
let lastY;
const textItems = [];
for (const item of content.items) {
  if ("str" in item) {
    // Start a new line whenever the item's y-coordinate changes.
    // Compare against undefined rather than using `!lastY`, which
    // would misfire when the y-coordinate is legitimately 0.
    if (lastY === undefined || lastY === item.transform[5]) {
      textItems.push(item.str);
    } else {
      textItems.push(`\n${item.str}`);
    }
    lastY = item.transform[5];
  }
}
const text = textItems.join("");
```

The method above mimics the original layout of the text by inserting a newline each time the y-coordinate changes. It isn't perfect, but it gives fairly good results in common cases.
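For anyone who wants to try the heuristic in isolation, here is the same logic as a standalone sketch run against mock pdfjs-style items (`joinByLine` and the mock data are illustrative names, not part of the loader; in the real loader the items come from `page.getTextContent()`):

```javascript
// Join pdfjs-style text items, inserting a newline whenever the
// y-coordinate (transform[5]) changes, i.e. when a new visual line starts.
function joinByLine(items) {
  let lastY;
  const parts = [];
  for (const item of items) {
    if ("str" in item) {
      if (lastY === undefined || lastY === item.transform[5]) {
        parts.push(item.str); // same baseline (or first item)
      } else {
        parts.push(`\n${item.str}`); // baseline changed: new line
      }
      lastY = item.transform[5];
    }
  }
  return parts.join("");
}

// Mock items: transform[5] is the y position on the page.
const mockItems = [
  { str: "Hello ", transform: [1, 0, 0, 1, 50, 700] },
  { str: "world", transform: [1, 0, 0, 1, 120, 700] },
  { str: "Next line", transform: [1, 0, 0, 1, 50, 680] },
];

console.log(joinByLine(mockItems)); // "Hello world\nNext line"
```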

before:

```js Document { pageContent: 'Structured Denoising Diffusion Models in Discrete\n' + 'State-Spaces\n' + 'Jacob Austin\n' + '∗\n' + ', Daniel D. Johnson\n' + '∗\n' + ', Jonathan Ho, Daniel Tarlow & Rianne van den Berg\n' + '†\n' + 'Google Research, Brain Team\n' + '{jaaustin,ddjohnson,jonathanho,dtarlow,riannevdberg}@google.com\n' + 'Abstract\n' + 'Denoising diffusion probabilistic models (DDPMs) [\n' + '19\n' + '] have shown impressive\n' + 'results on image and waveform generation in continuous state spaces. Here, we\n' + 'introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-\n' + 'like generative models for discrete data that generalize the multinomial diffusion\n' + 'model of Hoogeboom et al.\n' + '[20]\n' + ', by going beyond corruption processes with uni-\n' + 'form transition probabilities. This includes corruption with transition matrices that\n' + 'mimic Gaussian kernels in continuous space, matrices based on nearest neighbors\n' + 'in embedding space, and matrices that introduce absorbing states. The third al-\n' + 'lows us to draw a connection between diffusion models and autoregressive and\n' + 'mask-based generative models. We show that the choice of transition matrix is an\n' + 'important design decision that leads to improved results in image and text domains.\n' + 'We also introduce a new loss function that combines the variational lower bound\n' + 'with an auxiliary cross entropy loss. For text, this model class achieves strong\n' + 'results on character-level text generation while scaling to large vocabularies on\n' + 'LM1B. 
On the image dataset CIFAR-10, our models approach the sample quality\n' + 'and exceed the log-likelihood of the continuous-space DDPM model.\n' + '1 Introduction\n' + 'Generative modeling is a core problem in machine learning, useful both for benchmarking our ability\n' + 'to capture statistics of natural datasets and for downstream applications that require generating\n' + 'high-dimensional data like images, text, and speech waveforms. There has been a great deal of\n' + 'progress with the development of methods like GANs [\n' + '15\n' + ',\n' + '4\n' + '], VAEs [\n' + '25\n' + ',\n' + '35\n' + '], large autoregressive\n' + 'neural network models [\n' + '51\n' + ',\n' + '50\n' + ',\n' + '52\n' + '], normalizing flows [\n' + '34\n' + ',\n' + '12\n' + ',\n' + '24\n' + ',\n' + '32\n' + '], and others, each with their\n' + 'own tradeoffs in terms of sample quality, sampling speed, log-likelihoods, and training stability.\n' + 'Recently, diffusion models [\n' + '43\n' + '] have emerged as a compelling alternative for image [\n' + '19\n' + ',\n' + '46\n' + '] and au-\n' + 'dio [\n' + '7\n' + ',\n' + '26\n' + '] generation, achieving comparable sample quality to GANs and log-likelihoods comparable\n' + 'to autoregressive models with fewer inference steps. A diffusion model is a parameterized Markov\n' + 'chain trained to reverse a predefined forward process, which is a stochastic process constructed to\n' + 'gradually corrupt training data into pure noise. Diffusion models are trained using a stable objective\n' + 'closely related to both maximum likelihood and score matching [\n' + '21\n' + ',\n' + '53\n' + '], and they admit faster\n' + 'sampling than autoregressive models by using parallel iterative refinement [30, 45, 47, 44].\n' + 'Although diffusion models have been proposed in both discrete and continuous state spaces [\n' + '43\n' + '],\n' + 'most recent work has focused on Gaussian diffusion processes that operate in continuous state spaces\n' + '(e.g. 
for real-valued image and waveform data). Diffusion models with discrete state spaces have\n' + 'been explored for text and image segmentation domains [\n' + '20\n' + '], but they have not yet been demonstrated\n' + 'as a competitive model class for large scale text or image generation.\n' + '35th Conference on Neural Information Processing Systems (NeurIPS 2021).\n' + '∗\n' + 'Equal contributions\n' + '†\n' + 'Now at Microsoft Research\n' + 'arXiv:2107.03006v3 [cs.LG] 22 Feb 2023', metadata: { source: 'blob', blobType: 'application/pdf', pdf: { version: '1.10.100', info: [Object], metadata: null, totalPages: 33 }, loc: { pageNumber: 1 } } } ```

after:

```js Document { pageContent: 'Structured Denoising Diffusion Models in Discrete\n' + 'State-Spaces\n' + 'Jacob Austin\n' + '∗\n' + ', Daniel D. Johnson\n' + '∗\n' + ', Jonathan Ho, Daniel Tarlow & Rianne van den Berg\n' + '†\n' + 'Google Research, Brain Team\n' + '{jaaustin,ddjohnson,jonathanho,dtarlow,riannevdberg}@google.com\n' + 'Abstract\n' + 'Denoising diffusion probabilistic models (DDPMs) [19] have shown impressive\n' + 'results on image and waveform generation in continuous state spaces. Here, we\n' + 'introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-\n' + 'like generative models for discrete data that generalize the multinomial diffusion\n' + 'model of Hoogeboom et al. [20] , by going beyond corruption processes with uni-\n' + 'form transition probabilities. This includes corruption with transition matrices that\n' + 'mimic Gaussian kernels in continuous space, matrices based on nearest neighbors\n' + 'in embedding space, and matrices that introduce absorbing states. The third al-\n' + 'lows us to draw a connection between diffusion models and autoregressive and\n' + 'mask-based generative models. We show that the choice of transition matrix is an\n' + 'important design decision that leads to improved results in image and text domains.\n' + 'We also introduce a new loss function that combines the variational lower bound\n' + 'with an auxiliary cross entropy loss. For text, this model class achieves strong\n' + 'results on character-level text generation while scaling to large vocabularies on\n' + 'LM1B. 
On the image dataset CIFAR-10, our models approach the sample quality\n' + 'and exceed the log-likelihood of the continuous-space DDPM model.\n' + '1 Introduction\n' + 'Generative modeling is a core problem in machine learning, useful both for benchmarking our ability\n' + 'to capture statistics of natural datasets and for downstream applications that require generating\n' + 'high-dimensional data like images, text, and speech waveforms. There has been a great deal of\n' + 'progress with the development of methods like GANs [15 , 4], VAEs [25, 35 ], large autoregressive\n' + 'neural network models [51 , 50, 52 ], normalizing flows [ 34 , 12, 24 , 32 ], and others, each with their\n' + 'own tradeoffs in terms of sample quality, sampling speed, log-likelihoods, and training stability.\n' + 'Recently, diffusion models [ 43] have emerged as a compelling alternative for image [ 19, 46 ] and au-\n' + 'dio [7, 26] generation, achieving comparable sample quality to GANs and log-likelihoods comparable\n' + 'to autoregressive models with fewer inference steps. A diffusion model is a parameterized Markov\n' + 'chain trained to reverse a predefined forward process, which is a stochastic process constructed to\n' + 'gradually corrupt training data into pure noise. Diffusion models are trained using a stable objective\n' + 'closely related to both maximum likelihood and score matching [ 21, 53 ], and they admit faster\n' + 'sampling than autoregressive models by using parallel iterative refinement [30, 45, 47, 44].\n' + 'Although diffusion models have been proposed in both discrete and continuous state spaces [43 ],\n' + 'most recent work has focused on Gaussian diffusion processes that operate in continuous state spaces\n' + '(e.g. for real-valued image and waveform data). 
Diffusion models with discrete state spaces have\n' + 'been explored for text and image segmentation domains [20], but they have not yet been demonstrated\n' + 'as a competitive model class for large scale text or image generation.\n' + '35th Conference on Neural Information Processing Systems (NeurIPS 2021).\n' + '∗\n' + 'Equal contributions\n' + '†\n' + 'Now at Microsoft Research\n' + 'arXiv:2107.03006v3 [cs.LG] 22 Feb 2023', metadata: { source: 'blob', blobType: 'application/pdf', pdf: { version: '3.9.179', info: [Object], metadata: null, totalPages: 33 }, loc: { pageNumber: 1 } } } ```
jacoblee93 commented 11 months ago

This is awesome - .join(' ') seems better for a PDF I tried and an extra space is probably better than no extra space. Sorry for losing track of this one.

Willianwg commented 11 months ago

> This is awesome - .join(' ') seems better for a PDF I tried and an extra space is probably better than no extra space. Sorry for losing track of this one.

Sorry, I don't get it. In this problem the `'\n'` appears between the words. For example, the word "person" would output `'per\n' + 'son\n'` or something like that.

jacoblee93 commented 11 months ago

I had a PDF that returned no spaces with `.join("")` in between; see the added test.

Not sure which is more common in the wild, or if there's a way to get it working well for both.
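One possible compromise, sketched below under the assumption that items on the same y-coordinate belong to the same visual line (`joinItems` and the mock data are hypothetical, not shipped loader code): insert a newline on a y-change and a space otherwise, combining the two heuristics from this thread.

```javascript
// Combine both heuristics: newline on a y-coordinate change,
// space between items on the same visual line.
function joinItems(items) {
  let lastY;
  const parts = [];
  for (const item of items) {
    if (!("str" in item)) continue;
    if (lastY !== undefined && lastY !== item.transform[5]) {
      parts.push("\n"); // new visual line
    } else if (parts.length > 0) {
      parts.push(" "); // same line: separate items with a space
    }
    parts.push(item.str);
    lastY = item.transform[5];
  }
  return parts.join("");
}

const mockItems = [
  { str: "Hello", transform: [1, 0, 0, 1, 50, 700] },
  { str: "world", transform: [1, 0, 0, 1, 120, 700] },
  { str: "Next", transform: [1, 0, 0, 1, 50, 680] },
];

console.log(joinItems(mockItems)); // "Hello world\nNext"
```

The caveat is the one raised earlier in the thread: for PDFs that split a single word across items (like the `'per' + 'son'` example), the extra space reintroduces breaks inside words, so neither join is universally right.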