langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

pdf parser not working compared to python version #4014

Closed zitongzhang098 closed 5 months ago

zitongzhang098 commented 8 months ago

Hi community: I tried to parse a Chinese PDF with PDFLoader, but it returns a lot of random characters. I then tried PyPDFLoader in Python and it worked as expected. I'm not sure what the problem is. I've attached the PDF file and hope someone can help.

Langchain version

"langchain": "0.1.2"

Code in TypeScript

import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import path from "path";

// Load the PDF; by default PDFLoader returns one Document per page.
const pdfParser = new PDFLoader(path.resolve(__dirname, "./why.pdf"));
const docs = await pdfParser.load();
// Print the extracted text of each page.
docs.forEach((doc) => {
  console.log(doc.pageContent);
});

Result

a lot of whitespace and !"# !!!!!!!!!!!!"

also a lot of warnings

Code in Python

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("why.pdf")
pages = loader.load_and_split()

Result Preview

[Document(page_content='新编\n十万个\n为什么\n第一册\n齐豫生徐茂魁主编\n台海出版社', metadata={'source': 'why.pdf', 'page': 1}), Document(page_content='编辑委员会\n主编:\n齐豫生徐茂魁\n副主编:\n丁华民王茁芝夏于全\n编委:\n殷奎光杨泰峰周兴华\n常桦潘国平田爽\n尹建军周广宇文玉俊\n訾慧敏张桦蒲昭虹\n雷玉东郭强张秀英\n章函谷许之山王咏竹\n郭鹏飞龙维智林家鹤\n蔡磊王冰怿冯文轩\n肇英王连军邹冬红\n印政廖海虹丁荣英\n郭晓溧白德力格尔玛', metadata={'source': 'why.pdf', 'page': 2}), Document(page_content='前言\n时间作为人类时空的第四维,它时钟的标尺已经指向了二\n!!!\n这个单纯然而振奋人心的数字。人们回头看时,人类的历\n史宛如浩瀚的烟波,里面的每一个水滴都饱含着人类双手操作与\n头脑思考而凝聚的精华。\n然而我们不需望洋兴叹,一部百科全书式的著作———《新编\n十万个为什么》在世纪钟声敲响之时,欣然付梓了。它尽其所能\n地对现有的自然与人文科学做了较全面的总结。可以说,这是人\n类献给自己的一份厚重的礼物。\n全书分为\n!"\n册,共近\n#$$\n万字,囊括了人类曾经面对和正\n在面对的各种问题。藉此,崇尚科学的人们,特别是求知若渴的\n青少年,可以———\n一、认识自己。\n在古希腊,就有人提出“认识你自己”的口号。人是万物的\n灵长,人类在具有反思能力之初,就确立了其智者的地位。人自\n身是个小宇宙,在这个宇宙中,作为物质基础的生理机制和作为\n上层建筑的灵魂世界融合为一。人体·保健卷中,你将认识到作\n为生命一员的人体的奥秘;而艺术卷则向你揭示了人的精神世\n界;教育·生活卷介绍了关于成长的烦恼,将一一为您解答从幼\n儿到长大成人过程中的一系列问题。\n二、认识他人。\n你中有我,我中有你,他人是自我的延伸。人群(人类社会\n及其历史)是自我的一面面明镜,照荣辱、映兴衰。文史卷和旅\n游·民俗卷,讲述了人类的精神和行动的轨迹。\n·\n!\n·\n"\n##########\n前言', metadata={'source': 'why.pdf', 'page': 3}),

pdf attached: why.pdf

dosubot[bot] commented 8 months ago

🤖

Hello @zitongzhang098,

Welcome to the LangChain community! I'm Dosu, a bot here to assist you with bugs, answer your questions, and help you become a contributor while we await the human maintainers. Let's solve this issue together!

The random characters and warnings you see when PDFLoader parses a Chinese PDF likely come from PDF.js, the library that the PDFLoader class uses for text extraction. If PDF.js cannot map a font's character codes to Unicode, for example because the character map (CMap) data needed for Chinese fonts is not available to it, the extracted text can come out as unrelated symbols.

In the parse method of the PDFLoader class, PDF.js loads the PDF from the buffer and then retrieves the text content of each page:

const pdf = await getDocument({
  data: new Uint8Array(raw.buffer),
  useWorkerFetch: false,
  isEvalSupported: false,
  useSystemFonts: true,
}).promise;

In contrast, PyPDFLoader in Python uses the pypdf library rather than PDF.js, which may be why it extracts the Chinese text from this file as expected.

To resolve this issue, you might need to use a different library or method to parse the PDF in the PDFLoader class that can handle non-Latin characters correctly. Alternatively, you could try to configure PDF.js to correctly handle non-Latin characters, if possible.
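As a rough illustration of the configuration route: PDF.js's getDocument accepts cMapUrl and cMapPacked parameters that point it at the predefined CMap files used by many CJK fonts. The sketch below wraps a PDF.js-like module so that those parameters are always supplied. The names withCMaps and PdfJsLike are hypothetical. It is also an assumption that this PDFLoader version accepts a custom pdfjs import function, and that missing CMaps are what breaks this particular file.

```typescript
// Minimal shape of the PDF.js module that this sketch relies on.
type GetDocumentParams = Record<string, unknown>;

interface PdfJsLike<T> {
  getDocument: (params: GetDocumentParams) => T;
  version: string;
}

// Hypothetical helper: returns a module-like object whose getDocument
// adds CMap settings unless the caller has already provided them.
function withCMaps<T>(pdfjs: PdfJsLike<T>, cMapUrl: string): PdfJsLike<T> {
  return {
    version: pdfjs.version,
    getDocument: (params) =>
      // Defaults first, so explicit caller parameters win.
      pdfjs.getDocument({ cMapUrl, cMapPacked: true, ...params }),
  };
}

// Possible usage (assumption: PDFLoader takes a `pdfjs` option, and the
// CMap files ship in the pdfjs-dist package under cmaps/):
//
// const loader = new PDFLoader("why.pdf", {
//   pdfjs: async () =>
//     withCMaps(
//       await import("pdfjs-dist/legacy/build/pdf.js"),
//       "./node_modules/pdfjs-dist/cmaps/", // trailing slash is required
//     ),
// });
```

If the output is still garbled with CMaps available, the fonts embedded in the PDF may simply lack usable Unicode mappings, in which case a different extraction approach would be needed.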

You can find more details about the PDFLoader class and its parse method in the langchain/src/document_loaders/fs/pdf.ts file in the LangChainJS repository.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

-Dosu


zitongzhang098 commented 8 months ago

After trying the code above, I still see the same behavior. By the way, the PDF format version is 1.3; I'm not sure if that matters.
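For anyone who wants to check the format version of their own file: a PDF starts with a header such as %PDF-1.3, so the version can be read from the first bytes. The helper below is a minimal sketch, and readPdfVersion is a hypothetical name. The version number alone probably does not explain garbled text, since extraction depends mostly on how the fonts in the file are encoded.

```typescript
// Hypothetical helper: returns e.g. "1.3" for a file starting with "%PDF-1.3",
// or null when no PDF header is found near the start of the data.
function readPdfVersion(bytes: Uint8Array): string | null {
  // Only the beginning of the file is inspected; 1024 bytes is a generous window.
  const head = Array.from(bytes.subarray(0, 1024), (b) =>
    String.fromCharCode(b),
  ).join("");
  const match = /%PDF-(\d+\.\d+)/.exec(head);
  return match?.[1] ?? null;
}

// Possible usage in Node.js (a Buffer is a Uint8Array):
//
// import { readFile } from "fs/promises";
// console.log(readPdfVersion(await readFile("why.pdf")));
```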

dosubot[bot] commented 5 months ago

Hi, @zitongzhang098,

I'm helping the langchainjs team manage their backlog and am marking this issue as stale. From what I understand, you reported that PDFLoader in LangChainJS returns random characters when parsing a Chinese PDF, while PyPDFLoader in Python extracts the expected content. Dosu, a bot, suggested that the issue might be related to how the PDF.js library used in the PDFLoader class handles non-Latin characters. You confirmed that the suggested code did not resolve the issue and mentioned that the PDF format version is 1.3, asking whether that matters.

Could you please confirm if this issue is still relevant to the latest version of the langchainjs repository? If it is, please let the langchainjs team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!

Macbook-Specter commented 5 months ago

@zitongzhang098 Hello, I have run into the same problem. Have you solved it?