langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

PyPDF2 used in Google Drive loader has vulnerability issues #10832

Closed joe-barhouch closed 10 months ago

joe-barhouch commented 1 year ago

System Info

The GoogleDrive loader currently uses the PyPDF2 library instead of pypdf. Since PyPDF2 has been merged back into the original pypdf project, it shouldn't be used anymore. This is the vulnerability report that gets picked up: https://data.safetycli.com/v/59234/f17/?utm_source=pyupio&utm_medium=redirect&utm_campaign=pyup_rd&utm_id=081 Note that it is fixed in pypdf.

Also, I tried to swap in a different loader via file_loader_cls, but it doesn't work.

Who can help?

No response

Reproduction

There is no functionality issue, but https://data.safetycli.com/v/59234/f17/?utm_source=pyupio&utm_medium=redirect&utm_campaign=pyup_rd&utm_id=0817&utm_content=data shows the reported vulnerability.

Expected behavior

This vulnerability warning should not be raised.

dosubot[bot] commented 1 year ago

🤖

Thank you for bringing this to our attention. It seems that the use of PyPDF2 in the GoogleDrive loader of LangChain could indeed be a potential security risk due to the mentioned vulnerability.

As per the context provided, there was a similar issue titled "Bugs in GoogleDriveLoader.file_loader_cls", which was resolved by suggesting the standalone component langchain-googledrive instead of the langchain implementation. It was also suggested to remove the PyPDF2 dependency, since it is not needed and the library is no longer maintained (in favor of pypdf).

The version of PyPDF2 used in the GoogleDrive loader of LangChain is not specified in the provided context. Only the version of 'pypdf' is specified, as "^3.4.0". Please note that 'pypdf' and 'PyPDF2' are different libraries.

As for the reason why PyPDF2 was chosen over PyPDF in the GoogleDrive loader, I wasn't able to find an answer within the repository. It's possible that the decision was made due to specific requirements at the time of implementation.

Given the potential security risk and the fact that PyPDF2 is no longer maintained, it would be advisable to switch to PyPDF or another actively maintained library. I will create an issue for this and we will prioritize it accordingly.

As for the issue with file_loader_cls, could you please provide more details about the error you're encountering? This will help us understand the problem better and provide a more accurate solution.

Thank you for your patience and understanding.

joe-barhouch commented 1 year ago

As for file_loader_cls, I wanted to pass UnstructuredPDFLoader or PdfReader, but those accept a file_path argument and won't work with the GoogleDriveLoader, since it requires an IO-based loader.

ventz commented 1 year ago

@joe-barhouch As an aside - I found it obnoxious that they don't support both file_path and BytesIO.

In our case, since I had already read the files (from a context which provides me the bytes), I used tempfile to create a file handle and passed that. (You can pass the temporary file's .name attribute as file_path, and it works perfectly.)
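A minimal sketch of that tempfile workaround, for readers who want to see its shape. The helper name load_bytes_via_tempfile and the load_from_path callback are hypothetical and not part of LangChain. The callback stands in for whatever path-based loader you use, and it is assumed to finish reading the file before it returns.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_bytes_via_tempfile(
    data: bytes,
    load_from_path: Callable[[str], T],  # hypothetical: any loader that takes a file path
    suffix: str = ".pdf",
) -> T:
    """Write in-memory bytes to a temporary file and hand its path to a path-based loader."""
    # delete=False so the file can be reopened by name after closing
    # (reopening a still-open NamedTemporaryFile fails on Windows).
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # flush to disk before the loader opens the path
        return load_from_path(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)  # remove the temporary file once loading is done


def read_back(path: str) -> bytes:
    """Stand-in for a real path-based loader: just reads the file contents."""
    with open(path, "rb") as f:
        return f.read()


content = load_bytes_via_tempfile(b"%PDF-1.4 demo", read_back)
```

With a real loader, the callback would construct the loader from the path and call its load method inside the callback, so that parsing completes before the temporary file is removed.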

dosubot[bot] commented 11 months ago

Hi, @joe-barhouch,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue was raised by you regarding the use of the PyPDF2 library in the Google Drive loader, which has known vulnerability issues. Attempts to change the loader using file_loader_cls have been unsuccessful. I suggested switching to PyPDF or another actively maintained library due to the potential security risk and the fact that PyPDF2 is no longer maintained. Another user, ventz, provided a workaround using tempfile to create a file handle and pass it to the GoogleDriveLoader.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or need assistance, feel free to reach out.