langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

MongoDBAtlasVectorSearch: PDFs exceeding 100 pages cannot be processed. How can this be resolved? #26518

Open GzRichChen opened 1 month ago

GzRichChen commented 1 month ago

Checked other resources

Example Code

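    # Import paths assumed from the packages listed under System Info.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.embeddings import ZhipuAIEmbeddings
    from langchain_mongodb import MongoDBAtlasVectorSearch
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # file_path, EMBEDDING_MODEL, MODEL_API_KEY and collection are defined elsewhere.
    loader = PyPDFLoader(file_path)
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=20,
        length_function=len,
        add_start_index=True,
    )
    embed_model = ZhipuAIEmbeddings(
        model=EMBEDDING_MODEL,
        api_key=MODEL_API_KEY,
    )

    docs = text_splitter.split_documents(data)

    vector_store = MongoDBAtlasVectorSearch.from_documents(
        documents=docs,
        embedding=embed_model,
        collection=collection,
        index_name="vector_index",
    )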

Error Message and Stack Trace (if applicable)

raise self._make_status_error(err.response) from None
zhipuai.core._errors.APIRequestFailedError: Error code: 400, with error text {"error":{"code":"1210","message":"API 调用参数有误,请检查文档。"}}
(Translation of the message: "Invalid API call parameters; please check the documentation.")

Description

When the PDF has more than 100 pages, MongoDBAtlasVectorSearch.from_documents fails with the zhipuai APIRequestFailedError (HTTP 400, code 1210) shown above, and the PDF cannot be processed.
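One thing that might be worth trying, assuming the 400 response is caused by too many chunks being sent to the embedding endpoint in a single request (the trace does not confirm this, it is only a guess): add the documents in small batches instead of all at once. The sketch below is not part of LangChain; `batched`, `add_in_batches`, and the default `batch_size=32` are hypothetical names and values chosen for illustration. It only relies on the vector store exposing `add_documents`, which LangChain vector stores do.

```python
from typing import Any, Iterator, List, Sequence


def batched(items: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def add_in_batches(vector_store: Any, docs: Sequence[Any], batch_size: int = 32) -> int:
    """Add `docs` to `vector_store` a few at a time and return how many were added.

    The default batch_size is an illustrative guess, not a documented API limit.
    """
    added = 0
    for batch in batched(docs, batch_size):
        # Each call hands only this batch to the store for embedding and insertion.
        vector_store.add_documents(batch)
        added += len(batch)
    return added
```

With the objects from the example code, this could be used by constructing the store directly, e.g. `MongoDBAtlasVectorSearch(collection=collection, embedding=embed_model, index_name="vector_index")`, and then calling `add_in_batches(vector_store, docs)` in place of `from_documents`. If the error persists even with a batch size of 1, the cause is likely something else, such as an empty chunk or an invalid model name.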

System Info

aenum==3.1.15 aiofiles==24.1.0 aiohappyeyeballs==2.3.5 aiohttp==3.10.3 aiolimiter==1.1.0 aiosignal==1.3.1 annotated-types==0.7.0 anyio==4.4.0 anytree==2.12.1 appnope==0.1.4 APScheduler==3.10.4 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 arrow==1.3.0 asgiref==3.8.1 asteval==1.0.2 asttokens==2.4.1 async-lru==2.0.4 attrs==24.2.0 autograd==1.7.0 azure-common==1.1.28 azure-core==1.30.2 azure-identity==1.17.1 azure-search-documents==11.5.1 azure-storage-blob==12.22.0 babel==2.16.0 backoff==2.2.1 bcrypt==4.2.0 beartype==0.18.5 beautifulsoup4==4.12.3 bleach==6.1.0 build==1.2.1 cachetools==5.4.0 certifi==2024.7.4 cffi==1.17.0 chardet==5.2.0 charset-normalizer==3.3.2 chroma-hnswlib==0.7.6 chromadb==0.5.5 click==8.1.7 cloudpickle==3.0.0 coloredlogs==15.0.1 comm==0.2.2 contourpy==1.2.1 cramjam==2.8.3 cryptography==43.0.0 cycler==0.12.1 dask==2024.8.1 dask-expr==1.1.11 dataclasses-json==0.6.7 datashaper==0.0.49 debugpy==1.8.5 decorator==5.1.1 defusedxml==0.7.1 Deprecated==1.2.14 deprecation==2.1.0 devtools==0.12.2 dill==0.3.8 diskcache==5.6.3 distro==1.9.0 dnspython==2.6.1 environs==11.0.0 executing==2.0.1 faiss-cpu==1.8.0.post1 fastapi==0.112.2 fastjsonschema==2.20.0 fastparquet==2024.5.0 filelock==3.15.4 flatbuffers==24.3.25 fonttools==4.53.1 fqdn==1.5.1 frozenlist==1.4.1 fsspec==2024.6.1 funcsigs==1.0.2 future==1.0.0 gensim==4.3.3 google-auth==2.34.0 googleapis-common-protos==1.65.0 graphrag==0.3.1 graspologic==3.4.1 graspologic-native==1.2.1 grpcio==1.66.1 gs-quant==1.0.108 h11==0.14.0 hs-config==0.1.2 html5tagger==1.3.0 httpcore==1.0.5 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.24.6 humanfriendly==10.0 hyppo==0.4.0 idna==3.7 importlib_metadata==8.4.0 importlib_resources==6.4.4 inflection==0.5.1 ipykernel==6.29.5 ipython==8.26.0 ipywidgets==8.1.3 isodate==0.6.1 isoduration==20.11.0 jedi==0.19.1 Jinja2==3.1.4 jiter==0.5.0 joblib==1.4.2 json5==0.9.25 json_repair==0.26.0 jsonpatch==1.33 jsonpointer==3.0.0 jsonschema==4.23.0 jsonschema-specifications==2023.12.1 
jupyter==1.0.0 jupyter-console==6.6.3 jupyter-events==0.10.0 jupyter-lsp==2.2.5 jupyter_client==8.6.2 jupyter_core==5.7.2 jupyter_server==2.14.2 jupyter_server_terminals==0.5.3 jupyterlab==4.2.4 jupyterlab_pygments==0.3.0 jupyterlab_server==2.27.3 jupyterlab_widgets==3.0.11 kiwisolver==1.4.5 kubernetes==30.1.0 lancedb==0.11.0 langchain==0.2.13 langchain-community==0.2.12 langchain-core==0.2.30 langchain-mongodb==0.1.8 langchain-openai==0.1.21 langchain-text-splitters==0.2.2 langsmith==0.1.99 linkify-it-py==2.0.3 llvmlite==0.43.0 lmfit==1.3.2 locket==1.0.0 loguru==0.7.2 lxml==5.3.0 markdown-it-py==3.0.0 MarkupSafe==2.1.5 marshmallow==3.21.3 matplotlib==3.9.2 matplotlib-inline==0.1.7 mdit-py-plugins==0.4.1 mdurl==0.1.2 mistune==3.0.2 mmh3==4.1.0 monotonic==1.6 more-itertools==10.4.0 motor==3.1.2 mpmath==1.3.0 msal==1.30.0 msal-extensions==1.2.0 msgpack==1.0.8 multidict==6.0.5 mypy-extensions==1.0.0 nbclient==0.10.0 nbconvert==7.16.4 nbformat==5.10.4 nest-asyncio==1.6.0 networkx==3.3 nltk==3.9.1 notebook==7.2.1 notebook_shim==0.2.4 numba==0.60.0 numpy==1.26.4 oauthlib==3.2.2 onnxruntime==1.19.0 openai==1.40.6 opentelemetry-api==1.27.0 opentelemetry-exporter-otlp-proto-common==1.27.0 opentelemetry-exporter-otlp-proto-grpc==1.27.0 opentelemetry-instrumentation==0.48b0 opentelemetry-instrumentation-asgi==0.48b0 opentelemetry-instrumentation-fastapi==0.48b0 opentelemetry-proto==1.27.0 opentelemetry-sdk==1.27.0 opentelemetry-semantic-conventions==0.48b0 opentelemetry-util-http==0.48b0 opentracing==2.4.0 orjson==3.10.6 overrides==7.7.0 packaging==24.1 pandas==2.2.2 pandocfilters==1.5.1 parso==0.8.4 partd==1.4.2 patsy==0.5.6 pexpect==4.9.0 pillow==10.4.0 platformdirs==4.2.2 portalocker==2.10.1 posthog==3.6.0 POT==0.9.4 prometheus_client==0.20.0 prompt_toolkit==3.0.47 protobuf==4.25.4 psutil==6.0.0 ptyprocess==0.7.0 pure_eval==0.2.3 py==1.11.0 pyaml-env==1.2.1 pyarrow==15.0.2 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycparser==2.22 pydantic==2.8.2 pydantic-settings==2.3.4 
pydantic_core==2.20.1 pydash==6.0.2 Pygments==2.18.0 PyJWT==2.8.0 pylance==0.15.0 pymongo==4.8.0 pynndescent==0.5.13 pyparsing==3.1.4 pypdf==4.3.1 PyPDF2==3.0.1 PyPika==0.48.9 pyproject_hooks==1.1.0 python-dateutil==2.9.0.post0 python-docx==1.1.2 python-dotenv==1.0.1 python-json-logger==2.0.7 pytz==2024.1 PyYAML==6.0.2 pyzmq==26.1.0 qtconsole==5.5.2 QtPy==2.4.1 ratelimiter==1.2.0.post0 referencing==0.35.1 regex==2024.7.24 requests==2.32.3 requests-oauthlib==2.0.0 retry==0.9.2 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rich==13.7.1 rpds-py==0.20.0 rsa==4.9 sanic==24.6.0 sanic-api==0.2.9 sanic-base-extension==0.2.0 sanic-ext==23.12.0 sanic-motor==0.7.0 sanic-routing==23.12.0 scikit-learn==1.5.1 scipy==1.12.0 seaborn==0.13.2 Send2Trash==1.8.3 shellingham==1.5.4 six==1.16.0 smart-open==7.0.4 sniffio==1.3.1 soupsieve==2.5 SQLAlchemy==2.0.32 stack-data==0.6.3 starlette==0.38.2 statsmodels==0.14.2 swifter==1.4.0 sympy==1.13.2 tenacity==8.5.0 terminado==0.18.1 textual==0.76.0 threadpoolctl==3.5.0 tiktoken==0.7.0 tinycss2==1.3.0 tokenizers==0.20.0 toolz==0.12.1 tornado==6.4.1 tqdm==4.66.5 tracerite==1.1.1 traitlets==5.14.3 twython==3.9.1 typer==0.12.5 types-python-dateutil==2.9.0.20240316 typing-inspect==0.9.0 typing_extensions==4.12.2 tzdata==2024.1 tzlocal==5.2 uc-micro-py==1.0.3 ujson==5.10.0 umap-learn==0.5.6 umongo==3.1.0 uncertainties==3.2.2 uri-template==1.3.0 urllib3==2.2.2 uvicorn==0.30.6 uvloop==0.20.0 watchfiles==0.24.0 wcwidth==0.2.13 webcolors==24.6.0 webencodings==0.5.1 websocket-client==1.8.0 websockets==12.0 widgetsnbextension==4.0.11 wrapt==1.16.0 yarl==1.9.4 zhipuai==2.1.4.20230812 zipp==3.20.0

arnavp103 commented 23 hours ago

Hello, we've noticed an issue with the way large PDFs are passed to handlers from MongoDB. My team and I are from the University of Toronto; we are going to look into this issue and hope to have a working PR soon.