mpaepper / content-chatbot

Build a chatbot or Q&A bot of your website's content
https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your-website-using-langchain/

Problem with length #3

Closed khaoss85 closed 1 year ago

khaoss85 commented 1 year ago

    Traceback (most recent call last):
      File "/Users/pelleri/Desktop/dahu/content-chatbot-main/create_embeddings.py", line 49, in <module>
        store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/langchain/vectorstores/faiss.py", line 250, in from_texts
        embeddings = embedding.embed_documents(texts)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 254, in embed_documents
        response = embed_with_retry(
                   ^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 53, in embed_with_retry
        return _completion_with_retry(**kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
        return self(f, *args, **kw)
               ^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
        do = self.iter(retry_state=retry_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
        return fut.result()
               ^^^^^^^^^^^^
      File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 449, in result
        return self.__get_result()
               ^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
        raise self._exception
      File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
        result = fn(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 51, in _completion_with_retry
        return embeddings.client.create(**kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
        response = super().create(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
        response, _, api_key = requestor.request(
                               ^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/openai/api_requestor.py", line 226, in request
        resp, got_stream = self._interpret_response(result, stream)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/lib/python3.11/site-packages/openai/api_requestor.py", line 619, in _interpret_response
        self._interpret_response_line(
      File "/opt/homebrew/lib/python3.11/site-packages/openai/api_requestor.py", line 679, in _interpret_response_line
        raise self.handle_error_response(
    openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 8951 tokens (8951 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
    pelleri-macpro14:content-chatbot-main pelleri$

mpaepper commented 1 year ago

Hi @khaoss85!

Did you change anything in create_embeddings.py?

It's normally set to chunk the content into pieces of 1500 tokens, so this should not happen.

It looks like the splitting might not be working correctly on your side?
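
One way to check whether splitting is the problem is to look for chunks that are far larger than the configured size before they are sent to the embeddings API. The helper below is only a sketch and not part of the repository: `find_oversized_chunks` and `max_chars` are hypothetical names, `docs` and `metadatas` are assumed to be the parallel lists passed to `FAISS.from_texts`, each metadata entry is assumed to carry a `source` key, and the check counts characters rather than tokens, so it gives a rough signal only.

```python
def find_oversized_chunks(docs, metadatas, max_chars=1500):
    """Return (source, length) pairs for chunks longer than max_chars.

    Hypothetical helper: `docs` and `metadatas` are assumed to be the
    parallel lists built in create_embeddings.py, and each metadata
    dict is assumed to have a "source" key.
    """
    oversized = []
    for text, meta in zip(docs, metadatas):
        if len(text) > max_chars:
            oversized.append((meta.get("source", "unknown"), len(text)))
    # Largest offenders first, so the problematic page is easy to spot.
    return sorted(oversized, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up data for illustration only.
    docs = ["short chunk", "x" * 40000]
    metadatas = [{"source": "https://example.com/a"},
                 {"source": "https://example.com/b"}]
    for source, length in find_oversized_chunks(docs, metadatas):
        print(f"{source}: {length} characters")
```

If a single source shows up with a chunk many times larger than the limit, that page is a likely candidate for the error in the traceback.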

zodvik commented 1 year ago

@mpaepper I get the same error without modifying anything in create_embeddings.py:

    python3 create_embeddings.py -s https://www.phonepe.com/sitemap.xml -f https://www.phonepe.com/

mpaepper commented 1 year ago

The culprit is your page https://www.phonepe.com/press, which gets parsed into one long text that cannot be chunked up (the code splits on the line break "\n").

So you either need to look into why that particular page fails and how to handle it, or exclude it in create_embeddings.py, for example:

    for info in raw['urlset']['url']:
        url = info['loc']
        if args.filter in url and 'https://www.phonepe.com/press' not in url:
            pages.append({'text': extract_text_from(url), 'source': url})
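
As an alternative to excluding the page, a text without usable line breaks could be cut by length as a fallback. The following is a minimal sketch under stated assumptions, not the repository's code: `split_with_fallback` is a hypothetical function, it counts characters rather than tokens, and the hard cut may land in the middle of a word.

```python
def split_with_fallback(text, chunk_size=1500, separator="\n"):
    """Split on `separator`, merging pieces up to chunk_size characters.

    Any single piece longer than chunk_size is hard-cut by length, so
    no returned chunk exceeds chunk_size (hypothetical helper).
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        # Hard-cut pieces that cannot fit into a single chunk on their own.
        while len(piece) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:chunk_size])
            piece = piece[chunk_size:]
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # The merged chunk would be too long: flush it and start anew.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # A text without any line break, like the page described above.
    long_text = "x" * 4000
    print([len(chunk) for chunk in split_with_fallback(long_text)])
```

Because the limit reported in the error is in tokens, a character-based `chunk_size` is only a proxy and should be chosen conservatively.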

zodvik commented 1 year ago

whoa, thanks for the lightning reply. that makes sense.