langchain-ai / chat-langchain

https://chat.langchain.com
MIT License
5.01k stars 1.17k forks

It does not work. Either take it down or fix it. #102

Open vmajor opened 1 year ago

vmajor commented 1 year ago

Has many errors. URL is incorrect for fetch, ingest.py gives an error even when the URL is changed to something that works and enough content is fetched. Too much effort just to try what is meant to be a demo application.

henning101 commented 1 year ago

Unfortunately I run into the same issues. The URL seems to be incorrect and FAISS throws out of range errors.

mooerccx commented 12 months ago

I encountered the same problem, which prevented me from loading the LangChain docs. Isn't the main point of this project to make interacting with the LangChain docs more convenient? This bug undermines the whole project, and I hope it can be resolved soon.

mooerccx commented 12 months ago

Perhaps we can manually save the doc for the program to parse?

vmajor commented 12 months ago

The issue is intent. The use case of this application is to ease us into the LangChain ecosystem, and it is not doing that. Second, the team's lack of attention to these bug reports makes me feel that writing my own solution from scratch would be far simpler than learning a broken, unsupported, unmaintained framework. Perhaps harsh, but it is factual.

...and no, ingest.py will fail regardless.

urbanscribe commented 12 months ago

broken. why have it up.

joaocarlosleme commented 12 months ago

I was able to get it working after fixing the errors. Try this fork. I already made a pull request, but it has not been approved yet. The one issue you might hit is the OpenAI API rate limit, because there is a lot of content on the site to ingest.
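On the rate-limit point: one simple way to ride out per-minute request caps during a large ingest is to retry with exponential backoff. The sketch below is illustrative only; `with_retries` is a hypothetical helper name, not part of LangChain or the fork.

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception.

    Useful when an API (e.g. an embeddings endpoint) intermittently
    rejects requests with rate-limit errors during a long ingest run.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the original error
            time.sleep(base_delay * 2 ** attempt)
```

You could then wrap the expensive call, e.g. `vectorstore = with_retries(lambda: FAISS.from_documents(documents, embeddings))`.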

urbanscribe commented 12 months ago

thanks @joaocarlosleme

I could not get your fork to work (it went further than this repo, though). It would be good to have this running as an interactive knowledge base while working in LangChain.

this is where I ended up

python ./ingest.py

/Users/alexfuchs/opt/anaconda3/envs/langchain/lib/python3.11/site-packages/langchain/document_loaders/readthedocs.py:48: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 48 of the file /Users/alexfuchs/opt/anaconda3/envs/langchain/lib/python3.11/site-packages/langchain/document_loaders/readthedocs.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

  _ = BeautifulSoup(

Traceback (most recent call last):
  File "/Users/~/github/chat-langchain/./ingest.py", line 36, in <module>
    ingest_docs()
  File "/Users/~/github/chat-langchain/./ingest.py", line 28, in ingest_docs
    vectorstore = FAISS.from_documents(documents, embeddings)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/~/opt/anaconda3/envs/langchain/lib/python3.11/site-packages/langchain/vectorstores/base.py", line 413, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
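A failure inside `FAISS.from_documents` (including the "out of range" errors mentioned earlier in this thread) is often a symptom of the loader returning zero documents because the docs were never fetched or the path is wrong. A minimal guard, sketched here with a hypothetical helper name (`require_documents` is not part of LangChain), would fail early with a readable message:

```python
def require_documents(documents, source_hint="api.python.langchain.com/en/latest/"):
    """Raise a clear error instead of letting the vectorstore build fail
    downstream when the loader returned nothing (e.g. because the docs
    directory is missing or the fetch URL was wrong)."""
    if not documents:
        raise ValueError(
            f"No documents loaded from '{source_hint}'. "
            "Re-run the fetch step and check the path before ingesting."
        )
    return documents
```

Calling it right after `loader.load()` turns an opaque traceback into an actionable message.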

joaocarlosleme commented 12 months ago

@urbanscribe usually a warning is just that; the code should work fine despite it. If there is an ERROR, that is where the attention should be. Did you get the content from the URL saved in a local directory `api.python.langchain.com`? What ERROR do you get when running ingest.py?

I'm using Python 3.10 and noticed you are on 3.11. Not sure if that changes anything.
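To check whether the fetch step actually saved content locally before running ingest.py, a quick count of downloaded HTML files is enough. This is a small illustrative sketch; `count_html_files` is a hypothetical helper, not part of either repo:

```python
import os

def count_html_files(root="api.python.langchain.com/en/latest/"):
    """Walk the downloaded docs tree and count .html files, so you can
    confirm the wget step saved content before running ingest.py."""
    if not os.path.isdir(root):
        return 0
    return sum(
        1
        for _dirpath, _dirs, files in os.walk(root)
        for name in files
        if name.endswith(".html")
    )
```

If this prints 0, ingest.py has nothing to parse and the problem is in the fetch step, not the embedding step.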

urbanscribe commented 11 months ago

@joaocarlosleme thanks very much for writing. I rebuilt the environment with 3.10

./ingest.sh from your repo runs and downloads all the URLs locally, but no vectorstore.pkl is created, so it does not exist and the main script fails.

ingest.sh ends with

--2023-07-29 13:07:12--  https://api.python.langchain.com/en/latest/_modules/index.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘api.python.langchain.com/en/latest/_modules/index.html’

api.python.langchain.com/en/latest/_     [ <=>                                                                 ]  78.89K  --.-KB/s    in 0.005s

2023-07-29 13:07:13 (15.1 MB/s) - ‘api.python.langchain.com/en/latest/_modules/index.html’ saved [80781]

--2023-07-29 13:07:13--  https://api.python.langchain.com/en/latest/_modules/pydantic/config.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2023-07-29 13:07:13 ERROR 404: Not Found.

--2023-07-29 13:07:13--  https://api.python.langchain.com/en/latest/_modules/pydantic/env_settings.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2023-07-29 13:07:13 ERROR 404: Not Found.

--2023-07-29 13:07:13--  https://api.python.langchain.com/en/latest/_modules/pydantic/utils.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2023-07-29 13:07:14 ERROR 404: Not Found.

FINISHED --2023-07-29 13:07:14--
Total wall clock time: 5m 1s
Downloaded: 2030 files, 69M in 8.2s (8.37 MB/s)

main script fails like this

make start
uvicorn main:app --reload --port 9000
INFO:     Will watch for changes in these directories: ['/Users/alexfuchs/Developer/chat-langchain']
INFO:     Uvicorn running on http://127.0.0.1:9000 (Press CTRL+C to quit)
INFO:     Started reloader process [84795] using StatReload
INFO:     Started server process [84797]
INFO:     Waiting for application startup.
ERROR:    Traceback (most recent call last):
  File "/Users/alexfuchs/anaconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/Users/alexfuchs/anaconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/Users/alexfuchs/anaconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup
    await handler()
  File "/Users/alexfuchs/Developer/chat-langchain/main.py", line 24, in startup_event
    raise ValueError("vectorstore.pkl does not exist, please run ingest.py first")
ValueError: vectorstore.pkl does not exist, please run ingest.py first

I will also put this comment on your repo; perhaps the conversation is better there.

joaocarlosleme commented 11 months ago

@urbanscribe the 404 error must have caused ingest.py not to run. Just run ingest.py manually and check if it creates the vectorstore.pkl before running make start.
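Before running `make start`, the same check main.py performs at startup can be done up front. This is an illustrative sketch (`vectorstore_ready` is a hypothetical name; unpickling the real store also requires the faiss package to be importable):

```python
import os
import pickle

def vectorstore_ready(path="vectorstore.pkl"):
    """Mirror main.py's startup check: True only if the pickle file
    exists and can actually be unpickled."""
    if not os.path.exists(path):
        return False
    try:
        with open(path, "rb") as f:
            pickle.load(f)
        return True
    except Exception:
        return False
```

Running ingest.py and then checking `vectorstore_ready()` before launching uvicorn avoids the `ValueError: vectorstore.pkl does not exist` crash at application startup.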

urbanscribe commented 11 months ago

I preferred to separate the ReadTheDocs fetcher from the embedding step and hacked away at embed.py.

This runs for me, in case it helps anyone else. Thanks @joaocarlosleme


"""Load html from files, clean up, split, ingest into Weaviate."""
import pickle
import platform

from dotenv import load_dotenv
from langchain.document_loaders import ReadTheDocsLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS

load_dotenv()

import os

def load_html_docs():
    """Load documents from web pages and return them."""
    if platform.system() == "Windows":
        loader = ReadTheDocsLoader("api.python.langchain.com/en/latest/", "utf-8-sig")
        print("\nusing utf-8-sig windows")
        print(f"Current working directory: {os.getcwd()}")
    else:
        loader = ReadTheDocsLoader("api.python.langchain.com/en/latest/", "utf-8-sig")
        print("\nusing utf-8-sig")
        print(f"Current working directory: {os.getcwd()}")

    raw_documents = loader.load()
    return raw_documents

def create_vectors_and_save(raw_documents):
    print("Raw documents length:", len(raw_documents)) # Print raw_documents length

    """Create vectors from raw documents and save them to a pickle file."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    documents = text_splitter.split_documents(raw_documents)
    print("Documents length:", len(documents)) # Print documents length
    print("Documents split into chunks.")

    embeddings = OpenAIEmbeddings()
    print("Embeddings:", embeddings) # Print embeddings if it's feasible
    print("OpenAI Embeddings created.")

    vectorstore = FAISS.from_documents(documents, embeddings)
    print("Vectorstore created from documents.")

    # Save vectorstore
    with open("vectorstore.pkl", "wb") as f:
        pickle.dump(vectorstore, f)
    print("Vectorstore saved to pickle file.")

def load_local_html_docs():
    """Load locally saved HTML documents and return them."""
    path = "api.python.langchain.com/en/latest/" # Adjust the path as needed
    if platform.system() == "Windows":
        loader = ReadTheDocsLoader(path, "utf-8-sig")
        print("\nusing utf-8-sig windows")
    else:
        loader = ReadTheDocsLoader(path, "utf-8-sig")
        print("\nusing utf-8-sig")

    raw_documents = loader.load()
    return raw_documents

if __name__ == "__main__":
    raw_documents = load_local_html_docs()
    create_vectors_and_save(raw_documents)

# if __name__ == "__main__":
#     raw_documents = load_html_docs()
#     create_vectors_and_save(raw_documents)