Closed: aravindkarnam closed this issue 1 month ago.
@unclecode Did you get a chance to review these specs? I'm blocked on the following 👇 to begin coding the integration.
- [ ] Eventually we would want to make this integration available in the `langchain-community` package. However, we would need some trial and error plus testing before raising a PR directly to that repo. For the first cut of the LangChain integration, we have two options:
  - Option 1: Fork a new repository from the langchain community repo and start the implementation in `/libs/community/langchain_community/document_loaders`. Build a couple of versions, make them available to the community, and finally, when we are ready with integration tests, unit tests, and documentation for a mature first version, raise a PR to the main langchain repo. LangChain's community contribution guidelines suggest this approach.
  - Option 2: Create a new module in the current crawl4ai repository itself, called `Crawl4aiLoader`, that extends the `BaseLoader` class. Developers can import this loader from crawl4ai itself for the first few versions and use it, so we can iron out implementation specifics. Then we can fork the langchain community repo when we are ready and start contributing there.
Hey @aravindkarnam,
I was going through their documentation as well, and I think Option 2 is the way to go—starting with the implementation within Crawl4AI. It gives us the freedom to experiment and refine things without getting bogged down by external guidelines.
Once we’ve got a solid, stable version, we can then think about moving to Option 1 and contributing to the LangChain community repo. This way, we can perfect the integration in our own space first before pushing it out to the broader community.
I do like the recognition Crawl4AI would get with Option 1, but since this is our first integration of this kind, I think it's safer to start with Option 2. So yeah, let's move forward with that. Btw, I will hopefully send the Discord invitation by today; there we can have a more real-time conversation.
@unclecode Done! You can check the branch here. Here's how you can use it:
```python
from crawl4ai.langchain import Crawl4aiLoader

crawl4ai_loader = Crawl4aiLoader(url='https://en.wikipedia.org/wiki/Cricket')
documents = crawl4ai_loader.load()

print(documents[0].page_content)
print(documents[0].metadata)
```
`page_content` consists of the markdown when no extraction strategy is passed. When an extraction strategy is passed, the `page_content` of the document is the `extracted_content`. The entire `CrawlResult` from crawl4ai is accessible through the `metadata` field of the returned documents.
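For clarity, here is a minimal sketch of that mapping (not the code in the branch): it assumes the `CrawlResult` fields listed in the specs below, and it uses `langchain_core.documents.Document`, whose import path may vary with the LangChain version.

```python
# Sketch only: how a crawl4ai CrawlResult maps onto a LangChain Document.
from langchain_core.documents import Document  # import path may vary by version


def to_document(result, used_extraction_strategy: bool = False) -> Document:
    # extracted_content becomes page_content when an extraction strategy was
    # used; otherwise the markdown rendering of the page is used.
    page_content = result.extracted_content if used_extraction_strategy else result.markdown
    # The remaining CrawlResult fields stay reachable via metadata.
    metadata = {
        "cleaned_html": result.cleaned_html,
        "media": result.media,
        "links": result.links,
        "screenshot": result.screenshot,
        "metadata": result.metadata,
    }
    return Document(page_content=page_content, metadata=metadata)
```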
Please let me know if any changes are needed.
PS: For some reason Discord kicked me out of the crawl4ai Discord server automatically 😕. I've tried to join again with the invite link, but it's now invalid. Could you send me an invite again?
@aravindkarnam weird, let me check Discord first
@aravindkarnam check your email again
I'll close this issue; we'll work on the PR.
Requirements

- A `Document` (with attributes `page_content: str` & `metadata: dict`) object returned by a `Crawl4aiLoader` that extends a LangChain Document Loader, so that it can be fed to a chain for further processing.
- `Crawl4aiLoader` should support the complete interface of crawl4ai's `WebCrawler` class. Similarly, the `load` method should support the complete interface of crawl4ai's `run` method.

References
Specs
- [ ] Eventually we would want to make this integration available in the `langchain-community` package. However, we would need some trial and error plus testing before raising a PR directly to that repo. For the first cut of the LangChain integration, we have two options:
  - Option 1: Fork a new repository from the langchain community repo and start the implementation in `/libs/community/langchain_community/document_loaders`. Build a couple of versions, make them available to the community, and finally, when we are ready with integration tests, unit tests, and documentation for a mature first version, raise a PR to the main langchain repo. LangChain's community contribution guidelines suggest this approach.
  - Option 2: Create a new module in the current crawl4ai repository itself, called `Crawl4aiLoader`, that extends the `BaseLoader` class. Developers can import this loader from crawl4ai itself for the first few versions and use it, so we can iron out implementation specifics. Then we can fork the langchain community repo when we are ready and start contributing there.

  IMO, with Option 1 it will be easier to keep up with changes in langchain (via the sync-fork option), while with Option 2 it will be easier to keep up with changes in crawl4ai. But in any case, Option 1 will be unavoidable as crawl4ai gains more adoption.
- [x] In the `__init__` of `Crawl4aiLoader`, a `WebCrawler` must be instantiated with support for its full interface (i.e. `crawler_strategy: CrawlerStrategy`, `always_by_pass_cache: bool` & `verbose: bool`). Also, the `warmup` method must be called and its success must be asserted. Params required for `crawler.run()` should be supported during instantiation itself (a rough sketch appears at the end of this post).
- [x] The `load` method of `Crawl4aiLoader` should call the `run` method of the webcrawler instance with support for its full interface (i.e. `url: str`, `word_count_threshold: int`, `extraction_strategy: ExtractionStrategy`, `chunking_strategy: ChunkingStrategy`, `bypass_cache: bool`, `css_selector: str`, `screenshot: bool`, `user_agent: str`, `verbose: bool`). The `load` method doesn't take any parameters, so all variables required for the `run` method should be passed by the caller during instantiation itself and stored in `self`.
- [x] The `load` method should return a `list[Document]`, which will be passed further down the chain to vector DBs and LLMs (whichever way the user intends to use it). `Document` has two attributes, `page_content: str` and `metadata: dict`. The `run` method returns the `CrawlResult` class, which needs to be packaged as the `Document` class. We can add the `markdown` as `page_content` if no extraction strategy is passed in the `load` method; if an extraction strategy is passed, then add `extracted_content` as `page_content`. The rest of the KVs (i.e. `cleaned_html`, `media`, `links`, `screenshot`, `markdown`, `extracted_content`, `metadata`) in `CrawlResult` can be added as KVs in the `metadata` attribute of the `Document` class, with serialisation if needed.

@unclecode Please let me know if any changes are needed to the specs, and also advise which option (mentioned above) to proceed with for the `Crawl4aiLoader` module.

PS: I'm not paying much attention to error handling in this first cut of the integration, hence didn't cover it in the specs; we'll simply catch and rethrow all errors after logging to the console.
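To make the spec items above concrete, here is a rough, illustrative sketch of what the loader could look like. It is not the branch implementation: the `WebCrawler` keyword names simply follow the spec text above, the LangChain import paths (`langchain_core.document_loaders.BaseLoader`, `langchain_core.documents.Document`) may differ by version, and the exact way to assert `warmup` success is left open.

```python
# Illustrative sketch only (not the branch implementation). Keyword names
# follow the spec text above; import paths may differ by LangChain version.
from typing import List

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

from crawl4ai import WebCrawler


class Crawl4aiLoader(BaseLoader):
    def __init__(
        self,
        url: str,
        crawler_strategy=None,
        always_by_pass_cache: bool = False,
        verbose: bool = False,
        **run_kwargs,  # word_count_threshold, extraction_strategy, chunking_strategy, ...
    ):
        # Spec: instantiate the WebCrawler with its full interface and warm it up.
        self.crawler = WebCrawler(
            crawler_strategy=crawler_strategy,
            always_by_pass_cache=always_by_pass_cache,
            verbose=verbose,
        )
        self.crawler.warmup()  # spec: warmup success should also be asserted

        # Spec: load() takes no arguments, so the url and every run() parameter
        # are captured at instantiation and stored on self.
        self.url = url
        self.run_kwargs = run_kwargs

    def load(self) -> List[Document]:
        result = self.crawler.run(url=self.url, **self.run_kwargs)

        # Same packaging as the mapping sketch earlier in the thread:
        # extracted_content wins when an extraction strategy was supplied,
        # otherwise the markdown rendering becomes page_content.
        page_content = (
            result.extracted_content
            if self.run_kwargs.get("extraction_strategy")
            else result.markdown
        )
        metadata = {
            "cleaned_html": result.cleaned_html,
            "media": result.media,
            "links": result.links,
            "screenshot": result.screenshot,
            "metadata": result.metadata,
        }
        return [Document(page_content=page_content, metadata=metadata)]
```

The "serialisation if needed" note in the spec is likely to matter mostly for the non-primitive metadata values (e.g. `media`, `links`) before the Documents are handed to a vector store.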