Closed: aravindkarnam closed this issue 1 month ago.
@unclecode Did you get a chance to review these specs? I'm blocked on the following 👇 to begin coding the integration.
- [ ] Eventually we would want to make this integration available in the `langchain-community` package. However, we would need some trial and error plus testing before raising a PR directly to that repo. For the first cut of the LangChain integration, we have two options:
  - Option 1: Fork a new repository from the langchain community repo and start the implementation in `/libs/community/langchain_community/document_loaders`. Build a couple of versions, make them available to the community, and finally, when we are ready with integration tests, unit tests, and documentation for a mature first version, raise a PR to the main langchain repo. LangChain's community contribution guidelines suggest this approach.
  - Option 2: Create a new module in the current crawl4ai repository itself, called `Crawl4aiLoader`, that extends the `BaseLoader` class. Developers can import this loader from crawl4ai itself for the first few versions and use it, so we can iron out implementation specifics. Then we can fork the langchain community repo when we are ready and start contributing there.
Hey @aravindkarnam,
I was going through their documentation as well, and I think Option 2 is the way to go—starting with the implementation within Crawl4AI. It gives us the freedom to experiment and refine things without getting bogged down by external guidelines.
Once we’ve got a solid, stable version, we can then think about moving to Option 1 and contributing to the LangChain community repo. This way, we can perfect the integration in our own space first before pushing it out to the broader community.
I do like the recognition Crawl4AI would get with Option 1, but since this is our first integration of this kind, I think it's safer to start with Option 2. So yeah, let's move forward with that. Btw, I will hopefully send the Discord invitation by today; there we can have a more real-time conversation.
@unclecode Done! You can check the branch here. Here's how you can use it:
```python
from crawl4ai.langchain import Crawl4aiLoader

crawl4ai_loader = Crawl4aiLoader(url='https://en.wikipedia.org/wiki/Cricket')
documents = crawl4ai_loader.load()

print(documents[0].page_content)
print(documents[0].metadata)
```
`page_content` consists of the markdown when no extraction strategy is passed. When an extraction strategy is passed, the `page_content` of the document is the `extracted_content`. The entire `CrawlResult` from crawl4ai is accessible through the `metadata` field of the returned documents.
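For clarity, here is a minimal sketch of that mapping (not the code in the branch): it assumes the `CrawlResult` fields listed in the specs below, and it uses `langchain_core.documents.Document`, whose import path may vary with the LangChain version.

```python
# Sketch only: how a crawl4ai CrawlResult maps onto a LangChain Document.
from langchain_core.documents import Document  # import path may vary by version


def to_document(result, used_extraction_strategy: bool = False) -> Document:
    # extracted_content becomes page_content when an extraction strategy was
    # used; otherwise the markdown rendering of the page is used.
    page_content = result.extracted_content if used_extraction_strategy else result.markdown
    # The remaining CrawlResult fields stay reachable via metadata.
    metadata = {
        "cleaned_html": result.cleaned_html,
        "media": result.media,
        "links": result.links,
        "screenshot": result.screenshot,
        "metadata": result.metadata,
    }
    return Document(page_content=page_content, metadata=metadata)
```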
Please let me know if any changes are needed.
PS: For some reason Discord kicked me out of the crawl4ai Discord server automatically 😕. I've tried to join again with the invite link, but it's now invalid. Could you send me an invite again?
@aravindkarnam weird, let me check Discord first
@aravindkarnam check your email again
I'll close this issue; we'll work on the PR.
Requirements

- A `Document` (with attributes `page_content: str` & `metadata: dict`) object returned by a `Crawl4aiLoader` that extends a LangChain Document Loader, so that it can be fed to a chain for further processing.
- `Crawl4aiLoader` should support the complete interface of crawl4ai's `WebCrawler` class. Similarly, the `load` method should support the complete interface of crawl4ai's `run` method.

References
Specs
- [ ] Eventually we would want to make this integration available in the `langchain-community` package. However, we would need some trial and error plus testing before raising a PR directly to that repo. For the first cut of the LangChain integration, we have two options:
  - Option 1: Fork a new repository from the langchain community repo and start the implementation in `/libs/community/langchain_community/document_loaders`. Build a couple of versions, make them available to the community, and finally, when we are ready with integration tests, unit tests, and documentation for a mature first version, raise a PR to the main langchain repo. LangChain's community contribution guidelines suggest this approach.
  - Option 2: Create a new module in the current crawl4ai repository itself, called `Crawl4aiLoader`, that extends the `BaseLoader` class. Developers can import this loader from crawl4ai itself for the first few versions and use it, so we can iron out implementation specifics. Then we can fork the langchain community repo when we are ready and start contributing there.

  IMO, with Option 1 it will be easier to keep up with changes in langchain (via the sync-fork option), while with Option 2 it will be easier to keep up with changes in crawl4ai. But in any case, Option 1 will be unavoidable as crawl4ai gains more adoption.
- [x] In the `__init__` of `Crawl4aiLoader`, a `WebCrawler` must be instantiated with support for its full interface (i.e. `crawler_strategy: CrawlerStrategy`, `always_by_pass_cache: bool` & `verbose: bool`). Also, the `warmup` method must be called and its success must be asserted. Params required for `crawler.run()` should be supported during instantiation itself (a rough sketch appears at the end of this post).
- [x] The `load` method of `Crawl4aiLoader` should call the `run` method of the webcrawler instance with support for its full interface (i.e. `url: str`, `word_count_threshold: int`, `extraction_strategy: ExtractionStrategy`, `chunking_strategy: ChunkingStrategy`, `bypass_cache: bool`, `css_selector: str`, `screenshot: bool`, `user_agent: str`, `verbose: bool`). The `load` method doesn't take any parameters, so all variables required for the `run` method should be passed by the caller during instantiation itself and stored in `self`.
- [x] The `load` method should return a `list[Document]`, which will be passed further down the chain to vector DBs and LLMs (whichever way the user intends to use it). `Document` has two attributes, `page_content: str` and `metadata: dict`. The `run` method returns the `CrawlResult` class, which needs to be packaged as the `Document` class. We can add the `markdown` as `page_content` if no extraction strategy is passed in the `load` method; if an extraction strategy is passed, then add `extracted_content` as `page_content`. The rest of the KVs (i.e. `cleaned_html`, `media`, `links`, `screenshot`, `markdown`, `extracted_content`, `metadata`) in `CrawlResult` can be added as KVs in the `metadata` attribute of the `Document` class, with serialisation if needed.

@unclecode Please let me know if any changes are needed to the specs, and also advise which option (mentioned above) to proceed with for the `Crawl4aiLoader` module.

PS: I'm not paying much attention to error handling in this first cut of the integration, hence didn't cover it in the specs; we'll simply catch and rethrow all errors after logging to the console.
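To make the spec items above concrete, here is a rough, illustrative sketch of what the loader could look like. It is not the branch implementation: the `WebCrawler` keyword names simply follow the spec text above, the LangChain import paths (`langchain_core.document_loaders.BaseLoader`, `langchain_core.documents.Document`) may differ by version, and the exact way to assert `warmup` success is left open.

```python
# Illustrative sketch only (not the branch implementation). Keyword names
# follow the spec text above; import paths may differ by LangChain version.
from typing import List

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

from crawl4ai import WebCrawler


class Crawl4aiLoader(BaseLoader):
    def __init__(
        self,
        url: str,
        crawler_strategy=None,
        always_by_pass_cache: bool = False,
        verbose: bool = False,
        **run_kwargs,  # word_count_threshold, extraction_strategy, chunking_strategy, ...
    ):
        # Spec: instantiate the WebCrawler with its full interface and warm it up.
        self.crawler = WebCrawler(
            crawler_strategy=crawler_strategy,
            always_by_pass_cache=always_by_pass_cache,
            verbose=verbose,
        )
        self.crawler.warmup()  # spec: warmup success should also be asserted

        # Spec: load() takes no arguments, so the url and every run() parameter
        # are captured at instantiation and stored on self.
        self.url = url
        self.run_kwargs = run_kwargs

    def load(self) -> List[Document]:
        result = self.crawler.run(url=self.url, **self.run_kwargs)

        # Same packaging as the mapping sketch earlier in the thread:
        # extracted_content wins when an extraction strategy was supplied,
        # otherwise the markdown rendering becomes page_content.
        page_content = (
            result.extracted_content
            if self.run_kwargs.get("extraction_strategy")
            else result.markdown
        )
        metadata = {
            "cleaned_html": result.cleaned_html,
            "media": result.media,
            "links": result.links,
            "screenshot": result.screenshot,
            "metadata": result.metadata,
        }
        return [Document(page_content=page_content, metadata=metadata)]
```

The "serialisation if needed" note in the spec is likely to matter mostly for the non-primitive metadata values (e.g. `media`, `links`) before the Documents are handed to a vector store.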