Open converseKarl opened 5 months ago
Not sure I understand. What is the use case?
I was hoping there was an HTML text extraction method so metadata can be added before passing the text into the splitter service. I might have been thinking this was all baked into LangChain. If I'm wrong, by all means close this. The idea was to add supporting data / metadata to the HTML doc that's been beautifully souped, and then pass it in to be split after pre-processing via one of my own custom services.
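To make the idea concrete, here is a rough sketch of that pre-processing step: pull the text out of the HTML, attach metadata, and hand the result on to a splitter. Everything here is an assumption for illustration only. `HtmlDocument`, `extractText` and `preprocessHtml` are hypothetical names, not LangChain APIs, and the regex-based tag stripping is a stand-in for a real HTML parser.

```typescript
// Hypothetical shape, loosely modelled on a loader-style document object.
interface HtmlDocument {
  pageContent: string;
  metadata: Record<string, string>;
}

// Naive text extraction: drop <script>/<style> blocks, then strip the
// remaining tags and collapse whitespace. A real pipeline would likely
// use a proper HTML parser instead of regexes.
function extractText(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical pre-processing step: attach caller-supplied metadata to the
// extracted text so it travels with the document into the splitter.
function preprocessHtml(
  html: string,
  metadata: Record<string, string>
): HtmlDocument {
  return { pageContent: extractText(html), metadata: { ...metadata } };
}

const doc = preprocessHtml(
  "<html><body><h1>Pricing</h1><script>track()</script><p>Plans start at $10.</p></body></html>",
  { source: "https://example.com/pricing", section: "pricing" }
);
```

The resulting `doc` could then be passed to whatever splitter service comes next, with the metadata carried alongside the text.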
OK, I have a use case for this. There are two types of website. With static HTML you can add web URLs and loaders, all fine. Some dynamic pages, though, run Ajax or scripts to generate content, or require login. This second type doesn't render pages immediately, and quite a few websites out there work this way (CRM backends, Salesforce, Ajax, etc.). So the page shows, but there is a delay before the content in it renders.
Point is this
Just sharing my thoughts with everyone!
In langchain:
// Initialize the HTMLLoader with the HTML content
const loader = new HTMLLoader(htmlContent);
// Load the document (parsing the HTML)
const document = await loader.load();
This would cover dynamically rendered pages, rather than assuming all pages are static, which the webLoader does.
many thanks
Not the same as adding a WebLoader, but LangChain has this hook for passing in HTML page content, plus some other settings. Under certain conditions this could be very useful for customizing or pre-processing the HTML content before passing it to the loader.