llm-tools / embedJs

A NodeJS RAG framework to easily work with LLMs and embeddings
https://www.npmjs.com/package/@llm-tools/embedjs
Apache License 2.0

add HTML page Loader #54

Open converseKarl opened 4 weeks ago

converseKarl commented 4 weeks ago

Not the same as add WebLoader, but LangChain has this hook to pass in an HTML page's content, along with some other settings. Under certain conditions this could be very useful for customizing the HTML content or pre-processing it before passing it to the loader.

adhityan commented 4 weeks ago

Not sure I understand. What is the use case?

converseKarl commented 3 weeks ago

I was hoping there was an HTML text extraction method so meta can be added before passing it into the splitter service. I might have been thinking this was all baked into LangChain. If I'm wrong, by all means close this. The idea was to add supporting data / meta into the HTML doc that's been beautifully souped, and then pass it in to be split after pre-processing via one of my own custom services.
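A minimal sketch of that pre-processing step, assuming the HTML has already been reduced to plain text. The helper name `withMetadata` and the `key: value` header layout are hypothetical, not part of embedJs or LangChain:

```typescript
// Hypothetical helper: prepend supporting metadata to extracted page text
// so that it travels with the content into the splitter.
function withMetadata(text: string, meta: Record<string, string>): string {
  // One "key: value" line per metadata entry.
  const header = Object.entries(meta)
    .map(([key, value]) => `${key}: ${value}`)
    .join("\n");
  // With no metadata, return the text unchanged.
  return header ? `${header}\n\n${text}` : text;
}

const enriched = withMetadata("Quarterly report body text", {
  source: "https://example.com/report",
  section: "finance",
});
```

The enriched string could then be handed to whatever text-based loader or splitter is in use.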

converseKarl commented 2 weeks ago

OK, I have a use case for this. There are two types of website. With static HTML you can add web URLs and loaders, all fine. Some dynamic pages have AJAX or scripts that run to generate content, or require login. This second type doesn't render pages immediately, and quite a few websites out there work this way (CRM backends, Salesforce, AJAX, etc.). So the page shows, but there is a delay in rendering the content in that page.

The point is this:

  1. Static pages work fine using add WebLoader
  2. Dynamic pages will not (to my knowledge)
  3. I know I need to use a headless crawler
  4. But I need a method, once the page is rendered, to get the page content from the headless crawler into the RAG. TextLoader? HTML Loader (not web) or some other loader type? There are more loader types in LangChain. Maybe getting the text from the headless crawler and passing it in as a text loader is the right method?
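A rough sketch of the last step, assuming the headless crawler has already returned the rendered page as an HTML string. `htmlToText` is a hypothetical helper, not an embedJs or LangChain API, and a regex pass like this is only an approximation of a real HTML parser:

```typescript
// Hypothetical helper: reduce already-rendered HTML to plain text.
function htmlToText(html: string): string {
  return html
    // Drop script and style blocks, including their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop all remaining tags.
    .replace(/<[^>]+>/g, " ")
    // Decode a few common entities; &amp; goes last so "&amp;lt;" stays "&lt;".
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    // Collapse runs of whitespace.
    .replace(/\s+/g, " ")
    .trim();
}

// Stand-in for the HTML a headless browser would return after rendering.
const renderedHtml =
  '<html><head><style>p{color:red}</style><script>var a = "<b>";</script></head>' +
  "<body><h1>Title</h1><p>Hello &amp; welcome</p></body></html>";

const pageText = htmlToText(renderedHtml);
```

The resulting `pageText` is an ordinary string, so it could be passed to a text-based loader rather than a URL-based one.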

Just sharing my thoughts with everyone!

converseKarl commented 2 weeks ago

In LangChain:

// Initialize the HTMLLoader with the HTML content
const loader = new HTMLLoader(htmlContent);

// Load the document (parsing the HTML)
const document = await loader.load();

This would cover dynamically rendered pages, rather than assuming all pages are static, which is what the WebLoader does.

Many thanks