n4ze3m / dialoqbase

Create chatbots with ease
https://dialoqbase.n4ze3m.com/
MIT License
1.54k stars, 253 forks

Data Hallucination Issue with Crawler on JavaScript and CSS-heavy Websites #116

Closed: J3E1 closed this issue 7 months ago

J3E1 commented 8 months ago

Issue Description:

I've encountered an issue with data hallucination when using a web crawler on websites with a substantial amount of JavaScript and CSS. The crawler, specifically the CheerioWebBaseLoader, seems to be producing inaccurate or incomplete data due to the dynamic nature of these websites.
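One plausible contributor to noisy chunks, assuming the loader's text output includes the contents of inline `<script>` and `<style>` elements, is that code and stylesheet text gets embedded alongside the real page copy. The dependency-free sketch below shows the general idea of keeping only visible text. `extractVisibleText` is a hypothetical helper written for illustration; it is not part of dialoqbase or LangChain, and it is not the fix that was eventually released.

```typescript
// Hypothetical helper for illustration only (not dialoqbase or LangChain code).
// It uses crude regex cleanup; a real HTML parser is more robust on malformed markup.
const NAMED_ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
};

function extractVisibleText(html: string): string {
  return html
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Drop elements whose contents are code or styling, not page copy.
    .replace(/<(script|style|noscript|template)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop the remaining tags but keep the text between them.
    .replace(/<[^>]+>/g, " ")
    // Decode a handful of common entities in a single pass.
    .replace(/&(?:amp|lt|gt|quot|nbsp|#39);/g, (m) => NAMED_ENTITIES[m] ?? m)
    // Collapse runs of whitespace left behind by the removals.
    .replace(/\s+/g, " ")
    .trim();
}

// Made-up sample page, used only to exercise the helper.
const sample = `
<html><head><style>.hero{color:red}</style>
<script>window.dataLayer = [];</script></head>
<body><h1>Acme &amp; Co</h1><p>We build   software.</p></body></html>`;

console.log(extractVisibleText(sample));
```

This only addresses markup that is already present in the fetched HTML. Content that a site renders client-side with JavaScript never appears in a static fetch, so recovering it would need a loader that executes scripts, such as a headless browser.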

Steps to Reproduce:

  1. Select Crawler as Data Source.
  2. Choose a website with a large amount of JavaScript and CSS, for example https://www.ipangram.com/.
  3. Let the bot finish processing, then ask any question about the website.

Expected Behavior: The LLM should give accurate answers about the website's content.

Actual Behavior: The LLM produces inaccurate answers.

Environment:

  - Operating System: Windows 10
  - Embeddings: Huggingface
  - ChatModel: Fireworks (llama-v2-7b-chat)

Screenshots: These are the chunks generated by the document loader.

image

n4ze3m commented 8 months ago

I will look into it! Thanks for reporting it

J3E1 commented 8 months ago

Is there any update @n4ze3m ?

n4ze3m commented 8 months ago

A fix will be released in this week's update.

J3E1 commented 8 months ago

Is this issue resolved @n4ze3m ?? 👀

n4ze3m commented 8 months ago

Hey, I tried using html-to-text, but it didn't work as expected. I'm now trying to find a better solution.

n4ze3m commented 7 months ago

Fix has been released; please feel free to reopen this issue if the problem persists