mpaepper / content-chatbot

Build a chatbot or Q&A bot of your website's content
https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your-website-using-langchain/
Site map problem #5

Open bibhas2 opened 1 year ago

bibhas2 commented 1 year ago

Looking at create_embeddings.py, it appears that the code does not crawl what it finds in the site map. It seems to use only the URLs listed directly in the site map. If the site map links to other site maps, the script does not work. See the example attached here: sitemap_index.xml.txt

This should be made clear in the README.

mpaepper commented 1 year ago

Hi @bibhas2

Do I understand correctly that you mean the script cannot parse sitemaps that in turn link to other sitemaps?

jan-koch commented 1 year ago

I'm running into the same problem, mostly with sitemaps generated by Yoast SEO in WordPress. It creates a "sitemap_index.xml" that links to individual sitemaps for each post type, and those "sub" sitemaps contain the actual web pages that need to be vectorized.
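
The nested layout described above (an index that points to sub-sitemaps, which in turn list the pages) could be handled with a small recursive walk. This is a minimal sketch, not the code from create_embeddings.py: the names `collect_page_urls` and `fetch_bytes` are hypothetical, and it assumes the sitemaps follow the sitemaps.org protocol with its standard XML namespace.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_bytes(url):
    """Hypothetical default fetcher: download a URL and return the raw body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def collect_page_urls(sitemap_url, fetch=fetch_bytes, seen=None):
    """Return page URLs from a sitemap, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        # Guard against sitemaps that reference each other.
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    pages = []
    if root.tag.endswith("sitemapindex"):
        # An index lists further sitemaps, so recurse into each of them.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                pages.extend(collect_page_urls(loc.text.strip(), fetch, seen))
    else:
        # A regular <urlset> lists the actual pages.
        for loc in root.findall("sm:url/sm:loc", NS):
            if loc.text:
                pages.append(loc.text.strip())
    return pages
```

Calling `collect_page_urls` with the URL of a "sitemap_index.xml" would return the page URLs from all of its sub-sitemaps in document order. The `fetch` parameter is there only so the function can be tried against in-memory XML without network access.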

bibhas2 commented 1 year ago

> Hi @bibhas2
>
> Do I understand correctly that you mean the script cannot parse sitemaps that in turn link to other sitemaps?

Yes, you are correct. Sorry, I should have explained it better. I am modifying the bug report to add more details.

jan-koch commented 1 year ago

@mpaepper not sure how to do a PR so sharing the updated file here.

Thanks to GPT (I'm not a Python developer), I was able to fix the problem of scraping Yoast SEO sitemaps. Here's the gist with the updated create_embeddings.py:

https://gist.github.com/jan-koch/d563d3d5e182aaee83c0ba68f5c5520a

It works on my end but as I said, I wrote most of the updated code with ChatGPT and while I can read and somewhat understand it, I'm not sure if it has any major flaws.