vectara / vectara-ingest

An open source framework to crawl data sources and ingest into Vectara
https://vectara.com
Apache License 2.0
147 stars, 50 forks

Update website_crawler.py to keep clean urls #114

Closed. nespera closed this 2 months ago

nespera commented 2 months ago

It looks like the website crawler normalizes the URLs it gathers but then immediately throws them away. This change keeps the normalized list.
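
To illustrate the kind of problem described above, here is a minimal sketch and not the actual code from website_crawler.py. The names `clean_url`, `gather_discarding`, and `gather_keeping` are hypothetical, and the normalization rules (lowercase host, drop the fragment, strip a trailing slash) are assumptions chosen only for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url: str) -> str:
    """Hypothetical normalizer: lowercase the host, drop the fragment,
    and strip a trailing slash from the path (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def gather_discarding(urls):
    """Pattern described in the comment: each URL is normalized,
    but the result is never stored, so the raw URLs are returned."""
    for url in urls:
        clean_url(url)  # return value is thrown away
    return list(urls)


def gather_keeping(urls):
    """Pattern after the change: the normalized list is what gets returned."""
    return [clean_url(url) for url in urls]


raw = [
    "https://Example.com/docs/",
    "https://example.com/docs#intro",
]

discarded = gather_discarding(raw)
kept = gather_keeping(raw)
```

With the first function the caller still sees the two distinct raw strings; with the second, both entries collapse to the same normalized form, which is what later steps such as deduplication would need.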