vectara / vectara-ingest

An open source framework to crawl data sources and ingest into Vectara
https://vectara.com
Apache License 2.0
147 stars, 50 forks

Update crawler extraction #103

Closed: ofermend closed this 4 months ago

ofermend commented 4 months ago

Updates how the indexer extracts text from HTML: it now uses Playwright's built-in functionality.

Also includes a fix to the Notion crawler for better handling of titles.
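
For readers unfamiliar with Playwright-based text extraction, here is a minimal sketch of the general idea. It is not the code from this PR: the function names `extract_visible_text` and `extract_with_playwright` are hypothetical, and it assumes that the "internal functionality" refers to something like Playwright's `page.inner_text`, which returns the rendered text of an element.

```python
def extract_visible_text(page, html: str) -> str:
    """Load HTML into a Playwright page and return the rendered body text.

    `page` is expected to be a playwright.sync_api.Page (or any object with
    the same set_content / inner_text methods).
    """
    page.set_content(html)
    # inner_text returns the element's rendered text, so markup is dropped
    # and hidden elements are skipped.
    return page.inner_text("body").strip()


def extract_with_playwright(html: str) -> str:
    """Hypothetical wrapper: start a headless Chromium and extract the text."""
    # Imported lazily so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            return extract_visible_text(page, html)
        finally:
            browser.close()
```

Calling `extract_with_playwright("<h1>Title</h1><p>Body</p>")` requires Playwright and its Chromium build to be installed (`pip install playwright`, then `playwright install chromium`).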