pinecone-io / pinecone-vercel-starter

Pinecone + Vercel AI SDK Starter
https://pinecone-vercel-example.vercel.app

Update crawler.ts: Add hostname check to keep crawler on the same domain. #15

Open dougwithseismic opened 11 months ago

dougwithseismic commented 11 months ago

Problem

The existing crawler logic adds every URL it finds to the queue without checking whether it belongs to the same domain as the starting URL. As a result, the crawler can wander off to unrelated domains, most commonly through social links, which is usually outside the intended scope of the crawl.

Solution

Introduced a hostname check in the addNewUrlsToQueue function so that only URLs with the same hostname as the starting URL are added to the crawl queue. This keeps the crawler within the initial domain and makes the crawl more focused and efficient.
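For illustration, the hostname check could look roughly like the sketch below. This is not the actual diff from crawler.ts: the helper name `filterSameHostname` and its signature are hypothetical, and it assumes the found links arrive as plain strings that may be relative.

```typescript
// Hypothetical helper (not the actual crawler.ts code): keep only links
// whose hostname exactly matches the hostname of the starting URL.
function filterSameHostname(startUrl: string, foundUrls: string[]): string[] {
  const startHostname = new URL(startUrl).hostname;
  const kept: string[] = [];

  for (const candidate of foundUrls) {
    try {
      // Resolve relative links (e.g. "/about") against the starting URL.
      const resolved = new URL(candidate, startUrl);
      if (resolved.hostname === startHostname) {
        kept.push(resolved.href);
      }
    } catch {
      // Ignore strings that cannot be parsed as URLs.
    }
  }

  return kept;
}

// Example: the external social link and the mailto link are dropped.
const sameDomainLinks = filterSameHostname("https://example.com/docs", [
  "/about",
  "https://example.com/blog",
  "https://twitter.com/example",
  "mailto:hi@example.com",
]);
console.log(sameDomainLinks);
```

Note that an exact hostname comparison treats `www.example.com` and `example.com` as different hosts, so a crawl started on one will not follow links to the other.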

Type of Change

New feature (non-breaking change which adds functionality)

Test Plan

Run the crawler on a starting URL whose pages contain both internal and external links. Verify that only URLs with the same hostname as the starting URL are added to the queue. Optionally, print the queue or log each added URL to confirm that every entry comes from the same hostname.
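One way to automate the verification step is sketched below. The function `findOffDomainUrls` is a hypothetical name, and the sketch assumes the queue can be read out as an array of absolute URL strings, which may not match how the crawler stores it.

```typescript
// Hypothetical verification helper: return every queued URL whose hostname
// differs from the starting URL's hostname. Assumes absolute URLs.
function findOffDomainUrls(startUrl: string, queuedUrls: string[]): string[] {
  const expectedHostname = new URL(startUrl).hostname;
  return queuedUrls.filter(
    (queued) => new URL(queued).hostname !== expectedHostname
  );
}

// Example with a made-up queue snapshot containing one stray external link.
const queueSnapshot = [
  "https://example.com/",
  "https://example.com/pricing",
  "https://github.com/example",
];
const offenders = findOffDomainUrls("https://example.com/", queueSnapshot);

if (offenders.length === 0) {
  console.log("All queued URLs share the starting hostname.");
} else {
  console.log("Off-domain URLs found in queue:", offenders);
}
```

An empty result means the hostname check held for that crawl; any listed URL points to a link that slipped through.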

dougwithseismic commented 11 months ago

Tested and works like a charm. I'd also like to extend the crawler to take an optional regex. What do you think?
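A rough sketch of what that could look like, assuming the regex is meant to filter which same-host URLs get queued by matching against the URL path. The function name `shouldCrawl`, its parameters, and the choice to match on the path are all assumptions for discussion, not part of this PR.

```typescript
// Hypothetical sketch of an optional regex filter layered on the hostname check.
function shouldCrawl(
  candidate: string,
  startUrl: string,
  pathPattern?: RegExp
): boolean {
  const url = new URL(candidate, startUrl);

  // Same-hostname rule from this PR.
  if (url.hostname !== new URL(startUrl).hostname) {
    return false;
  }

  // No pattern supplied: behave exactly like the hostname-only check.
  if (!pathPattern) {
    return true;
  }

  // Reset lastIndex so a regex with the global flag gives stable results.
  pathPattern.lastIndex = 0;
  return pathPattern.test(url.pathname);
}

// Example: only crawl pages under /docs/ on the starting domain.
const docsOnly = /^\/docs\//;
console.log(shouldCrawl("/docs/intro", "https://example.com", docsOnly));
console.log(shouldCrawl("/blog/post", "https://example.com", docsOnly));
```

Keeping the pattern optional means existing callers would see no change in behavior unless they pass one.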