quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

crawler recommendation #957

Open mariusa opened 3 years ago

mariusa commented 3 years ago

Hi, the getting-started doc for the CLI relies on already having Wikipedia pages crawled in the right format. To crawl other sites, what crawler do you recommend? I've found this, but I'm not sure how to use it: https://github.com/tantivy-search/tantivy-ccrawl

Also, when indexing a .json file (assuming data is stored in multiple JSON files), does tantivy know when to

Thanks

fulmicoton commented 3 years ago

Hello! Apologies for the late answer.

Tantivy does not have any notion of a primary key, but you can add such a field and enforce uniqueness on the application side. Concretely, that means always deleting your primary key term before adding a new document. It is not cheap.

The API is called delete_term.

Common Crawl is a pre-crawled dataset. If you have a business need for an index over Common Crawl, I'd be happy to discuss it.