quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

About different indices and schemas #2274

Open mainrs opened 9 months ago

mainrs commented 9 months ago

My use-case is an indexing agent that indexes certain websites that a user specifies. The websites are grouped into similar technology stacks.

For example, all ***.stackoverflow.com websites are the same, all websites using Wikimedia behave the same, and so on. Each of these families might have different features and properties I would like to index. Depending on the family, it might be possible to extract more knowledge, or more structured knowledge, than from a plain website.

Other families could be audio files containing a lot of metadata I'd like to be able to query for: lyrics, year, artist. Maybe images too, with their size, timestamps, primary colors, aspect ratio, etc.

This question is a follow-up to #2221. Performance-wise, is it better to have one single, big index with sparse entries, or a separate index for each family mentioned above, with multiple readers accessing the index files simultaneously?
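
To make the two layouts concrete, here is a toy model in plain Rust. It deliberately does not use tantivy's API: `ToyIndex`, `search_shared`, `search_per_family`, and the `family` field are hypothetical names made up for illustration, and the in-memory posting lists are only a stand-in for a real index. It is a sketch of the shape of each option, not a benchmark.

```rust
use std::collections::HashMap;

/// A document is a sparse list of (field, text) pairs; absent fields are simply omitted.
type Doc = Vec<(&'static str, &'static str)>;

/// Toy inverted index: (field, lowercased term) -> ids of documents containing that term.
#[derive(Default)]
struct ToyIndex {
    postings: HashMap<(String, String), Vec<usize>>,
    num_docs: usize,
}

impl ToyIndex {
    /// Adds a document and returns its id, which is local to this index.
    fn add(&mut self, doc: &Doc) -> usize {
        let id = self.num_docs;
        self.num_docs += 1;
        for (field, text) in doc {
            for term in text.split_whitespace() {
                let list = self
                    .postings
                    .entry((field.to_string(), term.to_lowercase()))
                    .or_default();
                // Avoid duplicate ids when a term repeats within one document.
                if list.last() != Some(&id) {
                    list.push(id);
                }
            }
        }
        id
    }

    /// Returns the ids of documents whose `field` contains `term`.
    fn search(&self, field: &str, term: &str) -> Vec<usize> {
        self.postings
            .get(&(field.to_string(), term.to_lowercase()))
            .cloned()
            .unwrap_or_default()
    }
}

/// Option A: one shared index. Every document carries a `family` field,
/// and a per-family query is the term query filtered by that field.
fn search_shared(index: &ToyIndex, family: &str, field: &str, term: &str) -> Vec<usize> {
    let in_family = index.search("family", family);
    index
        .search(field, term)
        .into_iter()
        .filter(|id| in_family.contains(id))
        .collect()
}

/// Option B: one index per family. The family picks the index,
/// and only that family's fields exist in it.
fn search_per_family(
    indices: &HashMap<String, ToyIndex>,
    family: &str,
    field: &str,
    term: &str,
) -> Vec<usize> {
    indices
        .get(family)
        .map(|index| index.search(field, term))
        .unwrap_or_default()
}
```

In this model the shared index pays for an extra filter on every family-scoped query but keeps a single id space, while the per-family layout skips the filter but hands back ids that only make sense within one family's index, and the number of indices grows with the number of families. How those costs translate to a real index is exactly the open question.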

It is hard to test this without building the index beforehand, and prototyping it takes a lot of time. So I was hoping that people with more insight could give me a bit of help and advice.

I personally feel like the second approach might explode quickly if one has too many families.

PSeitz commented 8 months ago

Indexing or search performance? What type of query?

gsidhu commented 1 month ago

@mainrs did you figure out an optimal solution for your problem? I am dealing with something quite similar and would appreciate any insight you may have to offer!