Tiered Storage for Indexed Data

AravindhStanley commented 1 year ago

What: Tiered storage for efficient live debugging and for long-term retention.

How: Store index data on local disk or any block storage. Periodically snapshot it to cloud storage (S3), and allow expiration for both the local index data and the snapshots.

Not aware of such features right now in Quickwit. But Elasticsearch does it very elegantly.

guilload commented 1 year ago

@AravindhStanley,

For how long would you want your hot data to remain on SSD or EBS? A few hours? A few days? For how long do you typically keep your data on S3? Do you use different storage classes?

fmassot commented 1 year ago

@AravindhStanley additional question: why do you want to store the data on local disk/block storage? Do you have many requests per second?

AravindhStanley commented 1 year ago

@guilload I would need to keep 10-15 days of data on hot storage, and 180 days on a self-hosted, S3-compatible storage on a standard tier.

@fmassot I have about 1500-3000 servers sending data at any point in time. Most of these are OS. While the number of requests per second is indeed high, it's mostly due to internal policy reasons, and compliance as well.
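
The retention requirement above (10-15 days hot, 180 days on S3-compatible storage) can be sketched as a small age-based policy. This is only an illustration of the requested behavior, not an existing Quickwit feature; `HOT_DAYS`, `ARCHIVE_DAYS`, and `tier_for_age` are hypothetical names.

```python
from datetime import timedelta

# Hypothetical thresholds taken from the requirement stated in this thread.
HOT_DAYS = 15       # upper bound of the "10-15 days" hot window
ARCHIVE_DAYS = 180  # total retention on S3-compatible storage


def tier_for_age(age: timedelta) -> str:
    """Return the tier a piece of indexed data would live in at a given age."""
    if age < timedelta(days=HOT_DAYS):
        return "hot"       # local disk or block storage
    if age < timedelta(days=ARCHIVE_DAYS):
        return "archive"   # S3-compatible object storage only
    return "expired"       # eligible for deletion


if __name__ == "__main__":
    for days in (1, 20, 200):
        print(days, tier_for_age(timedelta(days=days)))
```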

fmassot commented 1 year ago

@AravindhStanley OK. So you are sending data to the ingest endpoint, right?

On the indexing (ingest endpoint) side, Quickwit keeps the incoming data on a local disk (which can be EBS) and, every commit_timeout_secs (defaults to 1 minute), produces a "split" file that is then uploaded to the object storage. After the upload, Quickwit "publishes" the split to the metastore, and it becomes available for search.

Generally, our users want to keep data on a local disk when they have many search queries per second (hundreds of queries per second, for example).
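
A minimal sketch of that flow, written as a toy Python model rather than Quickwit's actual code. `ToyIndexer`, `tick`, and `searchable` are hypothetical names, and the sketch assumes the commit timeout is measured from the first document buffered for a split.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyIndexer:
    """Toy model: buffer locally, commit a split, upload it, then publish it."""
    commit_timeout_secs: float = 60.0
    buffer: list = field(default_factory=list)           # docs held on local disk
    object_storage: dict = field(default_factory=dict)   # split_id -> docs
    metastore: list = field(default_factory=list)        # published split ids
    batch_started: Optional[float] = None

    def ingest(self, doc: str, now: float) -> None:
        # The timeout clock starts with the first doc of a new batch.
        if not self.buffer:
            self.batch_started = now
        self.buffer.append(doc)

    def tick(self, now: float) -> Optional[str]:
        """Commit the buffered docs as a split once the timeout has elapsed."""
        if not self.buffer or now - self.batch_started < self.commit_timeout_secs:
            return None
        split_id = f"split-{len(self.object_storage)}"
        self.object_storage[split_id] = list(self.buffer)  # upload the split
        self.buffer.clear()
        self.batch_started = None
        self.metastore.append(split_id)                    # publish the split
        return split_id

    def searchable(self) -> list:
        # Only docs in published splits are visible to search.
        return [doc for sid in self.metastore for doc in self.object_storage[sid]]
```

In this model, documents ingested at t=0 and t=30 are not searchable at t=59, and become searchable once `tick(now=60)` commits and publishes the split.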

I'm still not sure why you need to keep data locally. Is it for those numerous search requests?

AravindhStanley commented 1 year ago

@fmassot We are an Elastic shop as of today, but I'm exploring Quickwit as a potential alternative.

And thank you for explaining the commit timeout. And yes, I have about 100 users who might use live logs to debug systems, and our S3 storage is designed for archival, so it's not very performant. The need to store on disk is primarily to be compliant with contractual agreements.

fmassot commented 1 year ago

@AravindhStanley I see.

There are 2 different subjects:

Search latency: Quickwit is typically sub-second even if all the data is on the object storage. You can have billions of logs and still be sub-second. It is generally sufficient for logs. Is it sufficient for your use case?
S3 costs: if you have hundreds of queries per second, you will make a high number of GET requests on S3, and the costs will add up quickly. That being said, I'm not expecting 100 users to fire 100 queries per second. Do you have an estimate of the number of queries sent in one hour, for example? Are you worried about this issue in particular?
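
A back-of-the-envelope way to reason about the S3 cost question, as a hedged sketch. The number of GET requests per query and the price per 1,000 requests are placeholder inputs, not figures from Quickwit or from any pricing page; `hourly_get_requests` and `hourly_get_cost` are hypothetical helpers.

```python
def hourly_get_requests(users: int, queries_per_user_per_hour: float,
                        gets_per_query: float) -> float:
    """Estimate object-storage GET requests issued per hour."""
    return users * queries_per_user_per_hour * gets_per_query


def hourly_get_cost(requests: float, price_per_1000_gets: float) -> float:
    """Convert a request count into a cost, given a per-1,000-request price."""
    return requests / 1000.0 * price_per_1000_gets


if __name__ == "__main__":
    # Placeholder inputs: 100 users, 10 queries each per hour, 20 GETs per query.
    requests = hourly_get_requests(100, 10, 20)
    print(requests, hourly_get_cost(requests, price_per_1000_gets=0.0004))
```

Plugging in measured values for queries per hour and GETs per query shows whether request costs matter for a given workload.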

fulmicoton commented 1 year ago

@AravindhStanley Jumping into the conversation, I hope I am not adding too much noise here.

> Not aware of such features right now in Quickwit. But Elasticsearch does it very elegantly.

Quickwit is designed in such a way that its "cold storage" performance is considerably higher than what you would expect from Elasticsearch. Especially on a self-hosted object storage like MinIO, your expectations about the difference between hot and cold storage may not apply as strongly to Quickwit.

On MinIO, for queries that take more than 100ms, I would typically expect Quickwit to outperform Elasticsearch on hot storage, and to be slower on small queries.

On Amazon S3, the latency is too high, and we only outperform ES on extremely large queries (roughly speaking >2s).

That being said, if you really need a hot tier, we are currently working on adding a cache capability to search nodes. These search nodes will store a subset of the splits locally. Would you be open to discussing your use case? A better understanding of your problem could help us design the feature correctly.

AravindhStanley commented 1 year ago

@fmassot S3 costs are not a concern, since it's self-hosted.

@fmassot and @fulmicoton Thank you so much for taking the time to explain the performance concern. I will definitely set up and do a proper POC, and it's encouraging to know it works faster than I would expect on cold storage. From a technical point of view, this seems brilliant.

I would absolutely be open to discussing my use case.

fmassot commented 1 year ago

Great @AravindhStanley! Don't hesitate to jump into our Discord server to discuss your use case.

quickwit-oss / quickwit

Tiered Storage for Indexed Data #3302