quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Distributed Storage. #1150

Open fulmicoton opened 2 years ago

fulmicoton commented 2 years ago

Quickwit uses a Storage abstraction to store its data. While the entire system has been designed to work with an object storage, there are some benefits to offering a local filesystem storage: lower latency, and it is always nice to have that kind of batteries included.

Currently Quickwit offers such a file storage implementation, but this implementation does not work in a distributed environment. We would like a simple replicated file storage.

Details

Any node should be able to access a file that is hosted on one of the other nodes of the cluster through gRPC.

Leaf nodes will host this data. When distributing a search request amongst the available leaf nodes, we currently use rendez-vous hashing to define some affinity between a split and a node.

We want the file placement logic to also pick the 2 replicas with the highest affinity. (Let's start with a hardcoded number of replicas.)
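To make that concrete, here is a minimal sketch of rendezvous hashing picking the replicas; the hash function, node identifiers and function name are illustrative, not the actual Quickwit code:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Scores every node against the split and keeps the `num_replicas` nodes with the
/// highest scores (highest-random-weight / rendezvous hashing).
fn pick_replicas(split_id: &str, nodes: &[String], num_replicas: usize) -> Vec<String> {
    let mut scored: Vec<(u64, &String)> = nodes
        .iter()
        .map(|node| {
            let mut hasher = DefaultHasher::new();
            (split_id, node).hash(&mut hasher);
            (hasher.finish(), node)
        })
        .collect();
    // Highest affinity first; the first `num_replicas` nodes host the split.
    scored.sort_by(|a, b| b.0.cmp(&a.0));
    scored.into_iter().take(num_replicas).map(|(_, node)| node.clone()).collect()
}

Because the score only depends on the (split, node) pair, the placement logic and the search affinity agree on the same nodes as long as both see the same node list.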

The metastore API can be extended to deal with the new metadata.
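One possible shape for that extension, purely as an illustration (none of these methods exist in the metastore today):

use async_trait::async_trait;

/// Hypothetical metastore additions to track which nodes host a copy of a split file.
#[async_trait]
pub trait ReplicaMetastore: Send + Sync {
    /// Records that `node_id` hosts (or is about to receive) a replica of `split_id`.
    async fn add_split_replica(&self, split_id: &str, node_id: &str) -> anyhow::Result<()>;
    /// Lists the nodes currently known to host a replica of `split_id`.
    async fn list_split_replicas(&self, split_id: &str) -> anyhow::Result<Vec<String>>;
    /// Drops a replica record, e.g. when a node leaves or a file is garbage collected.
    async fn remove_split_replica(&self, split_id: &str, node_id: &str) -> anyhow::Result<()>;
}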

Pitfalls

Raft etc. is not required here. We assume consistency is handled by the metastore API.

However, the solution needs to ensure that if a put operation is not entirely successful (e.g. the files could not be pushed to both replicas), we never end up with "dangling files" that never get deleted.
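One way to honor that constraint, sketched against the hypothetical ReplicaMetastore above: record each replica in the metastore before pushing any byte, so a failed upload is always discoverable and can be reclaimed by a cleanup task instead of dangling. The upload callback below is a placeholder for the actual gRPC push.

use std::future::Future;

/// Hypothetical two-phase put; `upload` stands in for the real push to one node.
async fn replicated_put<F, Fut>(
    metastore: &dyn ReplicaMetastore,
    replicas: &[String],
    split_id: &str,
    upload: F,
) -> anyhow::Result<()>
where
    F: Fn(String) -> Fut,
    Fut: Future<Output = anyhow::Result<()>>,
{
    for node_id in replicas {
        // 1. Record the intent first: the metastore now knows this replica may exist.
        metastore.add_split_replica(split_id, node_id).await?;
        // 2. Push the bytes. If this fails, the record above lets a cleanup task find
        //    and delete the partial replica rather than leaving it dangling forever.
        upload(node_id.clone()).await?;
    }
    Ok(())
}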

Out of scope

xrl commented 2 years ago

This sounds interesting but I do not have any experience with HDFS or siblings. Some assumptions I would like to have challenged:

fulmicoton commented 2 years ago

One could call this a distributed split file system

There are many names for distributed storage systems depending on their semantics. A file system is usually the hardest, as it organizes the things being stored into a tree, allows modification, etc.

Here we are shooting for something closer to the semantics of a simplified object storage... The things we store are immutable, and there is no real directory structure. I say simplified because people usually associate object storage with striping, erasure coding, etc. We don't need any of that.

Indexers will not be modified, indexers will not write out to the search nodes

Indexers currently upload their results to a storage interface. This ticket is about adding a new storage implementation. The storage nodes will conveniently be located on the search nodes as an optimization, but this is not a necessity.

Every quickwit service process in a cluster will declare over scuttlebutt whether it is contributing to the distributed filesystem and what files are available.

I don't think scuttlebutt is necessarily the best place to keep the list of available files. I'd rather keep this in the metastore.

The filesystem process will only serve files its node is responsible for searching, according to rendezvous hashing

No. A node should be able to serve the files it hosts to anyone via a service. The rendez-vous hashing is an optimization. We do not assume that the server processing the data is ALWAYS the server storing the data. Ideally we should write the Storage in such a way that if the data is served locally, we can bypass the RAM cache and return a slice of an mmap object directly.
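A minimal sketch of that local short-circuit, assuming the memmap2 crate and the OwnedBytes type re-exported by quickwit_storage; a true zero-copy variant would keep the mmap alive inside the returned buffer instead of copying:

use std::ops::Range;
use std::path::Path;

use memmap2::Mmap;
use quickwit_storage::OwnedBytes;

/// Serves a slice of a locally hosted, immutable split file straight from an mmap.
fn get_slice_local(local_path: &Path, range: Range<usize>) -> anyhow::Result<OwnedBytes> {
    let file = std::fs::File::open(local_path)?;
    // Safety: split files are written once and never modified, so the mapping is stable.
    let mmap = unsafe { Mmap::map(&file)? };
    // For clarity this copies the requested range; a production version would avoid
    // the copy by returning a view that keeps the `Mmap` alive.
    Ok(OwnedBytes::new(mmap[range].to_vec()))
}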

Disk space quota or high watermark behavior is out of scope

Disk space quota is out of scope. I am not sure what high watermark means in this specific context.

The fast+fieldnorm+idx+pos+store+term files can be partially accessed with something akin to get_slice

I don't understand the question.

xrl commented 2 years ago

Awesome, thank you for the clarification.

Disk space quota is out of scope. I am not sure what high watermark means in this specific context.

High watermark is a common operational issue with Elasticsearch: it indicates a node is leaving its configured "comfort zone" for storage. Elasticsearch will take steps to move data to get back below the low watermark. In Quickwit's case I imagine we would just evict files, since they are already in durable index storage. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#disk-based-shard-allocation
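A sketch of what that eviction could look like, assuming local copies are pure caches of splits that also live in the durable index storage; the disk_usage_ratio probe and the watermark values are placeholders:

use std::fs;
use std::path::PathBuf;

/// Hypothetical watermark-based eviction: once usage crosses the high watermark,
/// delete the least recently accessed local copies until usage drops below the low
/// watermark. Safe only because every local file also exists in durable storage.
fn evict_to_low_watermark(
    mut local_files: Vec<(PathBuf, std::time::SystemTime)>, // (path, last access time)
    disk_usage_ratio: impl Fn() -> f64,
    high_watermark: f64,
    low_watermark: f64,
) -> std::io::Result<()> {
    if disk_usage_ratio() < high_watermark {
        return Ok(());
    }
    // Oldest access first.
    local_files.sort_by_key(|(_, last_access)| *last_access);
    for (path, _) in local_files {
        if disk_usage_ratio() < low_watermark {
            break;
        }
        fs::remove_file(&path)?;
    }
    Ok(())
}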

I don't understand the question.

For some reason I thought this DFS would require a new storage interface. Your explanation has clarified that. Would a process needing my-split.fast also want the other sibling files? Some kind of prefetch?

Let me walk through a thought exercise on how the DFS would work, with some pseudocode. Take this edited Storage trait:

// The real trait is larger; `async_trait` is assumed so the method can be async.
#[async_trait]
pub trait Storage: Send + Sync + 'static {
    /// Returns the bytes in `range` of the file at `path`.
    async fn get_slice(&self, path: &Path, range: Range<usize>) -> StorageResult<OwnedBytes>;
}

A user of this storage would write code something like:

use std::path::Path;
use quickwit_storage::Storage;

let storage = DistributedFileSystemStorage::init();
let path = Path::new("my-index/095d28c3e4e24dc88e44ae828e5ced00.fast");
let slice = storage.get_slice(path, 0..100).await?;

and inside of DistributedFileSystemStorage::get_slice it would check if the file is local: an easy short-circuit. If the file is not local, the storage implementation will query the metastore to find the best host with this file. We will use scuttlebutt to determine if the host is online and part of the cluster. The storage implementation will then use the gRPC interface to stream the slice down locally to a tmp file, mmap it, and return it to the caller.
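To pin that flow down, here is a sketch in which every helper method is hypothetical (local lookup, metastore query, scuttlebutt liveness check, gRPC fetch); only the ordering inside get_slice mirrors the description above:

use std::ops::Range;
use std::path::{Path, PathBuf};

use async_trait::async_trait;
use quickwit_storage::OwnedBytes;

/// All required methods are placeholders for real components; the provided
/// `get_slice` only encodes the resolution order discussed in this thread.
#[async_trait]
pub trait DfsResolver: Send + Sync {
    fn local_path(&self, path: &Path) -> Option<PathBuf>;
    fn read_local_slice(&self, local: &Path, range: Range<usize>) -> anyhow::Result<OwnedBytes>;
    async fn hosts_from_metastore(&self, path: &Path) -> anyhow::Result<Vec<String>>;
    fn is_alive(&self, host: &str) -> bool;
    async fn fetch_over_grpc(&self, host: &str, path: &Path, range: Range<usize>)
        -> anyhow::Result<OwnedBytes>;

    async fn get_slice(&self, path: &Path, range: Range<usize>) -> anyhow::Result<OwnedBytes> {
        // 1. Short-circuit when the file is already on this node.
        if let Some(local) = self.local_path(path) {
            return self.read_local_slice(&local, range);
        }
        // 2. Ask the metastore which nodes host the file, keep only the live ones
        //    according to the cluster membership gossiped over scuttlebutt.
        let host = self
            .hosts_from_metastore(path)
            .await?
            .into_iter()
            .find(|h| self.is_alive(h))
            .ok_or_else(|| anyhow::anyhow!("no live host for {}", path.display()))?;
        // 3. Stream the slice over gRPC and hand it back to the caller.
        self.fetch_over_grpc(&host, path, range).await
    }
}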

fulmicoton commented 2 years ago

For some reason I thought this DFS would require a new storage interface. Your explanation has clarified that. Would a process needing my-split.fast also want the other sibling files? Some kind of prefetch?

One split is just one bundle file, so the storage will actually never receive requests for a .fast file. The bundle file embeds several files and a small footer with filesystem metadata; the storage level of abstraction does not care about that part. The storage layer just needs to be able to serve, given a range, partial slices of the split files. The client will be in charge of requesting the right amount of data.
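As an illustration of what "requesting the right amount of data" means on the client side, here is a hypothetical view of the footer metadata; the real bundle format is a Quickwit implementation detail:

use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical footer view: a map from each embedded file name to its byte range
/// inside the bundle. The client, not the storage, turns "bytes 0..100 of the .fast
/// file" into a range request over the single bundle file.
struct BundleFooter {
    files: HashMap<String, Range<usize>>,
}

impl BundleFooter {
    /// Maps a range inside an embedded file to the corresponding range in the bundle.
    fn bundle_range(&self, file_name: &str, range_in_file: Range<usize>) -> Option<Range<usize>> {
        let file_range = self.files.get(file_name)?;
        let start = file_range.start + range_in_file.start;
        let end = file_range.start + range_in_file.end;
        (end <= file_range.end).then(|| start..end)
    }
}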

and inside of DistributedFileSystemStorage::get_slice it would check if the file is local: an easy short-circuit. If the file is not local, the storage implementation will query the metastore to find the best host with this file. We will use scuttlebutt to determine if the host is online and part of the cluster. The storage implementation will then use the gRPC interface to stream the slice down locally to a tmp file, mmap it, and return it to the caller.

Exactly.

sunisdown commented 2 years ago

Let me share my expectations for Quickwit's distributed storage.

  1. For the indexer: it could write data to the distributed storage more frequently, bringing the indexer closer to a real-time stream. The distributed storage could then merge splits and publish the merged split to S3.
  2. For the searcher: it could load the latest data from the distributed storage and the older data from S3. This way, the new data sits in distributed storage where it can be read quickly, while the cold data stays in S3, which is slower but cost-effective to read.

Two layers of storage, distributed storage (SSD/EBS) and S3, allow Quickwit to distinguish hot data from cold data.

This idea comes from Anna's multi-tier storage: http://www.vldb.org/pvldb/vol12/p624-wu.pdf
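A minimal sketch of that two-tier read path, assuming the trimmed-down Storage trait quoted earlier in this thread and quickwit_storage's OwnedBytes/StorageResult re-exports; how splits move between tiers (indexers writing to the hot tier, merges publishing to S3) is left out:

use std::ops::Range;
use std::path::Path;

use quickwit_storage::{OwnedBytes, Storage, StorageResult};

/// Hypothetical two-tier storage: fresh splits in the distributed storage (SSD/EBS),
/// everything also durable in S3. Reads try the hot tier first and fall back to S3.
struct TieredStorage {
    hot: Box<dyn Storage>,  // the distributed storage described in this issue
    cold: Box<dyn Storage>, // S3: slower to read, but cheap and durable
}

impl TieredStorage {
    async fn get_slice(&self, path: &Path, range: Range<usize>) -> StorageResult<OwnedBytes> {
        match self.hot.get_slice(path, range.clone()).await {
            Ok(bytes) => Ok(bytes),
            // Not in the hot tier (or the hot tier is unavailable): read from S3.
            Err(_) => self.cold.get_slice(path, range).await,
        }
    }
}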

yuuhhe commented 1 year ago

I use Garage as Quickwit's backend storage and it works well. Consider embedding it. https://garagehq.deuxfleurs.fr/

fulmicoton commented 1 year ago

@yuuhhe Ah, cool! You are actually the second person telling us they use Garage for Quickwit.