Open fulmicoton opened 2 years ago
This sounds interesting but I do not have any experience with HDFS or siblings. Some assumptions I would like to have challenged:
- One could call this a distributed split file system
- Every quickwit service process in a cluster will declare over scuttlebutt whether it is contributing to the distributed filesystem and what files are available
- Indexers will not be modified, indexers will not write out to the search nodes
- The filesystem process will only serve files its node is responsible for searching, according to rendezvous hashing
- Disk space quota or high watermark behavior is out of scope
- The fast+fieldnorm+idx+pos+store+term files can be partially accessed with something akin to get_slice
One could call this a distributed split file system
There are many names to describe distributed storages depending on their semantics. The file system is usually the hardest, as it organizes the things being stored into a tree, allows modification, etc.
Here we are shooting for something closer to the semantics of a simplified object storage... The things we store are immutable, and there is no real directory structure. I say simplified because people usually associate object storage with striping, erasure coding, etc. We don't need any of that.
Indexers will not be modified, indexers will not write out to the search nodes
Indexers currently upload their result to a storage interface. This ticket is about adding a new implementation of that storage interface. The storage nodes will be conveniently located on the search nodes as an optimization, but this is not a necessity.
Every quickwit service process in a cluster will declare over scuttlebutt whether it is contributing to the distributed filesystem and what files are available.
I don't think scuttlebutt is necessarily the best place to keep the list of available files. I'd rather keep this in the metastore.
The filesystem process will only serve files its node is responsible for searching, according to rendezvous hashing
No. A node should be able to serve the files it hosts to anyone via a service. The rendez-vous hashing is an optimization. We do not assume that the server processing the data is ALWAYS the server storing the data. Ideally we should write the Storage in such a way that if the data is served locally, we can bypass the RAM cache and return a slice of an mmap object directly.
Disk space quota or high watermark behavior is out of scope
Disk space quota is out of scope. I am not sure what high watermark means in this specific context.
The fast+fieldnorm+idx+pos+store+term files can be partially accessed with something akin to get_slice
I don't understand the question.
Awesome, thank you for the clarification.
Disk space quota is out of scope. I am not sure what high watermark means in this specific context.
High watermark is a common operational concern with Elasticsearch: it indicates a node is leaving its configured "comfort zone" for storage. Elasticsearch will take steps to move data to get back below the low watermark. In Quickwit's case I imagine we would just evict files, since they are in durable index storage. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#disk-based-shard-allocation
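To make the evict-only idea concrete, here is a minimal self-contained sketch of watermark-based eviction. The function, its parameters, and the LRU policy are hypothetical illustrations, not an existing Quickwit feature:

```rust
/// Hypothetical sketch: once disk usage crosses the high watermark,
/// evict least-recently-used cached split files until usage falls
/// below the low watermark. Eviction is safe here because every split
/// also lives in durable object storage.
fn evict_to_low_watermark(
    mut cached: Vec<(String, u64, u64)>, // (split_id, size_bytes, last_access)
    disk_capacity: u64,
    high_watermark: f64, // e.g. 0.90
    low_watermark: f64,  // e.g. 0.85
) -> Vec<String> {
    let mut used: u64 = cached.iter().map(|(_, size, _)| size).sum();
    let mut evicted = Vec::new();
    if (used as f64) < disk_capacity as f64 * high_watermark {
        return evicted; // still in the comfort zone, nothing to do
    }
    // Evict the files with the oldest access time first.
    cached.sort_by_key(|&(_, _, last_access)| last_access);
    for (split_id, size, _) in cached {
        if (used as f64) <= disk_capacity as f64 * low_watermark {
            break;
        }
        used -= size;
        evicted.push(split_id);
    }
    evicted
}
```

Unlike Elasticsearch, no data has to be relocated: the evicted splits can always be re-fetched from the durable storage tier on demand.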
I don't understand the question.
For some reason I thought this DFS would require a new storage interface. Your explanation has clarified that. Would a process needing my-split.fast also want the other sibling files? Some kind of prefetch?
Let me walk through a thought exercise on how the DFS would work with some pseudocode. Take this storage trait:
pub trait Storage: Send + Sync + 'static {
    async fn get_slice(&self, path: &Path, range: Range<usize>) -> StorageResult<OwnedBytes>;
}
A user of this storage would write code something like:
use std::path::Path;

use quickwit_storage::Storage;

let storage = DistributedFileSystemStorage::init();
let slice = storage
    .get_slice(Path::new("my-index/095d28c3e4e24dc88e44ae828e5ced00.fast"), 0..100)
    .await?;
and inside of DistributedFileSystemStorage::get_slice it would check if the file is local, an easy short-circuit. If the file is not local, the storage implementation will query the metastore to find the best host with this file. We will use scuttlebutt to determine if the host is online/part of the cluster. The storage implementation will then use the grpc interface to stream the slice down locally to a tmp file, mmap it, and return it to the caller.
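That lookup flow can be sketched as a small self-contained simulation. LocalStore, Metastore, and the remote_files map below are hypothetical stand-ins for the mmap-backed local cache, the metastore client, and the gRPC call; none of them are actual Quickwit APIs:

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for the node-local file cache
/// (in a real implementation this would be mmap-backed).
struct LocalStore {
    files: HashMap<String, Vec<u8>>,
}

/// Hypothetical stand-in for the metastore: maps each file
/// to the hosts that store a replica of it.
struct Metastore {
    placement: HashMap<String, Vec<String>>,
}

fn get_slice(
    local: &mut LocalStore,
    metastore: &Metastore,
    live_nodes: &[String], // cluster membership, as known from scuttlebutt
    remote_files: &HashMap<(String, String), Vec<u8>>, // (host, path) -> bytes, stands in for gRPC
    path: &str,
    range: Range<usize>,
) -> Result<Vec<u8>, String> {
    // 1. Easy short-circuit: the file is already local.
    if let Some(bytes) = local.files.get(path) {
        return Ok(bytes[range.clone()].to_vec());
    }
    // 2. Ask the metastore which hosts have the file; keep only live ones.
    let hosts = metastore
        .placement
        .get(path)
        .ok_or_else(|| "unknown file".to_string())?;
    let host = hosts
        .iter()
        .find(|h| live_nodes.contains(*h))
        .ok_or_else(|| "no live host for this file".to_string())?;
    // 3. Fetch the file (a real implementation would stream it down to a
    //    tmp file over gRPC and mmap it), then cache it locally.
    let bytes = remote_files
        .get(&(host.clone(), path.to_string()))
        .ok_or_else(|| "remote fetch failed".to_string())?
        .clone();
    local.files.insert(path.to_string(), bytes.clone());
    Ok(bytes[range].to_vec())
}
```

After the first call for a given file, every subsequent get_slice on it is served from the local copy without touching the metastore or the network.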
For some reason I thought this DFS would require a new storage interface. Your explanation has clarified that. Would a process needing my-split.fast also want the other sibling files? Some kind of prefetch?
One split is just one bundle file, so the storage will actually never receive requests for a .fast file. The bundle file embeds several files, and a small footer with filesystem metadata. The storage layer does not care about that part: it just needs to be able to serve, given a range, partial slices of the split files. The client will be in charge of requesting the right amount of data.
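To illustrate why ranged reads are all the storage layer needs, here is a toy bundle format: concatenated file bodies, a text footer mapping name to (offset, len), and the footer length in the last 8 bytes. The real Quickwit bundle layout differs; this only shows the access pattern:

```rust
use std::collections::HashMap;

/// Build a toy bundle: file bodies back to back, a text footer,
/// and the footer length as 8 trailing little-endian bytes.
fn build_bundle(files: &[(&str, &[u8])]) -> Vec<u8> {
    let mut bundle = Vec::new();
    let mut footer = String::new();
    for (name, body) in files {
        footer.push_str(&format!("{}:{}:{}\n", name, bundle.len(), body.len()));
        bundle.extend_from_slice(body);
    }
    let footer_bytes = footer.as_bytes();
    bundle.extend_from_slice(footer_bytes);
    bundle.extend_from_slice(&(footer_bytes.len() as u64).to_le_bytes());
    bundle
}

/// Extract one inner file using only ranged reads, which is exactly
/// what a remote `get_slice` gives the client.
fn read_inner_file(bundle: &[u8], name: &str) -> Option<Vec<u8>> {
    // Ranged read 1: the last 8 bytes give the footer length.
    let len_bytes: [u8; 8] = bundle[bundle.len() - 8..].try_into().ok()?;
    let footer_len = u64::from_le_bytes(len_bytes) as usize;
    // Ranged read 2: the footer itself.
    let footer_start = bundle.len() - 8 - footer_len;
    let footer = std::str::from_utf8(&bundle[footer_start..bundle.len() - 8]).ok()?;
    let meta: HashMap<&str, (usize, usize)> = footer
        .lines()
        .filter_map(|line| {
            let mut parts = line.splitn(3, ':');
            Some((
                parts.next()?,
                (parts.next()?.parse().ok()?, parts.next()?.parse().ok()?),
            ))
        })
        .collect();
    // Ranged read 3: the inner file body.
    let &(offset, len) = meta.get(name)?;
    Some(bundle[offset..offset + len].to_vec())
}
```

The client reads the footer once, then issues one ranged read per inner file it actually needs, so sibling files are never fetched unless requested.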
and inside of DistributedFileSystemStorage::get_slice it would check if the file is local, an easy short-circuit. If the file is not local, the storage implementation will query the metastore to find the best host with this file. We will use scuttlebutt to determine if the host is online/part of the cluster. The storage implementation will then use the grpc interface to stream the slice down locally to a tmp file, mmap it, and return it to the caller.
Exactly.
Let me share my expectations for Quickwit's distributed storage.
Two layers of storage, distributed storage (SSD/EBS) and S3, allow Quickwit to distinguish hot data from cold data.
This idea comes from Anna's multi-tier storage. http://www.vldb.org/pvldb/vol12/p624-wu.pdf
I use Garage for Quickwit's backend storage, and it works well. Consider embedding it. https://garagehq.deuxfleurs.fr/
@yuuhhe ah cool! You are actually the second person telling us they use Garage for quickwit.
Quickwit uses a Storage abstraction to store its data. While the entire system has been designed to work with an object storage, there are some benefits to offering a local filesystem storage: lower latency, and it is always nice to have that kind of batteries included. Currently Quickwit offers such a file storage implementation, but this implementation does not work in a distributed environment. We would like a simple replicated file storage.
Details
Any node should be able to access a file that is on one of the other nodes of the cluster through gRPC.
Leaf nodes will host this data. When distributing a search request amongst the available leaf nodes, we currently use rendez-vous hashing to define some affinity between a split and a node.
We want the file placement logic to also pick the 2 replicas with the highest affinity. (Let's start with a hardcoded number of replicas.)
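To make the affinity-based placement concrete, here is a minimal self-contained sketch of rendez-vous (highest-random-weight) hashing. The function names and the use of DefaultHasher are illustrative, not Quickwit's actual implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Affinity of a (split, node) pair: hash both ids together.
/// The higher the score, the stronger the affinity.
fn affinity(split_id: &str, node_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    split_id.hash(&mut hasher);
    node_id.hash(&mut hasher);
    hasher.finish()
}

/// Pick the `num_replicas` nodes with the highest affinity for a split.
/// When a node joins or leaves, only splits whose top-`num_replicas`
/// set actually changes need to move: that is the point of
/// rendez-vous hashing.
fn pick_replicas<'a>(split_id: &str, nodes: &[&'a str], num_replicas: usize) -> Vec<&'a str> {
    let mut scored: Vec<(&'a str, u64)> = nodes
        .iter()
        .map(|&node| (node, affinity(split_id, node)))
        .collect();
    // Highest score first; break ties on the node id for determinism.
    scored.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    scored
        .into_iter()
        .take(num_replicas)
        .map(|(node, _)| node)
        .collect()
}
```

If search routing and file placement share the same affinity function, a search request usually lands on a node that already stores the split, and the gRPC path is only the fallback.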
The metastore API can be extended to deal with the new metadata.
Pitfalls
Raft etc. is not required here. We assume consistency is dealt with by the metastore API.
The solution needs however to ensure that if a put operation is not entirely successful (e.g. files could not be pushed to two replicas), we never end up with "dangling files" that never get deleted.
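One possible way to guarantee the no-dangling-files property is to register the file in the metastore only after every replica upload succeeded, and to roll back the copies already written on partial failure. A toy simulation of that idea follows; all types and names are hypothetical stand-ins, not Quickwit APIs:

```rust
use std::collections::HashSet;

/// Hypothetical sketch: upload to every replica first, publish to the
/// metastore last. On any failure, delete the copies already pushed so
/// no unreferenced file survives the failed put.
fn replicated_put(
    replicas: &[&str],
    failing_nodes: &HashSet<&str>,        // stand-in for real upload errors
    uploaded: &mut Vec<(String, String)>, // (node, file): simulated disk state
    metastore: &mut Vec<String>,          // simulated list of registered files
    file: &str,
) -> Result<(), String> {
    let mut written: Vec<String> = Vec::new();
    for node in replicas {
        if failing_nodes.contains(node) {
            // Roll back: delete the copies we already pushed, so nothing dangles.
            uploaded.retain(|(n, f)| !(written.contains(n) && f.as_str() == file));
            return Err(format!("upload to {} failed", node));
        }
        uploaded.push((node.to_string(), file.to_string()));
        written.push(node.to_string());
    }
    // Publish only once every replica holds the file.
    metastore.push(file.to_string());
    Ok(())
}
```

A real implementation would also need a garbage-collection pass for the case where the uploader itself dies mid-rollback, but the publish-last ordering keeps the metastore free of entries pointing at incomplete replica sets.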
Out of scope