efficient hash calculation

jonburdo commented 2 years ago

Because LDB uses content-addressable storage, it needs a hash for every data object it indexes. In alpha, LDB uses md5. The simplest approach is to have ldb read each data object's contents during indexing to calculate the hash, but this is a lot of data transfer. For beta, it would be good to improve on this.

Using dvc objects and filesystem code (dvc data) would solve at least some of these challenges.

Some relevant differences LDB has from DVC:

Typical storage locations may have a higher number of files than many DVC-only projects track (easily up to millions)
File sizes of data objects in LDB are more predictable and probably less varied, since image files are a very common use-case (around 5-500kb each)
Most data objects are not needed locally right away. Those that are instantiated locally are often short-lived, since any given dataset may only be needed for a single experiment. So caching everything is not always feasible and not always useful.
LDB already deals with annotations, treating them differently from data objects. Annotation get copied into the ldb index, so we're really concerned about optimized data object hash calculations.

Things that could help:

Storing other checksums like e-tags for data objects in ldb index. If we find a file in s3 with the same e-tag as a file in our index, we use the hash associated with the file in the index.
AWS lambdas or similar
Annotations can be used for ldb-specific metadata. So we could theoretically support reading pre-calculated hashes from a certain section of annotations or a global metadata file (but how useful would this be and how reliable?)
ldb export from one instance to another

Repeated indexing

For any use-case that involves repeated indexing, LDB should be able to identify which data objects have already been indexed and avoid re-calculating the hash. LDB storage locations are assumed to be maintained as append-only storage locations (existing files may not be modified, but new files may be added), so aside from certain metadata, such as access time, nothing should change on already-indexed files. So assuming the storage location is not corrupted, this would only require ldb to check file paths against the paths it has indexed. (LDB alpha could do this, but not efficiently if the index is large.)

Possible use cases

Case 1: cloud to local LDB instance

A user indexes many images in an s3 bucket with a new LDB instance on their laptop. They have a slow internet connection. They don't plan to ever download most of the images, but want to be able to filter/query on their annotations and metadata.

This is where a lambda or just an ec2 instance would help - anything with a faster download speed, can stream data object contents and calculate the hash, then just pass the hash to the user. LDB would need to support this flow though.

Case 2: multiple LDB instances on shared storage location

Multiple teams within an organization each want to maintain their own LDB instance. Some of the storage locations they plan to index are the same.

This is where a method for calculating hashes only once for a shared storage location would help. There could be a couple options for this:

Create one large LDB instance that indexes all shared data. If LDB supports easily exporting to another instance, then the other LDB instances can read import from this shared LDB instance instead of indexing storage directly.
Add support for saving hashes into annotations or a separate metadata file in the storage location that can be read next time an ldb instance indexes.

Case 3: public dataset registry

An s3 bucket with public read-access is shared for anyone to use as an LDB storage location. Many people and indexing and instantiating from it.

In this case, it would really be good to provide the hashes via annotations or a meta file. ldb index shouldn't need to download data objects.

jonburdo commented 2 years ago

@iterative/dvc Some things I'll be trying to figure out for beta are:

What optimizations should be covered by dvc-objects and dvc-data?
What are we still missing?
Will other hashes besides md5 be usable if we are using dvc-objects and dvc-data? (It seems it would be much easier to support multiple hash algorithms in these, even if dvc itself can't) Not sure we'll use something else for beta, but moving away from md5 might be easier sooner rather than later

casperdcl commented 2 years ago

See also the totally very brief discussions at https://github.com/iterative/dvc/issues/3069 (also https://github.com/iterative/dvc/issues/1676 and https://github.com/iterative/dvc/issues/3439) :)

0x2b3bfa0 commented 2 years ago

TL;DR Take a look to BLAKE3 if the main goal is speed, or SHA3–256 if the main goal is conservative security and NIST blessing. Supporting multiple algorithms might open a can of worms. Usually, choose–your–own–adventure cryptography is not a good idea.

karajan1001 commented 2 years ago

In one of my previous companies, I used to work on a system storing millions of pictures. The picture is stored in HFDS, and metadata is stored in MongoDB. Hashcode is used as the primary key and calculated on the thousands of terminals that generated the pictures. All Other fields are mutable, what I'm doing is to write a program to parse the text from the picture and then update the metadata (The most computational intensive work OCR is also done on the terminal side).

And I agreed that in LDB, all heavy calculations should be close to the data to minimize data transfer. Maybe the same for DVC on the remote dependency case.

shcheklein commented 2 years ago

@karajan1001 this reminds be DVC. did you use hash after that in a "DVC-way" as file names on HDFS?

@jonburdo I think it's worth clarifying the workload pattern (e.g. many objects initially, small updates after vs many objects initially and many object every day).

For example, it might mean that people would be fine to run indexing no matter where and how we calculate it. E.g. they could run EC2 instance once. And if updates are small it might be more problematic to find the files to index vs actually indexing them.

This is where a lambda or just an ec2 instance would help

we can recommend folks running the instance first time if they are indexing a lot of stuff ... complexity-wise it might be comparable to lambda

karajan1001 commented 2 years ago

did you use hash after that in a "DVC-way" as file names on HDFS?

No, we use terminal MAC address + creating DateTime string instead. Most of the important Information had already been in MongoDB, the raw picture is only used for backup I think.

pared commented 2 years ago

@jonburdo and how are the files going to be stored? Is it gonna be a cache - like structure in DVC?

jonburdo commented 2 years ago

Usually, choose–your–own–adventure cryptography is not a good idea.

@0x2b3bfa0 Thanks, that's helpful input. It does seem like a configurable hash could get in the way. sha256 seems best, but I would need to investigate if speed will be a problem.

If we calculate the hash only of the file contents like dvc does instead of prepending something to the hashed content like git does (e.g. blob filesize\0) then it's trivial to find cases where our system breaks with md5. It may be unlikely for the typical user, but there's technically known data blobs that could be silently interchanged if someone decided to put them in a dataset. It might make sense to use whatever dvc switches to. It may not provide any benefit, but it's also interesting to note that git now supports sha256 I believe.

jonburdo commented 2 years ago

And I agreed that in LDB, all heavy calculations should be close to the data to minimize data transfer.

@karajan1001 Good point. This highlights another point that @volkfox and I have discussed, which is the fact that streaming data objects through a model to sort, filter, extract features without first downloading them locally may be a common use case. For example user may want a downsampled or processed image to be instantiated instead of the original, or they may want to instantiate the first 100 images that satisfy a certain condition when passed through a model.

So if we will need to support use of lambdas or intermediate servers between storage and the LDB instance for these cases, then it would be natural to use them for hash calculation.

jonburdo commented 2 years ago

@jonburdo I think it's worth clarifying the workload pattern (e.g. many objects initially, small updates after vs many objects initially and many object every day).

@shcheklein I think we'll need to support both, but should maybe prioritize the small updates pattern to start. Some people are regularly collecting additional data, but if they're doing something like online learning, they may not bring all of it into a system like LDB. @volkfox may have more to add

And if updates are small it might be more problematic to find the files to index vs actually indexing them.

Quickly filtering out the already-indexed files is definitely something we need to handle. We may be able to make use of DVC's code for this - need to look into this more

jonburdo commented 2 years ago

@jonburdo and how are the files going to be stored? Is it gonna be a cache - like structure in DVC?

@pared The goal is to make use of what we can in dvc-objects and dvc-data once those are split out. We may not always want things we index to be downloaded into the cache, and I'm not sure right now how much of the dvc cache code we'll want to use. We definitely want caching for datasets that are instantiated locally and for metadata, so we can easily check if the storage (an append-only data lake) has been corrupted and which files have not yet been indexed (newly added files)

iterative / ldb