tensorwerk / hangar-py

Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.
Apache License 2.0

[FEATURE REQUEST] Multi-tensor samples from complex data sources #79

Open elistevens opened 5 years ago

elistevens commented 5 years ago

Is your feature request related to a problem? Please describe.
I want to know how to accomplish the following. It doesn't need to require zero effort on the user's part, but there needs to be a clear best-hangar-practices path to a workable setup.

Take a source data format that is complex (e.g. DICOM, JPEG) and infeasible to reconstitute bit-exact from the tensor+metadata form. Each instance of the raw data produces a sample that consists of 2 (or more) tensors: an image tensor and a 1D tensor that encodes things like lat/long or age/sex/etc. (to be concatenated with the output of the convolutional layers prior to the fully connected layers). To be clear, this is intended to be an illustrative example, not a concrete use case.

Per my reading of the docs, right now these two tensors wouldn't qualify as being in the same hangar dataset (it's not clear if that's problematic or not).

Let's express the above conversion as:

f_v1(raw) -> (t1, t2)
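For concreteness, a hypothetical f_v1 for the DICOM case might look like this (a pydicom-based sketch; the exact fields pulled into t2 are arbitrary):

import numpy as np
import pydicom

def f_v1(path):
    # Lossy conversion: the original DICOM cannot be reconstituted from these tensors.
    ds = pydicom.dcmread(path)
    t1 = ds.pixel_array.astype(np.float32)            # image tensor
    age = float(str(ds.PatientAge)[:3])               # e.g. '063Y' -> 63.0
    sex = 1.0 if ds.PatientSex == 'M' else 0.0
    t2 = np.array([age, sex], dtype=np.float32)       # 1D "extras" tensor
    return t1, t2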

Users will need to:

Describe the solution you'd like
I think that changing the definition of a sample to be a tuple of binary blobs plus a tuple of tensors plus metadata would work, but I haven't considered the potential impacts of that kind of change. Seems potentially large.

Describe alternatives you've considered
Another option would be to have separate datasets for t1 and t2 and combine them manually, plus manage the binary blobs separately. That seems like a lot of infra work, and risks drift between the samples themselves, and between the samples and the blobs.

Additional context
I suspect that I want/expect Hangar to solve a larger slice of the problem than it's intended to, but it's not clear at first glance what the intended approach would be for more complicated setups like the above.

lantiga commented 5 years ago

Hey @elistevens, thanks for your patience, we should have gotten to this earlier.

So, that’s right, a Dataset in Hangar maps to a single schema (a single collection of tensors). A Repo, on the other hand, can have more than one Dataset, so, for instance, to store data where two different pieces of info sit alongside each other you’d have a Repo with two Datasets. The related samples in the two Datasets would share the same sample name.

Thus, right now what you are looking for regarding 1 sample -> 2+ tensors is possible; it’s probably a matter of:

a) terminology: a Dataset and a Sample in Hangar each cover a single piece of what you’d call a Dataset and a Sample - food for thought as a possible source of confusion moving forward;

b) coordination: removing one observation means removing two tensors with the same name from the two respective Datasets.
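To make (b) concrete, here’s a rough sketch of the pattern (treat init_dataset and its arguments as illustrative assumptions about the API, not exact Hangar calls):

import numpy as np
from hangar import Repository

repo = Repository('/path/to/repo')
co = repo.checkout(write=True)

# One Dataset per tensor produced by f_v1(raw) -> (t1, t2).
# (init_dataset and its arguments are assumed names, for illustration only.)
images = co.datasets.init_dataset('images', prototype=np.zeros((512, 512), dtype=np.float32))
extras = co.datasets.init_dataset('extras', prototype=np.zeros((4,), dtype=np.float32))

t1 = np.random.rand(512, 512).astype(np.float32)            # image tensor
t2 = np.array([41.9, 12.5, 63.0, 1.0], dtype=np.float32)    # lat/long, age, sex

# Coordination point (b): both tensors stored under the same sample name.
images['patient_001'] = t1
extras['patient_001'] = t2

co.commit('add patient_001 from its raw source')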

In the future we might introduce integrity constraints as an option, but it’s probably not very high priority at the moment (compared to other aspects, given how young the project is).

As for blobs and metadata: we have sample-level metadata already, which is properly serialized and versioned together with the tensor. Depending on the user’s needs, metadata could contain a URI of the original blob, for instance. Our current design is to avoid storing the original blobs in Hangar, although you could always store them in a Dataset as 1D tensors with a byte dtype, so to speak (a bit of a hack, but not all that bad after all).
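As a sketch of that byte-tensor hack (plain numpy round-trip; 'scan.dcm' is a placeholder path, and storing the resulting 1D tensor as a Hangar sample is left out):

import numpy as np

# Serialize an arbitrary binary blob (e.g. a DICOM or JPEG file) into a 1D uint8 tensor.
with open('scan.dcm', 'rb') as f:
    blob = f.read()

blob_tensor = np.frombuffer(blob, dtype=np.uint8)   # storable as a byte-dtype sample

# ...and reconstitute it bit-exact on the way back out.
restored = blob_tensor.tobytes()
assert restored == blob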

We are designing a way to go to and from the original file format (or other formats that make sense for the data) through an extensible plugin mechanism. The import step would store the necessary extra info in the metadata, the export step would look for that info in the metadata. The same plugin mechanism can also be leveraged for visualizing changes during diffs and conflict resolution, which we find quite exciting at the moment.

I hope I’ve addressed your concerns at least partially. I’d like to hear your thoughts on the design side of things.

elistevens commented 5 years ago

Perhaps confusion would be reduced if the Dataset/Sample classes were renamed to Tensorset/Tensor?

That would free up Sample to be "one or more Tensors" and Dataset to be "a collection of Samples."

I'm imagining a PyTorch Dataset subclass like:

import torch.utils.data


class HangarTorchDataset(torch.utils.data.Dataset):
    def __init__(self, tensorset_list):
        self.tensorset_list = tensorset_list
        # Assumes every tensorset shares the same sample keys.
        self.key_list = list(tensorset_list[0].keys())

    def __len__(self):
        return len(self.key_list)

    def __getitem__(self, ndx):
        key = self.key_list[ndx]
        # Return a tuple (not a generator) so DataLoader collation works.
        return tuple(ts[key] for ts in self.tensorset_list)

It's going to be confusing if all of those items in the tensorset_list are instances of something called a Dataset.

lantiga commented 5 years ago

Hey @elistevens, I for one like the name change quite a lot. It gives us the ability to free up the concepts for a later higher-level implementation of proper Datasets and Samples. We really need to avoid any possible confusion around the mental model, and this is a good step in that direction.

I'll let @rlizzo and @hhsecond chime in too, but this is a +1 for me.

rlizzo commented 5 years ago

Hey Eli, sorry for the delayed response here (it's been rather hectic lately, but I'm back in the groove now 😄). Addressing some of the points (in no particular order):

Nested Collections of Related Data Schemas

In some of the very first hangar prototype implementations, we had a much deeper hierarchy of containers for data, which essentially would have grouped what we now think of as a "Dataset" into collections of related pieces. Many "ArraySets", as they were called at the time, each with an independent schema, would constitute a single Dataset in a checkout (with arbitrarily many ArraySets or Datasets in a checkout). In addition, Metadata could be added to any ArraySet / Dataset, or be stored in its own single-level key/value space as desired.

I believe this architecture would have satisfied your initial problem statement?

We decided to kill this type of hardcoded architecture and flatten the namespace for a few reasons:

1) Naming and "Relationships" are difficult to keep simple, concise, and understandable (especially while trying to convey specific meanings to words like "dataset", "set", & "collection" which are fairly ambiguous in general speech).

2) Hard-coding hierarchy into the core of Hangar would be a rabbit hole with no end in sight. If we gave an inch here with a single container level, it's easy to imagine how people may want more and more levels in the future. We don't want to re-implement OS file system directory/folder hierarchies in the core of this system; hierarchies seem like an anti-pattern for keeping Hangar simple.

3) The intention is to design Hangar to be as adaptable as possible. A core tenet of the Hangar philosophy is that we believe you, the user, know your specific problem domain and data representation best. In this early state, we only want to provide the basic building blocks to store, retrieve, merge, diff, share, and time-travel across that data, regardless of what it looks like or what it is used for. By keeping a single-level hierarchy, we let you define the conventions appropriate to your specific problem.

@elistevens, does this make sense in the context of your specific problem? You're not too far off from something we probably should solve in the future, but hopefully the rationale for the current behavior and the workarounds are clear? At the end of the day it comes down to time allocation; if this is a real priority or show-stopper for you, let's talk about how we might be able to put a solution together.

Naming Conventions

With the above said, I'm on the fence (but leaning towards approval) for changing the name Dataset to something like Tensorset. My concern is that the term is rather abstract, and could be confusing for new Hangar users.

The term "Dataset" already has an abstract meaning in common vernacular. While we have to specify what a Hangar Dataset is (a group of samples with compatible schemas), we don't have to make a newcomer realize that Dataset holds some Data. This may be a nit-picky point, but I'd like to make sure it is very clear that a Tensorset is HIGHLY related to the term Dataset (and may interchangeable depending on the use case). What do you think?

elistevens commented 5 years ago

@rlizzo to be clear, I don't need Hangar to be an end-to-end, out of the box solution. I just want to make sure there's a clear, documented "this is the right way to solve this class of problems with Hangar" path forward.

Having multiple Tensorsets with the same set of keys and using that to build training samples by pulling the same key from each Tensorset (plus metadata as needed) should work fine. Same for having URIs point back to external blob stores.

I'd want to be able to commit/branch/etc. on multiple Tensorsets at the same time, and be able to refer to a single hash to say "this model was trained with data commit abc123." Having to specify a commit per "column" of my data would be annoying.

I agree that introducing a new name "Tensorset" isn't great, but I think that introducing a new concept while giving it a familiar name whose underlying meaning is subtly different is worse. Those subtle differences are going to trip people up a lot, especially when your thing is more limited than the general concept.

That said, I'm not super-sold on renaming what you currently label "Sample" to "Tensor" because really what you have is a tensor (in the TF/PyTorch sense) plus the metadata. The addition of the metadata does change it a bit. I still think that Sample is the wrong name, since samples can have multiple tensors, plus metadata. Whatever the metadata+tensor combo gets called should inform what Tensorset gets named too.

"Tensor" is probably fine, since you can say "Hangar Tensors have richer metadata than Tensors in PyTorch or Tensorflow, but the represent the same fundamental concept." Since your Tensors here are more capable than the baseline idea, it's not as big a problem. Still, worth pondering, IMO.

rlizzo commented 5 years ago

On the ability to perform multi dataset operations

Ok, I misunderstood the original question a bit! Sorry! What you're wanting to do is actually already available via the checkout.datasets.add function. However, it currently works only for non-named dataset samples; allowing named samples as well isn't a problem (I'll make an issue to remove this restriction now). From the corresponding docstring:

    def add(self, mapping: dict) -> str:
        '''Add related samples to un-named datasets with the same generated key.

        If you have multiple datasets in a checkout whose samples are related to
        each other in some manner, there are two ways of associating samples
        together:

        1) using named datasets and setting each tensor in each dataset to the
           same sample "name".
        2) using un-named datasets and this "add" method, which accepts a
           dictionary of "dataset names" as keys, and "tensors" (ie. individual
           samples) as values.

        When method (2) - this method - is used, the internally generated sample
        ids will be set to the same value for the samples in each dataset. That
        way a user can iterate over the sample keys in one dataset, and use
        those same keys to get the related tensor samples in another dataset.

        Parameters
        ----------
        mapping: dict
            Dict mapping (any number of) dataset names to tensor data (samples)
            to add. The datasets must exist, and must be set to accept samples
            which are not named by the user.

        Returns
        -------
        str
            generated id (key) which each sample is stored under in their
            corresponding dataset. This is the same for all samples specified in
            the input dictionary.

        Raises
        ------
        KeyError
            If no dataset with the given name exists in the checkout
        '''
It'll be fairly straightforward to add an analogous datasets.get() function to handle retrieving multiple samples from individual datasets in one single operation.
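For example, given a write checkout co with un-named datasets 'images' and 'extras' already initialized (and numpy arrays image_tensor / extras_tensor matching their schemas), usage would look roughly like this; the get counterpart is hypothetical:

# Store one related observation across both datasets in a single call;
# each sample lands under the same internally generated key.
generated_key = co.datasets.add({'images': image_tensor, 'extras': extras_tensor})

# A hypothetical symmetric retrieval (not implemented yet):
# image_tensor, extras_tensor = co.datasets.get(['images', 'extras'], generated_key)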

Also I haven't documented this functionality in any of the tutorials or docs outside of the generated API reference. I'll try to add some notes on this over the next few days!

On the naming discussion

I agree that introducing a new name "Tensorset" isn't great, but I think that introducing a new concept while giving it a familiar name whose underlying meaning is subtly different is worse. Those subtle differences are going to trip people up a lot, especially when your thing is more limited than the general concept.

Totally agreed on this point. I'll try to come up with an appropriate name here (any suggestions @lantiga, @hhsecond?).

That said, I'm not super-sold on renaming what you currently label "Sample" to "Tensor" because really what you have is a tensor (in the TF/PyTorch sense) plus the metadata. The addition of the metadata does change it a bit. I still think that Sample is the wrong name, since samples can have multiple tensors, plus metadata. Whatever the metadata+tensor combo gets called should inform what Tensorset gets named too.

Sorry, but would you mind clarifying what you mean by "metadata"? Aside from the containing dataset name and sample ID/name there's no other information directly attached (from the perspective of the hangar core). I may be misreading this point though, so sorry if that's the case.

elistevens commented 5 years ago

I thought that samples had a place to store things like patient age/sex/smoking history, or JPEG lat/long, etc. alongside the actual tensor. That's the metadata I mean.

Also, FYI, your terminology of "non-named dataset samples" seems confusing to me. It sounds like you're talking about datasets that auto-key Tensor-Samples added to them, vs. situations where the user provides the key. Is that right?

rlizzo commented 5 years ago

I thought that samples had a place to store things like patient age/sex/smoking history, or JPEG lat/long, etc. alongside the actual tensor. That's the metadata I mean.

So we actually removed that long ago (before the initial commit was made public) in favor of keeping everything as tensors stored in datasets. Right now that info could either be stored as related samples in another dataset (with an appropriately sized row, etc.) or in the top-level (text based) metadata container.
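For instance, the two options might look like this (treat the co.metadata mapping access and the dict-style dataset access as illustrative assumptions about the API):

import numpy as np

# Option 1: top-level (text based) metadata container, keyed by convention.
co.metadata['patient_001.source_uri'] = 's3://raw-bucket/scans/patient_001.dcm'

# Option 2: a related sample in a dedicated dataset, with a row sized for the fields.
attrs = np.array([63.0, 1.0, 12.0], dtype=np.float32)  # age, sex, pack-years (illustrative)
co.datasets['patient_attrs']['patient_001'] = attrs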

The rationale is that the hangar core should be kept as simple as possible, having no concept of relations between datasets or samples. Higher-level convenience functions (automatically dealing with aggregations, relationship mapping, etc.) can be built on top of the core to eliminate boilerplate code for the user, but only once the core is stable and there is a need for them.

Do you have a use case where storing info alongside tensors is rather important?

Also, FYI, your terminology of "non-named dataset samples" seems confusing to me. It sounds like you're talking about datasets that auto-key Tensor-Samples added to them, vs. situations where the user provides the key. Is that right?

Precisely. Should probably change it to a clearer definition/name...

lantiga commented 5 years ago

I think at this point we could give the idea of sample-level metadata a minute. It would be very convenient to keep track of provenance for a tensor (the URI of the file or resource it was generated from, for instance, or the metadata for a DICOM series). This would potentially allow regenerating something close to the source data if needed.

We could keep such metadata as a blob in LMDB. I’m guessing the biggest hurdle would be merging. What do you think @rlizzo?

rlizzo commented 5 years ago

I'm going to take a day or two to think. I can see the benefit, but the implementation needs to be figured out. Making it work would be trivial, but making this multithread safe would be difficult (I have a branch where read checkouts are now completely thread and process safe, increasing throughput linearly with core count). LMDB doesn't tend to play nice in these aspects. I'll go into more details here later.

Merging wouldn't be an issue so long as we decide what a conflict is from a conceptual standpoint.

Eli, I'm imagining this type of attached metadata would effectively act as a key/value store. Would that be sufficient? Any idea what you might want the API to be?

elistevens commented 5 years ago

I don't have any concrete use cases, nor am I certain that the "just more tensors" approach wouldn't work for anything I came up with.

I had been imagining using it to filter or shape the data in some way. Something like "cancer stage" for a binary tumor classifier that a PyTorch Dataset could use to balance the training data across stages, even if training samples of a given class are over-represented. Or being able to limit an Imagenet Tensorset to only birds+airplanes.
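Something like this sketch, where the stage labels would come out of a hypothetical 'stages' Tensorset as a numpy array:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Per-sample "cancer stage" labels, e.g. pulled from a 'stages' Tensorset.
stages = np.array([0, 0, 0, 0, 1, 1, 2, 0])     # illustrative; stage 0 over-represented
counts = np.bincount(stages)                    # samples per stage
weights = 1.0 / counts[stages]                  # down-weight over-represented stages

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(stages),
    replacement=True,
)
# Pass the sampler to a DataLoader wrapping the HangarTorchDataset above, e.g.:
# loader = torch.utils.data.DataLoader(hangar_torch_dataset, batch_size=4, sampler=sampler)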

rlizzo commented 4 years ago

Related to #162