tensorwerk / hangar-py

Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.
Apache License 2.0

[FEATURE REQUEST] Model Storage #68

Open rlizzo opened 5 years ago

rlizzo commented 5 years ago

**Is your feature request related to a problem? Please describe.**
Allow binary model storage for trained models.

**Describe the solution you'd like**
A clear and concise description of what you want to happen:

rlizzo commented 5 years ago

@lantiga let's continue our discussion here.

hhsecond commented 5 years ago

@rlizzo Why did we close this?

rlizzo commented 5 years ago

Sorry that was a mistake!

The long and short of it is that I'm leaning towards not implementing this. I'll post more comments about my rationale later, but Luca and I talked today and came to similar conclusions.

rlizzo commented 5 years ago

Ok, so for reference: I've decided (with input from @lantiga) that we will not implement this inside of hangar. The rationale is this:

For these reasons (and maybe a few more still in my head), Hangar will not create a custom container class to store machine learning models.

That being said, it's a very valid point that people don't want to retrain machine learning models if they want to go back in time. In the future, the supporters of the Hangar project, [tensor]werk inc., will release a more focused tool to handle all of this. Some preliminary designs and ideas are in the works, though the timeline for an open source alpha release is not yet known.

@hhsecond, is this a valid reason from your point of view? I'm open to hearing any counter arguments to this decision.

hhsecond commented 5 years ago

@rlizzo These points make absolute sense, and keeping hangar a tool for "data" (only) is probably the best way of handling it as a project/product, since hangar is designed to store numerical arrays. But in that case, hangar on its own is going to be useful only to a niche crowd, because we are essentially trying to solve a problem that most people in the industry are not worried about, or they are too big to use hangar (it might be just me, but this is what I have seen). That being said, I could see the absence of a tool like hangar as the reason organizations haven't yet done what you lay out in the thread above.

lantiga commented 5 years ago

Hey @rlizzo and @hhsecond.

After sleeping on this, I think we should still allow users to store binary blobs as a special case of the current functionality. Either store a blob as a dataset of one sample, where the sample is a byte array, or treat it as metadata (we may not be able to, due to size considerations, not sure). So, no explicit support for models as a concept per se, but a way to store a blob if you're so inclined. The same way git does it: it's not best practice, but you can still do it if you really want to.
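To make the "blob as a one-sample byte array" idea concrete, here is a minimal round-trip sketch in plain NumPy (deliberately not using Hangar's own API; the function names are illustrative). Any serialized model is just bytes, and bytes map losslessly to a 1-D uint8 tensor:

```python
import numpy as np

def blob_to_tensor(blob: bytes) -> np.ndarray:
    """View an arbitrary binary blob (e.g. a serialized model) as a 1-D uint8 array."""
    return np.frombuffer(blob, dtype=np.uint8)

def tensor_to_blob(arr: np.ndarray) -> bytes:
    """Recover the original bytes from the uint8 array."""
    return arr.tobytes()

# Round trip: the blob survives the tensor representation unchanged.
model_bytes = b"\x80\x04fake-serialized-model"
arr = blob_to_tensor(model_bytes)
restored = tensor_to_blob(arr)
```

Under this scheme the blob could live in a dataset containing a single sample, and versioning falls out of the existing machinery for free.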

Models as a concept will ideally come later in a separate tool that will manage lineage across source code, data, hyperparameters, trained models, etc. It's quite possible that it will share significant portions of its source code with hangar (e.g. the Merkle DAG thing).

rlizzo commented 5 years ago

@lantiga: I suppose blob stores are a fair compromise here. If we go that route, I would keep essentially the same conventions (and probably reuse most of the code) we define for metadata, namely:

The only reason this isn't totally trivial to implement is that, unlike metadata (for which we define a limit on the size of values), my gut says we wouldn't store the blobs in LMDB directly, since there is no concept of cross-record value compression there. That's not an issue for short strings, but it would be more desirable to pack and compress arbitrarily sized blobs together. We would basically be emulating what a Git pack file does. It wouldn't be hard to put together a naive prototype, but it's hard to do packs well...
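For illustration, a toy version of the pack idea fits in a few lines: concatenate zlib-compressed blobs into one buffer and keep an offset index for random access. (This is a sketch of the concept only; a real Git pack also does delta compression between similar objects, which is the genuinely hard part alluded to above.)

```python
import zlib

def pack_blobs(blobs: dict) -> tuple:
    """Compress each blob and append it to one buffer; record (offset, length) per key."""
    index, data = {}, bytearray()
    for key, blob in blobs.items():
        comp = zlib.compress(blob)
        index[key] = (len(data), len(comp))
        data.extend(comp)
    return index, bytes(data)

def unpack_blob(index: dict, data: bytes, key: str) -> bytes:
    """Random access into the pack: slice by offset, then decompress."""
    offset, length = index[key]
    return zlib.decompress(data[offset:offset + length])
```

Even without deltas, packing many blobs into one file amortizes filesystem overhead and lets the compressor work across record boundaries, which LMDB values alone would not give us.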

Maybe that's premature though?

Of course, we could provide the same partial clone semantics we are tooling for data records, but that's still very much in the works... For right now I'll say that I think it's needed and should be done. We just need to decide how much we care about disk space. If we don't, it's trivial; if we do, it's a project, and I cannot see it getting to the top of my priority list anytime soon. (I think that even if you wanted to work on this, both of your time would be much better spent on user-facing features.)

So... What are your thoughts? How important is this?

rlizzo commented 5 years ago

@hhsecond For the sake of not duplicating a bunch of documentation I've already written about the benefits of storing your data via the Hangar "philosophy", I'm going to point you to This Section Of the Docs.

Let me know if that addresses your point. It's extremely valid, and if the current explanations aren't clear or compelling enough I will make the time to improve them.

hhsecond commented 5 years ago

@rlizzo I have re-read the docs, and here are my thoughts about the changes required. Let me know what you think.

The Domain-Specific File Format Problem:

Human & Computational Cost

Others

lantiga commented 5 years ago

Reviving this thread. Especially with partial fetch, I still think we should provide the ability to store individual blobs and version them. In fact, they would behave like single-tensor dsets, maybe with a dedicated data store that just writes the blob to disk as a file instead of relying on an HDF5 file. Apart from this detail, we already have everything we need; we may only need to think about how to make it convenient from a UI perspective.
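A sketch of what that dedicated store might look like: each blob goes to its own file on disk, addressed by its content hash so identical blobs deduplicate naturally. (The class name and layout here are hypothetical, not Hangar's actual backend interface.)

```python
import hashlib
from pathlib import Path

class FileBlobStore:
    """Content-addressed blob store: one file per blob, named by SHA-1 digest."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, blob: bytes) -> str:
        """Store a blob and return its digest; identical blobs share one file."""
        digest = hashlib.sha1(blob).hexdigest()
        (self.root / digest).write_bytes(blob)
        return digest

    def read(self, digest: str) -> bytes:
        """Retrieve a blob by digest."""
        return (self.root / digest).read_bytes()
```

Because the digest is derived from the content, a commit only needs to record the digest, and partial fetch of a single model version reduces to fetching one file.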

This would solve the model store problem with very little added complexity, with the added benefit of partially fetching a specific version of a model.