[FEATURE REQUEST] Model Storage

rlizzo commented 5 years ago

Is your feature request related to a problem? Please describe. Allow binary model storage for trained models

Describe the solution you'd like A clear and concise description of what you want to happen:

a new class in the checkout objects which allows for binary storage of model blobs

rlizzo commented 5 years ago

@lantiga let's continue our discussion here.

hhsecond commented 5 years ago

@rlizzo Why did we close this?

rlizzo commented 5 years ago

Sorry that was a mistake!

The long and short of it is that I'm leaning more towards not implementing this. I'll post more comments about my rationale later, but me and Luca talked today and came to similar conclusions.

rlizzo commented 5 years ago

Ok. so for reference, I've decided (with input from @lantiga) that we will not implement this inside of hangar. The rationale is this:

Keeping trained models alongside data in a hangar repository would be akin to adding compiled executables to commits alongside the code used to generate those executables in a Git repository.
- Hangar's primary purpose is to act as a distributed version control system for "data". While this is extremely relevant to the machine learning community, it is not a "machine learning specific" utility. If we build in specialized containers which only apply to machine learning, they have to stay forever, and need to be maintained even when the next "big thing" comes out. It's a cycle that repeats forever, and is just not good design.
- Repositories are meant to be shared, and we imagine that they will be open sourced in many cases (just like Git projects). Including binaries which are trained on a model with a specific purpose in mind is just bloat to someone who wants to fork or clone a repo and use the data contained in it in another application. Maybe the model would be useful to them, but most likely it's not. Using a project's repo as a dependency would get messy really quickly for anyone who wants to build on the project's data.
- Sharing data in the open is an idea which most people seem to agree is good. Packaging a trained model binary file which resulted from a team's model code, compute/training time, AND the data itself may prevent users who deal with proprietary applications from using Hangar at all. Enterprise users are a group which need to (at the very least) be considered in fundamental design decisions.
Even if we did add a trained model container to Hangar, there's very little chance that it will be useful in the future, as it may require a specific machine learning framework version to execute (and which may not be supported, or even available in the future).
- the only way to avoid this problem is to include the model code itself alongside the binary. At that point, it's way to complicated and specific a problem for a tool like Hangar (which just really knows about numerical bytes) to deal with itself.

For these reasons (and maybe a few more still in my head), Hangar will not create a custom container class to store machine learning models.

That being said, it's a very valid point that people don't want to retrain machine learning models if they want to go back in time. In the future, the supporters of the Hangar project [tensor]werk inc. will release a more focused tool to handle all of this. Some preliminary designs and ideas are in the works though the timeline for an open source alpha release is not yet know.

@hhsecond, is this a valid reason in your point of view? I'm open to hearing any counter arguments to this decision.

hhsecond commented 5 years ago

@rlizzo These points make absolute sense and keeping hangar a tool for "data" (only) is probably the best way of handling it as a project/product since hangar is designed to store numerical arrays. But, In that case, just hangar is going to be useful for a niche crowd because we are essentially trying to solve a problem which most of the people in the industry is not worried about or they are too big to use hangar (it might be just me, but this is what I have seen). That being said, I could see the absence of a tool like hangar as the reason organizations haven't done what you lay down in the above thread yet.

lantiga commented 5 years ago

Hey @rlizzo and @hhsecond.

After sleeping on this, I think all we should still allow users to store binary blobs as a special case of the current functionality. Either store a blob as a dataset of one sample, where the sample is a byte array, or treat it as meta data (we may not be able to due to size considerations, not sure). So, not have explicit support for models as a concept per se, but a way to store a blob if you're so inclined. The same way git does: it's not best practice, but you can still do it if you really want it.

Models as a concept will ideally come later in a separate tool that will manage lineage across source code, data, hyperparameters, trained models, etc. It's quite possible that it will share significant portions of its source code with hangar (e.g. the Merkle DAG thing).

rlizzo commented 5 years ago

@lantiga : I suppose blob stores are a fair compromise here. If we go that route, I would keep essentially the same conventions (and probably just use most of the code) we define for metadata, namely:

single level key value store where keys must be assigned a name by the user. Values must be raw bytes.
none of the intelligent diff/merge inference methods will be used like for numerical arrays. If there is a conflict in hash values, the user needs to fix it themselves.
performance should be reasonable for IO, but no guarantees that it is the most optimized key/value store out there (it's not)

The only reason this isn't totally trivial to implement is that Unlike metadata (which we define a limit for the size of values) my gut says we wouldn't store the blobs in lmdb directly as there is no concept of cross record value compression. It's not an issue for short strings, but it would be more desirable to pack and compress arbitrarily sized blobs together. We would basically be emulating what a Git pack file does. It wouldn't be hard to put together a nieve prototype, but its hard to do packs well...

Maybe that's premature though?

Of course we could provide the same partial clone semantics we are tooling for data records, but it's still very much in the works... I'm going to say for right now that I think it's needed and should be done. We just need to decide how much we care about disk space. If we don't, it's trivial, if we do, it's a project and I cannot see this getting to the top of my priority list anytime soon. (I think that even if you wanted to work on this, both if your time would be much better spent on user facing features)?

So... What are your thoughts? How important is this?

rlizzo commented 5 years ago

@hhsecond for the sake of not duplicating a bunch of documentation I've already written about the benefits of storing you data via the Hangar "philosophy", I'm going to point you to This Section Of the Docs.

Let me know if that addresses your point. It's extremely valid, and if the current explanations aren't clear or compelling enough I will make the time to improve the.

hhsecond commented 5 years ago

@rlizzo I have re-read the docs and here are my thoughts about the changes required. Let me know what do you think.

The Domain-Specific File Format Problem:

The benefit of keeping it in particular file format is intractability. With hangar, users are loosing it (to a great extent). We probably need to answer this concern in the doc (also, with import/export we could handle this)
Would be great if we can validate this statement with an example "there is an incredible amount of freedom (and power)"

Human & Computational Cost

It is not addressing the issue in a convincing way IMO. Even with hangar, they need to write the transformation pipeline (Yes, only once but it has to be written). The problem actually escalates with Hangar since the transformation code is not required for the rest of the steps in the deep learning workflow because we have transformed and saved the data in hangar. What if the old developer moves away and the new developer has no clue how the data is transformed to this stage where he has no access to the code used to make the transformation.

Others

"Standing on the shoulder of giants" -> we definitely need some examples here (I could make it up perhaps, but do let me know if you have some idea). Hangar under the hood explains the core of Hangar but probably not why Hangar and it's core data philosophy is better than other system in the market explicitly.
Is there a benefit from this "Two synced repositories can use completly different backends to store the data".

lantiga commented 5 years ago

Reviving this thread. Especially with partial fetch, I still think we should provide the ability to store individual blobs and version them. In fact they would behave like single-tensor dsets, maybe with a dedicated data store that just writes the blob to disk as a file instead of relying on a HDF5 file. Apart from this detail, we already have everything we need, we may only think about how to make it convenient from a UI perspective.

This would solve the model store problem with very little added complexity, with the benefit of partial fetching a certain version of a model.

tensorwerk / hangar-py

[FEATURE REQUEST] Model Storage #68