rlizzo opened 5 years ago
@lantiga let's continue our discussion here.
@rlizzo Why did we close this?
Sorry that was a mistake!
The long and short of it is that I'm leaning more towards not implementing this. I'll post more comments about my rationale later, but Luca and I talked today and came to similar conclusions.
Ok, so for reference: I've decided (with input from @lantiga) that we will not implement this inside of Hangar. The rationale is this:
Keeping trained models alongside data in a Hangar repository would be akin to adding compiled executables to commits alongside the code used to generate those executables in a Git repository.
Hangar's primary purpose is to act as a distributed version control system for "data". While this is extremely relevant to the machine learning community, it is not a "machine learning specific" utility. If we build in specialized containers which only apply to machine learning, they have to stay forever, and need to be maintained even when the next "big thing" comes out. It's a cycle that repeats forever, and is just not good design.
Repositories are meant to be shared, and we imagine that they will be open sourced in many cases (just like Git projects). Including model binaries that were trained with a specific purpose in mind is just bloat to someone who wants to fork or clone a repo and use the data contained in it in another application. Maybe the model would be useful to them, but most likely it's not. Using a project's repo as a dependency would get messy very quickly for anyone who wants to build on the project's data.
Sharing data in the open is an idea which most people seem to agree is good. Packaging a trained model binary file, which resulted from a team's model code, compute/training time, AND the data itself, may prevent users who deal with proprietary applications from using Hangar at all. Enterprise users are a group which needs to (at the very least) be considered in fundamental design decisions.
Even if we did add a trained model container to Hangar, there's very little chance that it would be useful in the future, as it may require a specific machine learning framework version to execute (one which may not be supported, or even available, in the future).
For these reasons (and maybe a few more still in my head), Hangar will not create a custom container class to store machine learning models.
That being said, it's a very valid point that people don't want to retrain machine learning models if they want to go back in time. In the future, the supporters of the Hangar project, [tensor]werk inc., will release a more focused tool to handle all of this. Some preliminary designs and ideas are in the works, though the timeline for an open source alpha release is not yet known.
@hhsecond, is this a valid reason from your point of view? I'm open to hearing any counterarguments to this decision.
@rlizzo These points make absolute sense, and keeping Hangar a tool for "data" (only) is probably the best way of handling it as a project/product, since Hangar is designed to store numerical arrays. But in that case, Hangar alone is only going to be useful to a niche crowd, because we are essentially trying to solve a problem which most people in the industry are not worried about, or they are too big to use Hangar (it might be just me, but this is what I have seen). That being said, I could see the absence of a tool like Hangar as the reason organizations haven't yet done what you lay out in the thread above.
Hey @rlizzo and @hhsecond.
After sleeping on this, I think we should still allow users to store binary blobs as a special case of the current functionality. Either store a blob as a dataset of one sample, where the sample is a byte array, or treat it as metadata (we may not be able to, due to size considerations, not sure). So, not explicit support for models as a concept per se, but a way to store a blob if you're so inclined. The same way Git does it: it's not best practice, but you can still do it if you really want to.
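For concreteness, this is roughly all a user would need to round-trip a binary file through a uint8 array today. It's only a sketch: the Hangar calls at the bottom are hypothetical placeholders, not the real checkout/dataset API.

```python
# Sketch of the "blob as a single-sample dataset" idea: serialize an arbitrary binary
# file into a 1-D uint8 array (so it fits a tensor-only storage model), then recover
# the original bytes later.
import numpy as np

def blob_to_array(path: str) -> np.ndarray:
    """Read an arbitrary binary file into a 1-D uint8 numpy array."""
    with open(path, 'rb') as f:
        return np.frombuffer(f.read(), dtype=np.uint8)

def array_to_blob(arr: np.ndarray, path: str) -> None:
    """Write a 1-D uint8 numpy array back out as a binary file."""
    with open(path, 'wb') as f:
        f.write(arr.tobytes())

# Hypothetical usage against a write-enabled checkout `co` (names are illustrative,
# not Hangar's actual API):
#   blob = blob_to_array('model.pt')
#   dset = co.datasets.init_dataset('trained_model', prototype=blob)
#   dset['v1'] = blob
#   co.commit('store serialized model as a single uint8 sample')
```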
Models as a concept will ideally come later in a separate tool that will manage lineage across source code, data, hyperparameters, trained models, etc. It's quite possible that it will share significant portions of its source code with hangar (e.g. the Merkle DAG thing).
@lantiga: I suppose blob stores are a fair compromise here. If we go that route, I would keep essentially the same conventions (and probably just reuse most of the code) we define for metadata, namely:
The only reason this isn't totally trivial to implement is that, unlike metadata (for which we define a limit on the size of values), my gut says we wouldn't store the blobs in LMDB directly, since there is no concept of cross-record value compression there. It's not an issue for short strings, but it would be more desirable to pack and compress arbitrarily sized blobs together. We would basically be emulating what a Git pack file does. It wouldn't be hard to put together a naive prototype, but it's hard to do packs well...
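Just to make the pack idea concrete, a naive version would look roughly like the following. It's purely a sketch: no delta compression, no on-disk index, no crash safety, and the class name is made up.

```python
# Naive sketch of a git-style "pack" for blobs: append zlib-compressed blobs to a
# single file and keep an in-memory index of (offset, length) keyed by content hash.
import hashlib
import zlib

class NaivePack:
    def __init__(self, path: str):
        self.path = path
        self.index = {}          # digest -> (offset, compressed_length)

    def add(self, blob: bytes) -> str:
        digest = hashlib.sha1(blob).hexdigest()
        if digest in self.index:
            return digest        # already packed; content addressing dedups for free
        compressed = zlib.compress(blob)
        with open(self.path, 'ab') as f:
            offset = f.tell()    # append mode positions us at the end of the file
            f.write(compressed)
        self.index[digest] = (offset, len(compressed))
        return digest

    def get(self, digest: str) -> bytes:
        offset, length = self.index[digest]
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return zlib.decompress(f.read(length))
```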
Maybe that's premature though?
Of course we could provide the same partial clone semantics we are building for data records, but that's still very much in the works... For right now I'll say that I think it's needed and should be done; we just need to decide how much we care about disk space. If we don't, it's trivial; if we do, it's a project, and I can't see this getting to the top of my priority list anytime soon. (I think that even if you wanted to work on this, both of your time would be much better spent on user-facing features.)
So... What are your thoughts? How important is this?
@hhsecond for the sake of not duplicating a bunch of documentation I've already written about the benefits of storing your data via the Hangar "philosophy", I'm going to point you to This Section Of the Docs.
Let me know if that addresses your point. It's an extremely valid one, and if the current explanations aren't clear or compelling enough, I will make the time to improve them.
@rlizzo I have re-read the docs and here are my thoughts about the changes required. Let me know what you think.
- The Domain-Specific File Format Problem
- Human & Computational Cost
- Others
Reviving this thread. Especially with partial fetch, I still think we should provide the ability to store individual blobs and version them. In fact, they would behave like single-tensor dsets, maybe with a dedicated data store that just writes the blob to disk as a file instead of relying on an HDF5 file (see the sketch below). Apart from this detail, we already have everything we need; we may only need to think about how to make it convenient from a UI perspective.
This would solve the model store problem with very little added complexity, with the benefit of being able to partially fetch a specific version of a model.
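For illustration, the "write the blob to disk as a file" store could be as simple as the following. The names are made up for the sketch and don't correspond to an actual Hangar backend interface.

```python
# Minimal sketch of a "one file per blob" backend, as an alternative to packing blobs
# into HDF5: content-address each blob by hash and write it straight to disk.
import hashlib
from pathlib import Path

def write_blob(store_dir: str, blob: bytes) -> str:
    """Store a blob under its content hash; returns the digest used as its key."""
    digest = hashlib.sha1(blob).hexdigest()
    dest = Path(store_dir) / digest[:2] / digest[2:]
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():        # content addressing gives deduplication for free
        dest.write_bytes(blob)
    return digest

def read_blob(store_dir: str, digest: str) -> bytes:
    """Fetch a previously stored blob by its digest."""
    return (Path(store_dir) / digest[:2] / digest[2:]).read_bytes()
```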
Is your feature request related to a problem? Please describe.
Allow binary model storage for trained models.