Open vanangamudi opened 4 years ago
Hi Selva, thanks for trying out Hangar and raising the issue. I assume you found your way here through Adam. If you have tried out DVC, you must be familiar with a few of its downsides, especially when it comes to performance. Hangar takes another approach to versioning your data: it is built from the ground up rather than relying on Git for versioning (storing tensors rather than blobs). We completely understand that DVC's approach is useful for some folks, and having data versioning go hand in hand with Git is specifically useful (we are working toward a solution for this now, which should also help with the metrics feature). We are making Hangar something that fits into the user's code base and works easily with existing frameworks as well. I would be happy to take a call with you to guide you through a few examples I discussed with Adam. Please be aware that we have a Slack user group where you'll get faster responses than here. Here is the link to the Slack group. Also, @rlizzo might have a few more things to add here.
Hey @vanangamudi,
Thanks for waiting on a reply here! I'll expand a bit upon @hhsecond's excellent summary above.
At first glance, Hangar, DVC, and DAT might appear to solve similar(ish) problems: versioning, making use of, and distributing/collaborating on "data". However, the implementation, design, and world-view of each tool are drastically different, and those differences have a major impact on end-user workflows and performance.
The simplest way to understand how and why Hangar and DVC differ might be:

- Hangar starts from the ground up (taking inspiration from Git) and asks: "how can we build a version control system specifically designed to deal with arbitrary numeric 'data'?"
- DVC starts from Git and asks: "how can I add additional components and modules to this existing version control system in order to allow it to deal with arbitrary binary 'data' files?"

A really important point in the above statements is the difference between what Hangar and DVC consider "data". This massively affects every aspect of usage, performance, and scaling ability (as explained below).
Because Hangar thinks of data only as numeric arrays, there is no need for Hangar to consider domain-specific storage conventions or formats! With a small set of schemas and technologies, Hangar generates highly optimized storage containers to read/write data on disk, completely automatically and transparently to the user.
As a Hangar user, all you do is say:
```python
>>> # write data
>>> checkout.arrayset['some_sample'] = np.array([0, 1, 2])  # some data
>>> # read data
>>> res = checkout.arrayset['some_sample']
>>> res
array([0, 1, 2])
```
In a Hangar workflow, there is no concept of a "file". Data goes in and comes out as np.array (or torch.Tensor / tf.Tensor if using our built-in torch/tensorflow dataloaders), and Hangar manages storage itself (inside the .hangar directory). You'll never have to deal with files at all if your data is stored in Hangar.

Most importantly in Hangar: the numeric data you put in is exactly the numeric data you get out. While explaining how data is indexed/stored in the Hangar backends is well beyond the scope of this article, it should be noted that the method by which Hangar stores the data is almost completely arbitrary. Over time, the backend some piece of data resides in can (and often will) change or be updated. It is up to Hangar to ensure that when you ask for some data sample, the numeric data is returned exactly as it went in. How it is stored is irrelevant to the end user (and to the majority of the Hangar core).
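This backend-independence can be illustrated with a toy sketch (this is NOT Hangar's actual implementation, just the idea): the same numeric sample stored under two hypothetical on-disk encodings decodes to exactly the same values either way.

```python
import struct
import zlib

# Toy illustration of backend-arbitrary storage (not Hangar code):
# the same sample can live in different storage containers, but
# retrieval always reproduces the exact values that went in.

def store_raw(values):
    # "Backend A": raw little-endian float64 packing
    return struct.pack(f"<{len(values)}d", *values)

def store_compressed(values):
    # "Backend B": the same packing, zlib-compressed on disk
    return zlib.compress(struct.pack(f"<{len(values)}d", *values))

def read_raw(blob):
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))

def read_compressed(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

sample = [0.0, 1.0, 2.0]
# Regardless of which "backend" held the bytes, the numbers are identical.
assert read_raw(store_raw(sample)) == sample
assert read_compressed(store_compressed(sample)) == sample
```

The point of the sketch: which container held the bytes is invisible at read time; only the numeric contract matters.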
At its core, DVC is dependent on the Git project's code, as well as its worldview. For the sake of brevity, I'll avoid venturing into the inner workings of and interactions between Git and DVC; just know that the design explained below follows from Git's implementation and fundamental design.
Essentially, all DVC does is create a snapshot of some set of files (which the user marks as being "data" files, identified either by a filename suffix or by manually adding the file path to DVC). Because DVC operates in a Git directory, and because it thinks of "data" as some collection of bytes in a "file" on disk, any commit DVC generates of the "data files" will always return an exact snapshot of that data's bytes (the file contents), file format (suffix), file name, and directory path.
In the DVC model, regardless of how the needs / processing pipeline / usage of some piece of data changes in the future, if you want to see data from some previous point in time, you get the files exactly as they were written for the processes that existed at that point in time. In DVC, you must always retain the original files, their formats, and the tools/code capable of reading them.
This is a fundamental limitation of DVC, because Git was written to handle text files representing pieces of code. Thinking of "data" and "text" as analogous entities is a fallacy, disproved by the following argument:

A text file is universally stable: the encoding is universally agreed upon, and the ability to read a file containing text is a prerequisite for EVERY computer. This is NOT true for data files (and by extension, DVC). Data files are domain-specific, ever-changing, and can be very complex. Assuming that a data file will be readable in 10 years with the same tools and code we have today is simply not reasonable, advisable, or good practice in any way.

What you really want is the data itself: the directly computable set of numbers representing some piece of information in the real world, not the container in which it is stored. (i.e. what you want is a system like Hangar.)
The Hangar backends' storage methods are highly optimized. Storing numeric data is our specialty, and the team has spent countless hours (and relied on many years of experience) writing backends which are highly performant for reads while balancing compression, shuffling, integrity checking, and multi-threading/processing considerations. Performance is a main consideration, and much work has gone into making sure that Hangar has some of the highest read performance and compression levels around. I would suggest testing this on a sample set of data you deal with in the real world.
Further, most Hangar book-keeping operations (checkout, commit, diff, merge, fetch, clone, etc.) do not actually require reading the numeric data files (which can be very large) from disk in order to execute. The vast majority of operations touch only book-keeping "metadata": very small structures (~40 bytes each) describing a commit / sample / data location. Combined with highly performant algorithms (similar to those used in the Git core itself), this means that common tasks in Hangar (checkout, diff, merge, etc.) complete in less than a second, even for very large repositories / checkouts. Any operation which does require touching the actual numeric data on disk (i.e. writing new data, or reading old) is an O(1) operation (which generally has a small constant time).
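To make the "metadata-only" point concrete, here is a toy sketch (not Hangar's actual record format): if a commit is just a tiny mapping from sample name to a digest of its contents, then diffing two commits only compares those small records, and the (potentially huge) numeric data itself is never read from disk.

```python
import hashlib

# Toy sketch of metadata-only diffing (NOT Hangar's real implementation).
# A "commit" here is a small dict: sample name -> content digest.

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Two hypothetical commits; the digests stand in for large data blobs.
commit_a = {"img_0": digest(b"pixels-v1"), "img_1": digest(b"pixels-x")}
commit_b = {"img_0": digest(b"pixels-v2"), "img_1": digest(b"pixels-x")}

def diff(old, new):
    # Compare only the tiny digest records; no data bytes are touched.
    changed = [k for k in old if k in new and old[k] != new[k]]
    added = [k for k in new if k not in old]
    removed = [k for k in old if k not in new]
    return changed, added, removed

changed, added, removed = diff(commit_a, commit_b)
# 'img_0' is reported as changed; 'img_1' is untouched, and the work
# done scales with the number of records, not the size of the data.
```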
Disk space is further preserved by automatically deduplicating data. If you add a sample whose contents are identical to any sample existing anywhere in the Hangar repository's history, only a reference is saved as the addition; that reference points to the previously saved sample, which is read back upon request.
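The deduplication idea can be sketched with a minimal content-addressed store (an illustration of the concept, not Hangar's implementation): data is keyed by a digest of its bytes, so adding identical contents a second time only records a new, tiny reference.

```python
import hashlib

# Toy content-addressed store illustrating deduplication
# (a sketch of the idea, NOT Hangar's actual backend).

class DedupStore:
    def __init__(self):
        self.blobs = {}  # digest -> bytes (actual data, stored once)
        self.refs = {}   # sample name -> digest (tiny reference)

    def add(self, name: str, data: bytes) -> None:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:      # store the bytes only once
            self.blobs[key] = data
        self.refs[name] = key          # always record the reference

    def get(self, name: str) -> bytes:
        return self.blobs[self.refs[name]]

store = DedupStore()
store.add("train_0", b"\x00\x01\x02")
store.add("valid_9", b"\x00\x01\x02")  # identical contents
# Two references now exist, but only one blob is actually stored.
```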
DVC stores exactly what it is given: read speed and compression ratios are only as good as the files added to it. Without dedicated engineering effort, this commonly results in sub-par usability and increased costs through disk usage and CPU requirements during the reading / decoding phases. Also, for many operations DVC scales with O(N) or O(N^2) computational time complexity.
(For reference, the Hangar docs include a comparison of Git and Hangar commands in the same context.)

Many of the same points about performance and workflow made in the Hangar vs. DVC comparison apply equally to Hangar vs. DAT. However, DAT isn't really a formal "version control" system at all; it is a networking protocol which handles arbitrary data in the form of "files". While it is certainly relevant to Hangar (more on this if interested), I don't see them as filling the same use case:
Hangar = Version Control + Distribution Protocol
DAT = Distribution Protocol
For further reading on the details above, I would encourage you to read up on the following section of the Hangar ReadTheDocs Site:
Also, since I never addressed your comment on DVC "metrics", let me do that here.
Hangar is a much more focused project than DVC. Rather than trying to handle execution, results tracking, and pipelining of specific workflows (ML graphs / training) in the same tool which is responsible for versioning and accessing your data, we limit our scope to putting/retrieving data on/from disk, versioning it, and enabling distribution/collaboration.
I liken adding pipeline/run features directly to the Hangar core to Git building the functionality of Jenkins CI right into itself. It would be problematic for a number of reasons.
While there isn't any built-in support for "metrics" like DVC's, the general nature of Hangar makes writing your own metrics alongside any Hangar commit trivially easy:
```python
>>> co.arraysets['metrics']['AUC'] = np.array([2.414])
>>> co.arraysets['metrics']['ROC'] = np.array([0.4522])
>>> # continue as needed
>>> co.arraysets['metrics']['AUC']
array([2.414])
>>> # or, using commit metadata instead of an arrayset:
>>> co.metadata['model1-AUC'] = str(2.414)
>>> res = co.metadata['model1-AUC']
>>> res
'2.414'
>>> float(res)
2.414
```
Hope this helps. Let me know if you have any questions, comments, rebuttals, or concerns!
-Rick
Hi @rlizzo , thanks for the comprehensive explanation.
@rlizzo @hhsecond I've been looking for a version control system that can handle image pixels, and am really impressed by the comparison graphs you've shown against DVC. Thanks for all the effort the team put into building this tool from the ground up!
Unfortunately it looks like this project has gone stale though since the last commit was Sept 2, 2020. What happened in that regard? Is there anything that could be done to revive the project?
Executive Summary: How does this compare with DVC?
Additional Context / Explanation: I am not trying to start a flame war, but we spent quite a lot of time investigating DVC [1] for our purposes. One of my friends suggested taking a look at Hangar. One key thing we really like about DVC is the metrics feature. I read through the Hangar docs; it looks a lot different from DVC but quite similar to the DAT project [2]. I may be wrong. I need some help understanding the difference.
External Links
[1] https://github.com/iterative/dvc
[2] https://datproject.org/