tensorwerk / hangar-py

Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.
Apache License 2.0
204 stars 29 forks source link

Dynamic Arrayset Backend Update and Manual Configuration of Filter Options #133

Closed rlizzo closed 4 years ago

rlizzo commented 4 years ago

Motivation and Context

Why is this change required? What problem does it solve?:

Description

Describe your changes in detail:

print('set backend opts at init time')

opts = {'shuffle': True,
 'complib': 'blosc:lz4',
 'complevel': 3,
 'fletcher32': True}

dset_trimgs = co.arraysets.init_arrayset('train_images', prototype=sample_trimg, backend='00', backend_opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[0] = trimgs[0]

print('change to numpy backend with default opts')
dset_trimgs.change_default_backend('10')

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[1] = trimgs[1]

print('change to hdf5 again with different opts')

opts = {'shuffle': False,
 'complib': 'blosc:zstd',
 'complevel': 7,
 'fletcher32': True}
dset_trimgs.change_default_backend('00', opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[2] = trimgs[2]
print('records are independent files as expected')
 pp(dset_trimgs._sspecs)
set backend opts at init time
00
{'shuffle': True, 'complib': 'blosc:lz4', 'complevel': 3, 'fletcher32': True}
change to numpy backend with default opts
10
{}
change to hdf5 again with different opts
00
{'shuffle': False, 'complib': 'blosc:zstd', 'complevel': 7, 'fletcher32': True}
records are independent files as expected
{0: HDF5_00_DataHashSpec(backend='00', uid='n0nyjH', dataset='0', dataset_idx=0, shape=(784,)),
 1: NUMPY_10_DataHashSpec(backend='10', uid='2kUaI3', checksum='1257994616', collection_idx=0, shape=(784,)),
 2: HDF5_00_DataHashSpec(backend='00', uid='vuA0bJ', dataset='0', dataset_idx=0, shape=(784,))}

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Is this PR ready for review, or a work in progress?

How Has This Been Tested?

Put an x in the boxes that apply:

Checklist:

codecov[bot] commented 4 years ago

Codecov Report

Merging #133 into master will increase coverage by 0.45%. The diff coverage is 90.98%.

@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   92.81%   93.26%   +0.45%     
==========================================
  Files          61       62       +1     
  Lines       10608    10923     +315     
  Branches     1041     1059      +18     
==========================================
+ Hits         9845    10187     +342     
+ Misses        555      518      -37     
- Partials      208      218      +10
Impacted Files Coverage Δ
src/hangar/records/parsing.py 98.51% <ø> (ø) :arrow_up:
tests/test_cli.py 100% <ø> (ø) :arrow_up:
src/hangar/dataloaders/tfloader.py 75% <ø> (+6.82%) :arrow_up:
src/hangar/__init__.py 100% <ø> (+33.33%) :arrow_up:
...gar/remote/request_header_validator_interceptor.py 100% <ø> (+37.78%) :arrow_up:
src/hangar/dataloaders/torchloader.py 100% <ø> (+5.41%) :arrow_up:
tests/test_initiate.py 100% <ø> (ø) :arrow_up:
tests/test_cli_io.py 100% <ø> (ø) :arrow_up:
tests/test_arrayset_backends.py 100% <100%> (ø)
tests/test_remotes.py 98.51% <100%> (ø) :arrow_up:
... and 22 more
rlizzo commented 4 years ago

@elistevens, care to take a look at this and let me know what you think? I've got a local branch with tests, but before finalizing that I want to make sure the API and utility is what we would expect.

elistevens commented 4 years ago

The API is functional, but doesn't strike me as friendly. A couple things that would make it more user-friendly:

backend = '00'
backend = 'hdf5'
backend = {
    'name': 'hdf5',
    'complib': 'blosc:lz4',
    'complevel': 3,
    'fletcher32': True,
}

co.arraysets.init_arrayset(
    'train_images', 
    prototype=sample_trimg, 
    backend=backend)

dset_trimgs.change_default_backend(backend)
dset_trimgs.change_default_backend(**backend)

I'm not usually a fan of arguments that can be different types, but I think it makes sense here.

rlizzo commented 4 years ago

Ok. this makes sense. Will look at it from the ground up before attempting to merge.

rlizzo commented 4 years ago

So I've gone another round at this.

  • Consolidate the backend= and backend_opts= params into something like this, such that the cross product of the below backend values and the call sites work:

Done! The format i've gone with is:

# for backend specification, but default options
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='00')

# for backend specification, and manual options
opts = {
    'backend': '00',
    'shuffle': 'byte',
    'complib': 'blosc:zstd',
    'complevel': 5,
    'fletcher32': True,
}
co.arraysets.init_arrayset('foo',  prototype=np.arange(10),  backend_opts=opts)

# required opts depend on backend. `numpy` backend requires none, so
co.arraysets.init_arrayset('foo',  prototype=np.arange(10),  backend_opts='10')
# is equivalent to
co.arraysets.init_arrayset('foo',  prototype=np.arange(10),  backend_opts={'backend': '10'})
  • Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.

I'd like to hold off on this one for now because:

1) This would have to go on the client side, and we don't have a good method to handle configuration for repo parameters yet (literally all we have is a single .ini file to read user_name and user_email) 2) In my view, It's actually not that necessary? The opts used to set up the storage backend are actually saved as part of arrayset's schema (version controlled in the repository). To hangar, these options are literally as much a part of the definition of an arrayset as the shape or dtype. Their record is permanently persisted, and available on either a read or write-enabled checkout. To create a different arrayset with them, just access the arraysets .backend_opts property, and feed it into the init_arrayset() function. This persists across a network as well... If you push to a hangar server, and I pull that arrayset from it, I'll automatically put the data in the same backend you did (with the same options).

I think it's a nice quality of life addition, but we need a real configuration management solution on the client before going down this road. How much of a pain will this be for you? (I'm probably being quite myopic here...)

  • Allow human-readable backend names instead of keys. 'hdf5' instead of '00' etc.

TBD. The format codes are the way everything is organized in the backend. the codes are needed to remain unambiguous about which backend a sample record spec belongs to (There's a 1000% chance that there will be more than hdf5 - and probably numpy/tdb - based backend in the future...) We chose not to go with names to avoid "human" problems in the first place.

I know it's a trivial mapping to make, but it's not necessary right now, and can be added in the future when there's more then 2-3 backends