rlizzo commented 4 years ago

Motivation and Context

Why is this change required? What problem does it solve?:

Ability for backend filter opts to be set at arrayset init time
Added user facing calls which allow backend and opts to by changed at any time in the future, without rewriting all the data contained in an arrayset

Description

Describe your changes in detail:

print('set backend opts at init time')

opts = {'shuffle': True,
 'complib': 'blosc:lz4',
 'complevel': 3,
 'fletcher32': True}

dset_trimgs = co.arraysets.init_arrayset('train_images', prototype=sample_trimg, backend='00', backend_opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[0] = trimgs[0]

print('change to numpy backend with default opts')
dset_trimgs.change_default_backend('10')

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[1] = trimgs[1]

print('change to hdf5 again with different opts')

opts = {'shuffle': False,
 'complib': 'blosc:zstd',
 'complevel': 7,
 'fletcher32': True}
dset_trimgs.change_default_backend('00', opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[2] = trimgs[2]
print('records are independent files as expected')
 pp(dset_trimgs._sspecs)

set backend opts at init time
00
{'shuffle': True, 'complib': 'blosc:lz4', 'complevel': 3, 'fletcher32': True}
change to numpy backend with default opts
10
{}
change to hdf5 again with different opts
00
{'shuffle': False, 'complib': 'blosc:zstd', 'complevel': 7, 'fletcher32': True}
records are independent files as expected
{0: HDF5_00_DataHashSpec(backend='00', uid='n0nyjH', dataset='0', dataset_idx=0, shape=(784,)),
 1: NUMPY_10_DataHashSpec(backend='10', uid='2kUaI3', checksum='1257994616', collection_idx=0, shape=(784,)),
 2: HDF5_00_DataHashSpec(backend='00', uid='vuA0bJ', dataset='0', dataset_idx=0, shape=(784,))}

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

[ ] Documentation update
[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

[x] Ready for review
[ ] Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

[ ] Current tests cover modifications made
[x] New tests have been added to the test suite
[x] Modifications were made to existing tests to support these changes
[ ] Tests may be needed, but they are not included when the PR was proposed
[ ] I don't know. Help!

Checklist:

[x] My code follows the code style of this project.
[x] My change requires a change to the documentation.
[x] I have updated the documentation accordingly.
[x] I have read the CONTRIBUTING document.
[x] I have signed (or will sign when prompted) the tensorwork CLA.
[x] I have added tests to cover my changes.
[x] All new and existing tests passed.

codecov[bot] commented 4 years ago

Codecov Report

Merging #133 into master will increase coverage by 0.45%. The diff coverage is 90.98%.

@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   92.81%   93.26%   +0.45%     
==========================================
  Files          61       62       +1     
  Lines       10608    10923     +315     
  Branches     1041     1059      +18     
==========================================
+ Hits         9845    10187     +342     
+ Misses        555      518      -37     
- Partials      208      218      +10

Impacted Files	Coverage Δ
src/hangar/records/parsing.py	`98.51% <ø> (ø)`	:arrow_up:
tests/test_cli.py	`100% <ø> (ø)`	:arrow_up:
src/hangar/dataloaders/tfloader.py	`75% <ø> (+6.82%)`	:arrow_up:
src/hangar/__init__.py	`100% <ø> (+33.33%)`	:arrow_up:
...gar/remote/request_header_validator_interceptor.py	`100% <ø> (+37.78%)`	:arrow_up:
src/hangar/dataloaders/torchloader.py	`100% <ø> (+5.41%)`	:arrow_up:
tests/test_initiate.py	`100% <ø> (ø)`	:arrow_up:
tests/test_cli_io.py	`100% <ø> (ø)`	:arrow_up:
tests/test_arrayset_backends.py	`100% <100%> (ø)`
tests/test_remotes.py	`98.51% <100%> (ø)`	:arrow_up:
... and 22 more

rlizzo commented 4 years ago

@elistevens, care to take a look at this and let me know what you think? I've got a local branch with tests, but before finalizing that I want to make sure the API and utility is what we would expect.

elistevens commented 4 years ago

The API is functional, but doesn't strike me as friendly. A couple things that would make it more user-friendly:

Allow human-readable backend names instead of keys. 'hdf5' instead of '00' etc.
Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.
Consolidate the backend= and backend_opts= params into something like this, such that the cross product of the below backend values and the call sites work:

backend = '00'
backend = 'hdf5'
backend = {
    'name': 'hdf5',
    'complib': 'blosc:lz4',
    'complevel': 3,
    'fletcher32': True,
}

co.arraysets.init_arrayset(
    'train_images', 
    prototype=sample_trimg, 
    backend=backend)

dset_trimgs.change_default_backend(backend)
dset_trimgs.change_default_backend(**backend)

I'm not usually a fan of arguments that can be different types, but I think it makes sense here.

rlizzo commented 4 years ago

Ok. this makes sense. Will look at it from the ground up before attempting to merge.

rlizzo commented 4 years ago

So I've gone another round at this.

Consolidate the backend= and backend_opts= params into something like this, such that the cross product of the below backend values and the call sites work:

Done! The format i've gone with is:

# for backend specification, but default options
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='00')

# for backend specification, and manual options
opts = {
    'backend': '00',
    'shuffle': 'byte',
    'complib': 'blosc:zstd',
    'complevel': 5,
    'fletcher32': True,
}
co.arraysets.init_arrayset('foo',  prototype=np.arange(10),  backend_opts=opts)

# required opts depend on backend. `numpy` backend requires none, so
co.arraysets.init_arrayset('foo',  prototype=np.arange(10),  backend_opts='10')
# is equivalent to
co.arraysets.init_arrayset('foo',  prototype=np.arange(10),  backend_opts={'backend': '10'})

Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.

I'd like to hold off on this one for now because:

1) This would have to go on the client side, and we don't have a good method to handle configuration for repo parameters yet (literally all we have is a single .ini file to read user_name and user_email) 2) In my view, It's actually not that necessary? The opts used to set up the storage backend are actually saved as part of arrayset's schema (version controlled in the repository). To hangar, these options are literally as much a part of the definition of an arrayset as the shape or dtype. Their record is permanently persisted, and available on either a read or write-enabled checkout. To create a different arrayset with them, just access the arraysets .backend_opts property, and feed it into the init_arrayset() function. This persists across a network as well... If you push to a hangar server, and I pull that arrayset from it, I'll automatically put the data in the same backend you did (with the same options).

I think it's a nice quality of life addition, but we need a real configuration management solution on the client before going down this road. How much of a pain will this be for you? (I'm probably being quite myopic here...)

Allow human-readable backend names instead of keys. 'hdf5' instead of '00' etc.

TBD. The format codes are the way everything is organized in the backend. the codes are needed to remain unambiguous about which backend a sample record spec belongs to (There's a 1000% chance that there will be more than hdf5 - and probably numpy/tdb - based backend in the future...) We chose not to go with names to avoid "human" problems in the first place.

I know it's a trivial mapping to make, but it's not necessary right now, and can be added in the future when there's more then 2-3 backends

tensorwerk / hangar-py

Dynamic Arrayset Backend Update and Manual Configuration of Filter Options #133

Motivation and Context

Why is this change required? What problem does it solve?:

Description

Describe your changes in detail:

Screenshots (if appropriate):

Types of changes

How Has This Been Tested?

Checklist:

Codecov Report