Closed rlizzo closed 4 years ago
Merging #133 into master will increase coverage by
0.45%
. The diff coverage is90.98%
.
@@ Coverage Diff @@
## master #133 +/- ##
==========================================
+ Coverage 92.81% 93.26% +0.45%
==========================================
Files 61 62 +1
Lines 10608 10923 +315
Branches 1041 1059 +18
==========================================
+ Hits 9845 10187 +342
+ Misses 555 518 -37
- Partials 208 218 +10
Impacted Files | Coverage Δ | |
---|---|---|
src/hangar/records/parsing.py | 98.51% <ø> (ø) |
:arrow_up: |
tests/test_cli.py | 100% <ø> (ø) |
:arrow_up: |
src/hangar/dataloaders/tfloader.py | 75% <ø> (+6.82%) |
:arrow_up: |
src/hangar/__init__.py | 100% <ø> (+33.33%) |
:arrow_up: |
...gar/remote/request_header_validator_interceptor.py | 100% <ø> (+37.78%) |
:arrow_up: |
src/hangar/dataloaders/torchloader.py | 100% <ø> (+5.41%) |
:arrow_up: |
tests/test_initiate.py | 100% <ø> (ø) |
:arrow_up: |
tests/test_cli_io.py | 100% <ø> (ø) |
:arrow_up: |
tests/test_arrayset_backends.py | 100% <100%> (ø) |
|
tests/test_remotes.py | 98.51% <100%> (ø) |
:arrow_up: |
... and 22 more |
@elistevens, care to take a look at this and let me know what you think? I've got a local branch with tests, but before finalizing that I want to make sure the API and utility is what we would expect.
The API is functional, but doesn't strike me as friendly. A couple things that would make it more user-friendly:
'hdf5'
instead of '00'
etc.backend=
and backend_opts=
params into something like this, such that the cross product of the below backend
values and the call sites work:backend = '00'
backend = 'hdf5'
backend = {
'name': 'hdf5',
'complib': 'blosc:lz4',
'complevel': 3,
'fletcher32': True,
}
co.arraysets.init_arrayset(
'train_images',
prototype=sample_trimg,
backend=backend)
dset_trimgs.change_default_backend(backend)
dset_trimgs.change_default_backend(**backend)
I'm not usually a fan of arguments that can be different types, but I think it makes sense here.
Ok. this makes sense. Will look at it from the ground up before attempting to merge.
So I've gone another round at this.
- Consolidate the
backend=
andbackend_opts=
params into something like this, such that the cross product of the belowbackend
values and the call sites work:
Done! The format i've gone with is:
# for backend specification, but default options
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='00')
# for backend specification, and manual options
opts = {
'backend': '00',
'shuffle': 'byte',
'complib': 'blosc:zstd',
'complevel': 5,
'fletcher32': True,
}
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts=opts)
# required opts depend on backend. `numpy` backend requires none, so
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='10')
# is equivalent to
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts={'backend': '10'})
- Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.
I'd like to hold off on this one for now because:
1) This would have to go on the client side, and we don't have a good method to handle configuration for repo parameters yet (literally all we have is a single .ini
file to read user_name
and user_email
)
2) In my view, It's actually not that necessary? The opts used to set up the storage backend are actually saved as part of arrayset
's schema (version controlled in the repository). To hangar, these options are literally as much a part of the definition of an arrayset
as the shape
or dtype
. Their record is permanently persisted, and available on either a read
or write-enabled
checkout. To create a different arrayset with them, just access the arraysets .backend_opts
property, and feed it into the init_arrayset()
function. This persists across a network as well... If you push to a hangar server, and I pull that arrayset
from it, I'll automatically put the data in the same backend you did (with the same options).
I think it's a nice quality of life addition, but we need a real configuration management solution on the client before going down this road. How much of a pain will this be for you? (I'm probably being quite myopic here...)
- Allow human-readable backend names instead of keys.
'hdf5'
instead of'00'
etc.
TBD. The format codes
are the way everything is organized in the backend. the codes are needed to remain unambiguous about which backend a sample record spec belongs to (There's a 1000% chance that there will be more than hdf5
- and probably numpy/tdb
- based backend in the future...) We chose not to go with names to avoid "human"
problems in the first place.
I know it's a trivial mapping to make, but it's not necessary right now, and can be added in the future when there's more then 2-3 backends
Motivation and Context
Why is this change required? What problem does it solve?:
Description
Describe your changes in detail:
Screenshots (if appropriate):
Types of changes
What types of changes does your code introduce? Put an
x
in all the boxes that apply:Is this PR ready for review, or a work in progress?
How Has This Been Tested?
Put an
x
in the boxes that apply:Checklist: