Consider splitting/refactoring `base_test.py`

axelboc commented 2 months ago

Very low priority, but I find that base_test.py is becoming quite long and is, generally speaking, difficult to read. I feel like we could make things clearer at least by:

having multiple "base" test files -- maybe at least one per endpoint (though /data/ could probably use multiple files)
splitting the HDF5 files creation logic out of the "base" test files.

Moreover, there are a lot of things to test:

API framework (Flask, Fast API, Tornado)
Endpoint (/data/, /meta/, etc.)
Query param (dtype=safe, format=bin, etc.)
HDF5 entity (group, dataset, datatype, soft links, external links
HDF5 type (int, float, bool, enum, compound, vlen, etc.)
HDF5 shape (empty, scalar, 1D, 2D+) & chunks
HDF5 attributes
HDF5 filters, virtual datasets, etc.
Errors (file/entity not found, etc.)

That's a lot of features and a lot of permutations if we really want to be thorough. It might be worth creating a few big HDF5 files like I did in H5Web (sample.h5) to make sure that we test /data/ and /meta/ with as many permutations of dtype, shape, format, etc., as we can.

loichuder commented 2 months ago

I do agree that there is room for improvement. Notably, I always find a bit misleading that the real tests are in base_test while most test_... files are in fact fixtures.

And I always need to take 5mn to understand how the fixtures are fed in the base tests.

Perhaps levering pytest's parametrize could allow us to make things clearer.

splitting the HDF5 files creation logic out of the "base" test files.

Not sure about this one. I like that I can check what HDF5 file is tested with one glance on the test. Writing tests with big HDF5 files requires external knowledge about these files.

axelboc commented 2 months ago

Notably, I always find a bit misleading that the real tests are in basetest while most test... files are in fact fixtures.

Haha true.

Not sure about this one. I like that I can check what HDF5 file is tested with one glance on the test. Writing tests with big HDF5 files requires external knowledge about these files.

It's just a lot of repetition: we initialise a dataset path and numpy value, create a file, create the dataset, access it, decode it, and assert that the returned value matches the initial value exactly. If we had one big dictionary of initial values, we could just loop through them to make sure h5grove supports them all and reads them all correctly. This could even be a single test, to be honest - i.e. a smoke test.

axelboc commented 2 months ago

This smoke test would work particularly well for the /data/ endpoint with format=bin.

Generally speaking, I would prefer to have one test per format and for each test to loop through multiple datasets, rather than one test per dataset and for each test to loop through multiple formats (what we have currently).

silx-kit / h5grove

Consider splitting/refactoring `base_test.py` #100