kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Add catalog dataset for HDF formats not supported by pandas #240

Open tomaz-suller opened 2 years ago

tomaz-suller commented 2 years ago

Description

HDF files which were not created by pandas.to_hdf sometimes cannot be read using pandas.read_hdf. Since Kedro's HDFDataSet depends on pandas, these files cannot be added to the catalog at all.

Context

The dynamic simulation software employed in my research group outputs exclusively .h5 files, which contain information we wish to feed to ML models using Kedro. Currently we use a Python script which converts these HDF files into CSVs so we can track them with Kedro, but this is an inefficient process, as we are required to rewrite thousands of files just to process them.

Ideally, we would like to add our dataset to the data catalog just like we do with our CSV datasets, but in a way that can read any kind of HDF file, unlike kedro.extras.datasets.pandas.HDFDataSet.

Given pandas cannot read HDF files which do not conform to its spec (apparently by design, according to this issue), this simple addition would benefit any user who stores information in HDF files, whether because it is their preferred storage format or because (as in our case) they use software which outputs HDF directly.
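To illustrate the mismatch (with hypothetical file and key names): a generic HDF5 file written with h5py is typically not readable via pandas.read_hdf, which expects the PyTables-based layout that pandas.to_hdf produces.

import h5py
import numpy as np
import pandas as pd

# Write a generic HDF5 file, not in the pandas/PyTables layout
with h5py.File("simulation.h5", "w") as h5f:
    h5f.create_dataset("time_series_data", data=np.random.rand(100, 3))

# Typically raises (e.g. a TypeError), since the file lacks pandas' metadata
pd.read_hdf("simulation.h5")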

Possible Implementation

My research colleague @Eric-OG and I believe we can implement this on our own. I think it's worth noting it'd be our first contribution to an open source project, but we've read the guidelines and so forth.

It would basically involve copying one of the existing datasets (possibly even pandas.HDFDataSet) and adapting it to use another library. We planned on using h5py.

Possible Alternatives

datajoely commented 2 years ago

Hi @tomaz-suller @Eric-OG thanks for the issue - this is an interesting one!

Very exciting that Kedro can be bestowed the honour of being your first open source contribution! We have contributing guidelines here and here. We can help you along this journey, and the tutorial for defining / extending your own dataset can be found here.

As a rule we try not to alter the underlying API too much; that's why pandas' limitation here becomes our limitation. In this case I think I agree it makes sense to extend things so that your way of working is supported.

Shall we use this issue to discuss a proposed implementation before writing too much code? Would you mind posting a pseudocode / outline version of how you would extend pandas.HDFDataSet, or simply define a new low-level hdf.h5py dataset?

Eric-OG commented 2 years ago

Hello @datajoely, our main motivation for an interface that doesn't use pandas is to avoid the pandas HDF5 machinery, since the data we were using couldn't be imported with pandas' HDFStore. Therefore, we intend to implement an interface that reads an HDF5 file with h5py and returns it as an h5py File object.

One instance of a dataset where we would apply this would be a simple .h5 file with a single dataset (a single key) with a basic array as data.

The outline of our idea is the following:

By using the File object, it would be possible to access data from an h5 file with any structure.

Our main doubt is whether this is possible, because of the way Kedro uses fsspec: the files are read and written as binary, and we don't know whether that can be done through h5py's interface. Is it possible to read and write directly with h5py (skipping the binary steps) in Kedro?
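One thing we could test (purely an assumption on our part: h5py ≥ 2.9 accepts seekable Python file-like objects, and fsspec's open in binary mode returns one for most filesystems) is whether an fsspec file handle can be passed to h5py.File directly, for example:

import fsspec
import h5py

# Hypothetical path; for local files fsspec yields a regular seekable file object
with fsspec.open("data/01_raw/sim/0001/pos.h5", mode="rb") as fs_file:
    with h5py.File(fs_file, "r") as h5f:
        print(list(h5f.keys()))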

tomaz-suller commented 2 years ago

So, we figured out a way to create an h5py.File object from binary data using the h5py low-level API (in fact we copied the answers from this question on StackOverflow).

The basic idea would be to get the binary data from fsspec and turn it into a high-level h5py.File object which we can access as if reading the file straight from the local filesystem. This snippet would do the trick:

import tempfile
import h5py

def h5py_from_binary(binary):
  # Build an in-memory (core driver) file access property list and load the
  # binary image into it, so nothing is written to the local filesystem
  file_access_property_list = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
  file_access_property_list.set_fapl_core(backing_store=False)
  file_access_property_list.set_file_image(binary)

  file_id_args = {
    'fapl': file_access_property_list,
    'flags': h5py.h5f.ACC_RDONLY,
    # A (throwaway) name is still required even though nothing touches disk
    'name': next(tempfile._get_candidate_names()).encode(),
  }
  h5_file_args = {
    'backing_store': False,
    'driver': 'core',
    'mode': 'r'
  }

  # Open the low-level file identifier and wrap it in a high-level File object
  file_id = h5py.h5f.open(**file_id_args)
  return h5py.File(file_id, **h5_file_args)

We tested it on Google Colab and it worked, but there is still one problem with it: we don't quite understand what is going on there. Seemingly an in-memory temp file is created from the contents of the binary and passed to the h5py.File constructor, but the parameters passed to the file_access_property_list are a bit obscure. We are worried that would make the code hard to maintain in the long run.

Writing the file is less obscure: we write the h5py.File object into a binary buffer and then read the contents of the buffer to get the binary representation we'd pass to fsspec:

from io import BytesIO
import h5py

def h5py_to_binary(h5f):
  bio = BytesIO()
  with h5py.File(bio, 'w') as f:
    # Recreate each top-level dataset in an in-memory file backed by the buffer
    for key in h5f.keys():
      f.create_dataset_like(key, h5f[key])
    f.close()
    return bio.getvalue()

Edit: we found out about the create_dataset_like method from an issue on h5py's repo.

These solutions cover both the _save and _load operations an AbstractDataSet needs to implement. We'd then basically have to replace those methods in pandas.HDFDataSet with something like the code I sent here.
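To make that concrete, here's a rough, untested sketch of how the helpers above might slot into a minimal AbstractDataSet; the fsspec wiring (get_protocol_and_path / get_filepath_str from kedro.io.core) is copied from how other kedro datasets do it, and everything else is just an assumption for illustration:

from copy import deepcopy
from pathlib import PurePosixPath

import fsspec
import h5py
from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class H5pyDataSet(AbstractDataSet):
    """Loads/saves arbitrary HDF5 files via h5py, keeping all IO in the dataset."""

    def __init__(self, filepath, fs_args=None):
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol, **deepcopy(fs_args or {}))

    def _load(self) -> h5py.File:
        load_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(load_path, mode="rb") as fs_file:
            return h5py_from_binary(fs_file.read())  # helper from the snippets above

    def _save(self, data: h5py.File) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(save_path, mode="wb") as fs_file:
            fs_file.write(h5py_to_binary(data))      # helper from the snippets above

    def _describe(self) -> dict:
        return dict(filepath=self._filepath, protocol=self._protocol)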

@datajoely, you mentioned it might be best not to write a ton of code before discussing the design, so tomorrow we'll try to get familiar with the Kedro build process and then implement the h5py.HDFDataSet once we hear back from you. Thanks for the links, by the way!

tomaz-suller commented 2 years ago

I've also just realised the name of the file in the obscure file_id_args can be anything you want, so we'd be able to keep the file name as an attribute of the h5py.File:

import h5py

def h5py_from_binary(binary, filename):
  file_access_property_list = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
  file_access_property_list.set_fapl_core(backing_store=False)
  file_access_property_list.set_file_image(binary)

  file_id_args = {
    'fapl': file_access_property_list,
    'flags': h5py.h5f.ACC_RDONLY,
    # Use the dataset's own file name instead of a random temporary one
    'name': filename.encode('utf-8'),
  }
  h5_file_args = {
    'backing_store': False,
    'driver': 'core',
    'mode': 'r'
  }

  file_id = h5py.h5f.open(**file_id_args)
  return h5py.File(file_id, **h5_file_args)

Edit: the str.encode('utf-8') came from this StackOverflow question.

I don't know if that's useful, but it makes the code a bit less weird and removes a dependency on tempfile (which is a native Python module anyway).

antonymilne commented 2 years ago

Agreed with @datajoely that this is an interesting one, thanks very much for raising the issue and for choosing kedro for your first open source contribution! Just shout if you need any help.

Just building on what Joel says regarding implementation, this is potentially a bit tricky to do correctly. The options are:

  1. Extend the existing pandas.HDFDataSet to handle your dataset. This would only work for a simple array, though. Possible methods might be odo, or this, or pd.HDFStore (which underlies pd.read_hdf and pd.to_hdf).
  2. Make a new hdf.h5py_dataset. We would need to make sure that this is done in such a way as to separate the IO part of operations into the dataset definition/kedro catalog, so that code run in a node itself does not mutate the underlying files. Looking at the h5py documentation, I believe this would mean making the _load/_save methods use Group (and not a File, which seems to be the easiest/most common way of using h5py).

As I see it, option 2 is preferred because it's the most general: this way a user can use kedro with any kind of h5 file, not just things which are coercible to a pandas DataFrame. You could store proper hierarchical data this way, and in the node pick out the bits you want and convert them to a pandas DataFrame (or just a numpy array, or whatever). It also leaves open the possibility of an hdf.pytables_dataset in future that uses pytables rather than h5py.

However, it is potentially harder than option 1. In your case it sounds like you're only interested in loading a single array into a pandas DataFrame? If this is what people generally do, then we can probably get the majority of the useful functionality (being able to convert simple h5 files into pandas DataFrames) for less work than the fully general solution.
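To make option 2 concrete, the node downstream of such a dataset might look something like this (purely illustrative; the key name is just an example):

import h5py
import pandas as pd


def extract_time_series(h5f: h5py.File) -> pd.DataFrame:
    # The catalog/dataset does the IO; the node only picks out what it needs
    # and coerces it, leaving the underlying file untouched.
    return pd.DataFrame(h5f["time_series_data"][()])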

antonymilne commented 2 years ago

Oh sorry, I hadn't refreshed my page and so completely missed your posts, @Eric-OG and @tomaz-suller. Pretend that my comment above came directly after @datajoely's! 🤦

I'll take a proper read through what you wrote later, but just 3 random things for now:

tomaz-suller commented 2 years ago

Hi @AntonyMilneQB. So, I'll try and answer the points you raised.

We would need to make sure that this is done in such a way as to separate the IO part of operations into the dataset definition/kedro catalog, so that code run in a node itself does not mutate the underlying files.

To be really honest with you, I didn't get what you meant... could you elaborate further? Do you mean we could have problems with the methods we suggested corrupting files, i.e. writing files to the filesystem which could only be read using our method? If so, we haven't tested that yet, but it's simple to do, so I'll try it and post the results here.

In your case it sounds like you're only be interested in loading a single array into a pandas DataFrame? If this is what people generally do [...]

That is precisely what we want, meaning the approach you mentioned of using a single Group object would work for our use case. I can't say, though, whether this is what people most often wish to do. And since I didn't understand what issue we'd have using File, it wouldn't make sense to us to restrict the implementation to Group - in the end we'd rather implement the more generic solution using File and let the user decide what to do with the object.

[...] and skimming through what you wrote I think we should be able to achieve that in a satisfactory way.

Yeah, I do agree, just the fact that we didn't understand how the low-level API does its thing bugs me a bit.

Another possible implementation: is there a way we could just use PartitionedDataSet? I'm not actually sure how these h5 files appear on disk - is it multiple files in one directory or just one file?

Again, that is precisely what we wish to implement, but isn't PartitionedDataSet just something you add "on top" of an AbstractDataSet implementation? I'm asking because part of our catalog.yaml currently looks like this:

sim_dataset:
  layer: raw
  type: PartitionedDataSet
  path: data/01_raw/sim
  dataset:
    type: pandas.CSVDataSet
    save_args:
      index: False
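
Our hope would be to keep the partitioned wrapper and simply swap the inner dataset type, something like this (just a sketch with a hypothetical type name):

sim_dataset:
  layer: raw
  type: PartitionedDataSet
  path: data/01_raw/sim
  dataset:
    type: hdf5.H5pyDataSet  # hypothetical h5py-backed dataset type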

I'm adding some additional information about our use case below as context, but it's not an essential read.


Detailing our use case further: the dynamic simulator I mentioned initially outputs one pos.h5 file for each set of initial conditions, generating a structure like this:

$ tree -An
├── 0001
│   └── pos.h5
├── 0002
│   └── pos.h5
...
├── 7002
│   └── pos.h5
└── 7003
    └── pos.h5

These ~7000 files represent just one set of simulations - we need to analyse at least 20 of them. Anyway, each file contains information on the simulated conditions and the actual data generated, which we wish to analyse; we use the File API simply because, according to h5py's docs, it's the usual interface for accessing HDF files. As an example:

>>> import h5py
>>> h5f = h5py.File('pos.h5')
>>> h5f.keys()
<KeysViewHDF5 ['combination_id', ..., 'time_series_data']>
# The file contains a single Group, and we are particularly interested in the time_series_data Dataset
>>> type(h5f['time_series_data'])
h5py._hl.dataset.Dataset

Eric-OG commented 2 years ago

Hello @AntonyMilneQB, regarding using Group instead of File:

  • What do you think of using Group rather than File as the fundamental object that is loaded/saved? Would this work?

From what we have researched, it would be necessary to use a File object, since we need to save the entire HDF5 hierarchy with the _save function. It is important to note that the File class actually extends the Group class, serving as a special group that represents the root of an HDF5 file.

Because of that, given h5py's File implementation, the interfaces for File and Group should be pretty much the same. However, we need to use File objects because we would always be accessing the root group.
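A quick way to check this point (with a hypothetical file name):

import h5py

with h5py.File("example.h5", "w") as h5f:
    assert isinstance(h5f, h5py.Group)  # File subclasses Group
    assert h5f.name == "/"              # and represents the root group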

tomaz-suller commented 2 years ago

Two new things.

Firstly, the method I sent here to turn an h5py.File into binary was wrong: it didn't actually copy the data, and it didn't work with more than a single Group in one File. The code that does work is this:

from io import BytesIO
import h5py

def h5py_to_binary(h5f):
  bio = BytesIO()
  with h5py.File(bio, 'w') as biof:
    # Copy each top-level object (data included, with links and references expanded)
    # into the in-memory file backed by the buffer
    for key, value in h5f.items():
        h5f.copy(value, biof,
                 expand_soft=True,
                 expand_external=True,
                 expand_refs=True)
    biof.close()
    return bio.getvalue()

Secondly, complementing what @Eric-OG mentioned: @AntonyMilneQB, we could get a Group from a File simply by calling the Group.get() method on /:

>>> import h5py
>>> h5f = h5py.File('pos.h5')
>>> type(h5f.get('/'))
h5py._hl.group.Group

So yes, our solution would work using Group as well, since, as Eric noted, File extends it. We'll start implementing the changes we thought about as a custom dataset in our own Kedro repo and see whether more issues like the first one I mentioned here show up, but they shouldn't. If everything goes according to plan, we should now be able to read from and write to HDF files of any form, even those containing more than one Group per file.

datajoely commented 2 years ago

Hi @tomaz-suller I think we're at a good point in terms of thinking - it may be worth trying to scratch up an implementation. We have docs here on how to create a custom dataset: https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html

I would ask that you try and implement two Kedro-style pieces of functionality:

tomaz-suller commented 2 years ago

Hey @datajoely. @Eric-OG and I have experimented a bit with the functions I sent earlier, plugged into the ImageDataSet example provided in the docs, and it seems to work. We've now written an adaptation on top of pandas.HDFDataSet, since it was what inspired us in the first place and it already handles reading and writing binary data. The modification is on this fork Eric created. We're calling it hdf5.H5pyDataSet so people don't confuse it with something like HDFS, and so we could later have e.g. an hdf5.PyTablesDataSet like you suggested.

We'll test it in our own (private) Kedro project to iron out any clear bugs and then start adapting the documentation and writing the unit tests. By the way, I was considering creating the PR once we believe the dataset works and then working on the unit tests from the PR rather than from our fork, since we don't have that much experience in writing Python unit tests. What do you think?
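For reference, the kind of round-trip test we have in mind would look something like this (just a sketch: the import path and constructor signature are assumptions based on our fork's module layout and on pandas.HDFDataSet, not code that exists yet):

import h5py
import numpy as np

from kedro.extras.datasets.hdf5 import H5pyDataSet  # assumed import path


def test_save_and_load_round_trip(tmp_path):
    # Build a small source file with a single array, mimicking our pos.h5 files
    source = tmp_path / "source.h5"
    with h5py.File(source, "w") as h5f:
        h5f.create_dataset("time_series_data", data=np.arange(10))

    data_set = H5pyDataSet(filepath=str(tmp_path / "saved.h5"))
    with h5py.File(source, "r") as h5f:
        data_set.save(h5f)

    reloaded = data_set.load()
    np.testing.assert_array_equal(reloaded["time_series_data"][()], np.arange(10))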

On a side note, I tried installing Kedro's development dependencies as explained in the documentation but the build fails. I think it might have to do with me using a venv rather than a conda environment, but is that expected? The error I get is this:

Collecting pyproj<3.0,>=2.2.0
  Using cached pyproj-2.6.1.post1.tar.gz (545 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  ERROR: Command errored out with exit status 1:
   command: /mnt/ubuntu_home/tomaz/git/IC/kedro-environment/env/kedro-environment/bin/python /mnt/ubuntu_home/tomaz/git/IC/kedro-environment/env/kedro-environment/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpt2bamnei
       cwd: /tmp/pip-install-ixyer6k_/pyproj_f19efd7fe38b4162a1a6b9568a8a84df
  Complete output (1 lines):
  proj executable not found. Please set the PROJ_DIR variable.For more information see: https://pyproj4.github.io/pyproj/stable/installation.html
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/a8/f9/e7876096af700e7987f5e5abffa09b783ab31157083f76cd88c83e3dbe95/pyproj-2.6.1.post1.tar.gz#sha256=4f5b02b4abbd41610397c635b275a8ee4a2b5bc72a75572b98ac6ae7befa471e (from https://pypi.org/simple/pyproj/) (requires-python:>=3.5). Command errored out with exit status 1: /mnt/ubuntu_home/tomaz/git/IC/kedro-environment/env/kedro-environment/bin/python /mnt/ubuntu_home/tomaz/git/IC/kedro-environment/env/kedro-environment/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpt2bamnei Check the logs for full command output.

I also tried building with sudo, which I know is not the right way to do it, but still got the same error. If that is a bug I can create another issue and detail my environment better in there.

tomaz-suller commented 2 years ago

For transparency's sake, I'll be upfront and say @Eric-OG and I won't work on this issue for some time because of other obligations. I think we can be reasonably sure that we'll have at least opened the PR before the end of the year.

Meanwhile, if anyone's interested, the fork I linked in this thread has an initial implementation, though saving hasn't been thoroughly tested manually and we don't know how to integrate the save_args.

datajoely commented 2 years ago

Thanks @tomaz-suller, I've added a 'Help wanted' label in case anyone else wants to have a go. Likewise, the core team is unlikely to have time to take over this request.

Regarding your point on save_args, there is nothing wrong with only providing load_args in the first instance and simply raising a NotImplementedError or DataSetError on save, like we do for the APIDataSet. Pandas does something similar: it exposes a pd.read_sas method but no equivalent pd.DataFrame.to_sas.

It's something we could revisit in another PR if users are asking for it.
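As a minimal sketch of that pattern (assuming the DataSetError exception from kedro.io.core that the other datasets use), the dataset's _save could simply be:

from kedro.io.core import DataSetError


def _save(self, data) -> None:
    # Load-only dataset for now: fail loudly instead of silently ignoring saves
    raise DataSetError(f"Saving is not supported for {self.__class__.__name__}")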

tomaz-suller commented 1 year ago

I decided to come back to this issue and finally adapt the tests we needed so we could submit a PR.

All the tests I've developed in our fork are currently passing, but I don't know how to get to 100% coverage. pytest complains that a couple of lines in a parameter dictionary are not covered, and I just don't know what to do about it. This is the snippet, in kedro.extras.datasets.hdf5.h5py_dataset.__h5py_from_binary:

file_id_args = {
    "fapl": file_access_property_list,
    "flags": h5py.h5f.ACC_RDONLY,
    "name": next(tempfile._get_candidate_names()).encode(),
}

Another issue I'm facing is that I can't make linting pass. I get this output from pre-commit:

Trim Trailing Whitespace.................................................Passed
Fix End of Files.........................................................Passed
Check Yaml...........................................(no files to check)Skipped
Check JSON...........................................(no files to check)Skipped
Check for added large files..............................................Passed
Check for case conflicts.................................................Passed
Check for merge conflicts................................................Passed
Debug Statements (Python)................................................Passed
Fix requirements.txt.................................(no files to check)Skipped
Flake8...................................................................Passed
mypy.....................................................................Passed
blacken-docs.............................................................Passed
pyupgrade................................................................Passed
Sort imports.............................................................Passed
Black....................................................................Passed
Import Linter............................................................Passed
Secret scan..............................................................Passed
Bandit security check....................................................Failed
- hook id: bandit
- exit code: 2

[main]  INFO    profile include tests: None
[main]  INFO    profile exclude tests: None
[main]  INFO    cli include tests: None
[main]  INFO    cli exclude tests: None
[main]  INFO    running on Python 3.7.13
[main]  ERROR   No tests would be run, please check the profile.

Quick Pylint on kedro/*..................................................Passed
Quick Pylint on features/*...........................(no files to check)Skipped
Quick Pylint on tests/*..................................................Failed
- hook id: pylint-quick-tests
- exit code: 26

************* Module /home/tomaz/git/IC/kedro/pyproject.toml
pyproject.toml:1:0: R0022: Useless option value for '--disable', 'bad-continuation' was removed from pylint, see https://github.com/PyCQA/pylint/pull/3571. (useless-option-value)
************* Module Command line
Command line:1:0: R0022: Useless option value for '--disable', 'no-self-use' was moved to an optional extension, see https://pylint.pycqa.org/en/latest/whatsnew/2/2.14/summary.html#removed-checkers. (useless-option-value)
************* Module test_h5py_dataset
tests/extras/datasets/hdf5/test_h5py_dataset.py:12:0: E0401: Unable to import 'kedro.extras.datasets.hdf5.h5py_dataset' (import-error)
tests/extras/datasets/hdf5/test_h5py_dataset.py:13:0: E0401: Unable to import 'kedro.io' (import-error)
tests/extras/datasets/hdf5/test_h5py_dataset.py:14:0: E0401: Unable to import 'kedro.io.core' (import-error)
tests/extras/datasets/hdf5/test_h5py_dataset.py:156:17: C2801: Unnecessarily calls dunder method __enter__. Invoke context manager directly. (unnecessary-dunder-call)
tests/extras/datasets/hdf5/test_h5py_dataset.py:151:0: I0021: Useless suppression of 'no-member' (useless-suppression)

------------------------------------------------------------------
Your code has been rated at 8.79/10 (previous run: 5.68/10, +3.11)

I assume our fork being a bit over a year old might be to blame, but I couldn't find any issue in the GitHub repo related to changes to the configuration files mentioned in the error messages.

cc @Eric-OG

AhdraMeraliQB commented 1 year ago

Closing this issue as our datasets now live in the kedro-plugins repository. Feel free to re-open this issue / migrate the PR contribution over 😄 .

tomaz-suller commented 1 year ago

I was actually thinking about commenting on this just the other day. I'd still be open to contributing, though what strikes me is that this has sat for about two years and there doesn't seem to be any other request for it.

Would it make sense to add this dataset from a maintenance perspective? My view is that it wouldn't, just to support this one niche use case, but I'd like to get the Kedro developers' perspective as well.

AhdraMeraliQB commented 1 year ago

Hi @tomaz-suller - thanks for raising this concern. Generally some datasets will always be more niche than others. Our rule of thumb is that as long as the dataset is introduced with a suitable test suite, we are more than happy to accept the contribution!

astrojuanlu commented 1 year ago

https://github.com/kedro-org/kedro/pull/1964 was an attempt to close this issue; I transferred it here and am reopening it for visibility.