hubverse-org / hubverse-transform

Data transform functions for hubverse model-output files
MIT License

Ensure that the hubverse-transform package respects Windows paths #15

Open bsweger opened 1 month ago

bsweger commented 1 month ago

Background

Thanks to @lshandross, we have output from the hubverse-transform test suite indicating that our path handling doesn't work well on a Windows machine (see the attached file on this PR comment).

There are two underlying reasons for this:

  1. The package is currently designed to read/write data from S3. It can operate on local copies of model-output files, but that hasn't been extensively tested in a cross-platform way.
  2. To avoid network connections, the unit tests rely heavily on local operations (hence the multiple test failures in a Windows environment).

hubverse-transform should have better cross-platform support, but because it is designed for cloud-based operations, let's start by fixing the second item: making the test suite work on Windows machines. This ensures that Windows-based devs can contribute to the project.
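To illustrate the kind of fix this could involve, here is a minimal sketch of separator-agnostic path handling. It assumes (this thread doesn't confirm it) that the failures come from Windows backslash separators reaching code that expects forward slashes. The helper names and the example path are hypothetical and are not part of hubverse-transform.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_posix_path(raw_path: str) -> str:
    """Return raw_path with forward slashes, whichever style it arrived in.

    Hypothetical helper, not part of hubverse-transform.
    """
    # PureWindowsPath treats both "\" and "/" as separators and does no
    # filesystem access, so it behaves the same on Linux, macOS, and Windows.
    return PureWindowsPath(raw_path).as_posix()


def split_model_output_path(raw_path: str) -> tuple[str, str]:
    """Split a model-output path into (parent directory, file name)."""
    posix = PurePosixPath(to_posix_path(raw_path))
    return str(posix.parent), posix.name


if __name__ == "__main__":
    # Made-up Windows-style path, for illustration only.
    example = r"C:\hub\model-output\teamA-model\2024-01-01-teamA-model.csv"
    print(to_posix_path(example))
    print(split_model_output_path(example))
```

Because the pure path classes never touch the filesystem, a test written this way can exercise Windows-style input on any CI runner. One caveat: a POSIX file name that legitimately contains a backslash would be split by this approach.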

As for the first item, local operations exist as a side-effect rather than a fully-formed feature, so it's not worth spending a ton of time here until there's an actual feature request.

Definition of done

bsweger commented 1 month ago

This points to a larger issue with the hubverse-transform unit tests: they rely heavily on local read/write operations because the code uses PyArrow FS, and you can't instantiate the pyarrow.fs.S3FileSystem class against an S3 bucket that doesn't exist (even if you don't plan to read from it or write to it).

Using local operations in lieu of integration tests against a true mocked AWS/S3 environment has caused us to miss at least one bug related to S3 file handling, so it might now be time to rethink our approach. We can't use moto (it only works when the code base uses boto to access AWS), but maybe moto server? localstack? minio?
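As a starting point for evaluating those options, here is a rough sketch of pointing PyArrow at a locally running S3-compatible service. It relies on the `endpoint_override`, `scheme`, `access_key`, and `secret_key` arguments of `pyarrow.fs.S3FileSystem`. The endpoint, the credentials, and both helper names are placeholders made up for illustration, and whether a given service (moto server, localstack, minio) behaves closely enough to real S3 would still need to be verified.

```python
def local_s3_kwargs(endpoint: str = "localhost:9000") -> dict:
    """Build S3FileSystem arguments for a local S3-compatible test service.

    Hypothetical helper; the endpoint and credentials are placeholders that
    must match however the local service is actually started.
    """
    return {
        "endpoint_override": endpoint,  # host:port of the local service
        "scheme": "http",               # local test services usually don't use TLS
        "access_key": "test-access-key",
        "secret_key": "test-secret-key",
    }


def make_test_filesystem(endpoint: str = "localhost:9000"):
    """Create a PyArrow filesystem that talks to the local service."""
    # Imported lazily so the kwargs helper can be used without pyarrow.
    from pyarrow import fs

    return fs.S3FileSystem(**local_s3_kwargs(endpoint))
```

A test fixture could start the service, create a bucket in it, and then pass the result of `make_test_filesystem()` to the code under test, so the tests exercise the S3 code path instead of local file operations.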