This closes #112 by adding compression support to the reader and the writer. The handling is done entirely on the pandas side.
Supported compression formats are the ones supported by pandas.read_csv itself: .gz, .bz2, .zip, .xz, .zst, .tar, .tar.gz, .tar.xz or .tar.bz2.
Mandatory big thanks to Andrey Abramov who suggested the implementation and simplified this a lot for me.
The changes are:
The read_tfs function now uses pandas.read_csv as a context manager when reading the non-data part of the file. It iterates through the file one chunk (line) at a time with the same logic as before, but can now go through compressed files.
The write_tfs function uses pandas.io.common.get_handle to write to the file instead of a file descriptor from pathlib.Path as previously, which allows compression. Only the context manager line is changed.
Comments were added in the relevant places.
The function docstrings have been updated with an admonition regarding how to handle compression, and an additional example for it.
Tests have been added in tests/test_compression to check only the validity of data in case of compression (load compressed, compare with uncompressed, etc.). Relevant compressed files have been added to the inputs.
The zstandard library is added as a test dependency (for the .zst format tests) but not as a package dependency. Should a user want to open a file.tfs.zst without it installed, they will get an error message stating that the library is needed.
Changelog and readme have been updated.
The CI workflows have been updated: the [hdf5] extra is now installed before tests, and the step using the deprecated set-output syntax of GitHub Actions is removed (the full Python version is now available from the previous step).
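The reader and writer changes listed above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the write_lines and read_lines helpers and the UNLIKELY_SEP value are hypothetical names, and the extra read_csv options (dtype, na_filter, quoting) are assumptions made to keep each line as an untouched string.

```python
import csv
from pathlib import Path

import pandas as pd
from pandas.io.common import get_handle

# Hypothetical separator: any string not expected to appear in a header line.
UNLIKELY_SEP = "@@@@"


def write_lines(path, lines):
    """Write text lines; the compression is inferred from the file extension."""
    with get_handle(path, mode="w", compression="infer") as handles:
        handles.handle.write("\n".join(lines) + "\n")


def read_lines(path):
    """Yield the lines of a (possibly compressed) file, one chunk at a time."""
    with pd.read_csv(
        path,
        header=None,             # no line is treated as column names
        chunksize=1,             # each chunk is a single line
        sep=UNLIKELY_SEP,        # never matches, so a line stays in one column
        engine="python",         # multi-character separators need this engine
        dtype=str,               # keep every line as a string
        na_filter=False,         # do not turn values into NaN
        quoting=csv.QUOTE_NONE,  # leave quote characters alone
    ) as reader:
        for chunk in reader:
            yield chunk.iloc[0, 0]
```

Calling list(read_lines("some_file.tfs.gz")) on a file written with write_lines should give back the same lines, whether or not the extension implies compression.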
Examples
Importantly, thanks to pandas this new functionality is completely transparent to the user. The compression format is inferred from the file extension and managed automatically. See below for an example:
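A minimal sketch of the extension-based inference this relies on, using pandas directly rather than the package's own functions. The roundtrip and is_compressed helpers, the MAGIC table and the file names are illustrative assumptions; read_tfs and write_tfs are expected to behave the same way with respect to the file extension.

```python
from pathlib import Path

import pandas as pd

# Leading bytes that identify each compressed format on disk.
MAGIC = {".gz": b"\x1f\x8b", ".bz2": b"BZh", ".xz": b"\xfd7zXZ\x00"}


def roundtrip(df, path):
    """Write df to path and read it back; pandas infers the compression."""
    df.to_csv(path, sep=" ", index=False)
    return pd.read_csv(path, sep=" ")


def is_compressed(path):
    """Check that the file on disk starts with the magic bytes of its extension."""
    path = Path(path)
    return path.read_bytes().startswith(MAGIC[path.suffix])
```

For instance, roundtrip(df, "twiss.tfs.gz") writes a gzip file and returns the same data, with no compression argument passed anywhere.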
Caveats / to be discussed:
[x] As mentioned, I have added the zstandard library as a test dependency only. I am quite confident we should not make it a package dependency but this is up for discussion.
[x] In the reader, when parsing non-data lines, I need to pass a separator (sep argument) to pandas.read_csv so that it does not use the default one (a comma): the default would split any header value containing a comma and break the parsing. At the moment I use a randomly generated string that I do not expect to find in any header of any file, but maybe we can think of a better idea than hard-coding this?
[x] Due to (I assume) protocol changes in compression between Python 3.7 and later versions, if the compressed test files were generated with 3.8 or above then the 3.7 tests would fail, and vice versa. Considering that 3.7 is officially no longer supported by pandas and is about 4 months away from EoL, I decided to skip the compression tests on Python 3.7 in the CI.
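The separator concern raised in the caveats can be illustrated with a small sketch. The split_header helper, the sample header line and the "@@@@" separator are made up for the illustration and are not taken from this PR.

```python
import io

import pandas as pd


def split_header(line, sep):
    """Parse a single header line with pandas and return the resulting fields."""
    frame = pd.read_csv(
        io.StringIO(line + "\n"),
        header=None,      # the line is data, not column names
        sep=sep,
        engine="python",  # accepts both "," and multi-character separators
        dtype=str,        # keep the fields as strings
    )
    return frame.iloc[0].tolist()


line = "@ TITLE %s optics, run 1"
fields_default = split_header(line, ",")    # the comma cuts the value in two
fields_custom = split_header(line, "@@@@")  # the line stays in one piece
```

With the default separator the header value ends up spread over two fields, while a separator that never occurs in the line keeps it intact for the existing parsing logic.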