smerckel / dbdreader

A reader for binary data files created by Slocum ocean gliders (AUVs)
GNU General Public License v3.0

read compressed files #21

Closed jklymak closed 1 year ago

jklymak commented 1 year ago

It turns out the compression format used by Teledyne Webb is just LZ4, so we could support reading .dcd and .ecd compressed files directly. I haven't looked in detail at how to do that in dbdreader yet, but basically `compexp -x` just repeatedly 1) reads a block's compressed size and then 2) reads and decompresses that block, until the end of the file:

elif sys.argv[1] == 'x':
    # Requires `import lz4.block`; CHUNK_SIZE is the uncompressed
    # block size (32 kB).
    with open(sys.argv[3], "wb") as f2, open(sys.argv[2], "rb") as f1:
        while True:
            comp_size_bytes = f1.read(2)
            if not comp_size_bytes:
                break  # end of file
            compressed_size = int.from_bytes(comp_size_bytes, 'big')
            compressed_block = f1.read(compressed_size)
            if not compressed_block:
                break
            uncompressed = lz4.block.decompress(compressed_block,
                                                uncompressed_size=CHUNK_SIZE)
            f2.write(uncompressed)

Alternatively, it's not a big deal to just add a preprocessing step that does this before calling dbdreader.
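A minimal sketch of such a preprocessing step, assuming the two-byte big-endian length-prefix framing described above. The `decompress_stub` pass-through stands in for `lz4.block.decompress` so the sketch runs without the third-party `lz4` package; the function names are hypothetical:

```python
import io

CHUNK_SIZE = 32 * 1024  # assumed maximum uncompressed block size (32 kB)


def decompress_stub(block, uncompressed_size=CHUNK_SIZE):
    # Stand-in for lz4.block.decompress(); real .dcd/.ecd files need lz4.
    return block


def encode_block(payload):
    # Frame one block with its 2-byte big-endian size prefix.
    return len(payload).to_bytes(2, 'big') + payload


def decode_file(src, dst, decompress=decompress_stub):
    """Read length-prefixed blocks from src, write decoded output to dst."""
    while True:
        size_bytes = src.read(2)
        if len(size_bytes) < 2:
            break  # end of file
        block = src.read(int.from_bytes(size_bytes, 'big'))
        if not block:
            break
        dst.write(decompress(block, uncompressed_size=CHUNK_SIZE))
```

With the real `lz4` package installed, passing `decompress=lz4.block.decompress` would recover the original .dbd/.ebd content, after which dbdreader could be pointed at the decoded files as usual.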

smerckel commented 1 year ago

I haven't come across compressed binary files yet, but I am happy to implement this. It will also make sense to do so, as sooner or later other dbdreader users might want the same functionality. If I could receive such a compressed file for testing purposes, that would be very helpful indeed: ideally a small file, including its cac file, that can be included in the tests. It is also fine if such a file can be fetched from some data repository.

jklymak commented 1 year ago

Great. A couple of missions would be:

Short mission: https://cproof.uvic.ca/gliderdata/card_offloads/dfo-k999/dfo-k999-20230720/

Full mission: https://cproof.uvic.ca/gliderdata/card_offloads/dfo-k999/dfo-k999-20230418/2023-05-14_K999/

and then under these in flight/logs and science/logs.

Please let me know if it would be easier for me to tar these up.

I guess there was a bit of a grumpy header in the Teledyne Webb source code about "reverse engineering" their code. I don't think they would mind, and it's pretty straightforward what they have done, but maybe we should ask them first.

smerckel commented 1 year ago

I have created a new branch "decompression" which implements easy decompression of binary files. I considered two modes of operation: repeatedly reading the original data files, or a single processing step that converts all glider binary files into some other format. The first use case is what I typically do myself. It then makes sense to convert all numerical files to long-format files and decompress them at the same time; dbdrename.py does that now when supplied with the -x option while reading compressed files. After this step, everything works as it used to.

Using dbdrename.py might not be everyone's cup of tea, in particular when the original compressed data files are to be read only once. The Python part of the DBD and MultiDBD modules needs to read the header of each file to extract some information. This is done by decompressing the first block of 32 kB and writing the content to an in-memory file object first. In the C part of DBD and MultiDBD, that is, where the parameters are de facto read, the whole file is first decompressed and written to an in-memory file. This means that the first block of each compressed file is decompressed twice; the way things are organised does not provide an easy way to avoid this.

Using in-memory files in C is easy on Linux, but Windows does not offer such an option. For now I chose to just write the decompressed file to disk: a 01600000.dcd would then get its decompressed companion 01600000.dbd, which is then read. This assumes that the data directory is writable.
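The first-block approach on the Python side can be sketched as follows: decode only the leading block into an in-memory file object and hand that to the header parser. The 32 kB block size, the function name, and the pass-through `decompress` default (standing in for `lz4.block.decompress`) are illustrative assumptions, not the actual branch code:

```python
import io

BLOCK_SIZE = 32 * 1024  # assumed uncompressed block size (32 kB)


def first_block_as_file(path, decompress=lambda b: b):
    """Decode only the first length-prefixed block of a compressed file.

    Returns an in-memory file object, which is enough to parse the
    header without decompressing the whole file.
    """
    with open(path, 'rb') as f:
        size = int.from_bytes(f.read(2), 'big')
        block = f.read(size)
    return io.BytesIO(decompress(block))
```

Reading only the first block keeps header inspection cheap, at the cost of that block being decompressed again later when the full file is decoded.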

To make the approach on Windows as reliable as on Linux, one could use a third-party library providing the same functionality as fmemopen(), or a fallback mechanism in which different directories are tried when it is not possible to write decompressed data files next to the originals. These approaches come with quite a bit of work, and I am not sure it is worth the effort. I have no clue if, or how many, people use this software on Windows.

Does it work on a Mac? I have no idea. For now it follows the Windows approach. fmemopen() is a POSIX extension, so it may be available on Macs, but I have no way of testing that.
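The fallback-directory idea could look roughly like this: try the data directory first, then fall back to a temporary directory when it is not writable. The function name and candidate list are hypothetical, just to make the mechanism concrete:

```python
import os
import tempfile


def pick_output_path(dcd_path):
    """Choose where to write the decompressed .dbd companion of dcd_path.

    Prefers the original data directory; falls back to the system
    temporary directory when the data directory is not writable.
    """
    base = os.path.splitext(os.path.basename(dcd_path))[0] + '.dbd'
    candidates = [os.path.dirname(dcd_path) or '.', tempfile.gettempdir()]
    for d in candidates:
        if os.access(d, os.W_OK):
            return os.path.join(d, base)
    raise OSError("no writable directory found for decompressed file")
```

This keeps the common case (writable data directory) unchanged while giving read-only setups a working path, at the cost of scattering decompressed files outside the data directory.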

Any suggestions or feedback are welcome.