uwmadison-chm / bioread

Utilities to work with files from BIOPAC's AcqKnowledge software
MIT License

Silent data dropping on Windows #44

Open expensne opened 11 months ago

expensne commented 11 months ago

Description

When bioread.read_file() is used on large .acq files on Windows, the channel contents (channel.data) are either wrong or all 0.

Example

I wrote a little script that just outputs some statistics of each channel. Reading a 7h .acq measurement:

Output on Mac:

Channel Name                Length of Data  Data Min               Data Max             Data Sum             Data Mean
TSD115                      51387171        -0.03662109375         0.030517578125       -76605.6396484375    -0.0014907541738080406
SpO2, OXY100E               51387171        125.44656101661391     125.5006428006329    6447771653.388314    125.4743455985992
SKT100C_room                51387171        23.01269521660155      23.344150026110828   1191848045.9407477   23.193494071521233
SKT100C_sub                 51387171        29.594488971118135     29.72842724138181    1524068096.2418365   29.658532792977386
RSP100C                     51387171        -0.09613037109375      0.01373291015625     -1369612.0825195312  -0.026652801776527672
EDA100C                     51387171        -0.004578707756053291  0.01373183911894671  248566.91066513592   0.00483713942270019
PPG100C                     51387171        0.040283203125         0.1287841796875      4387299.780883789    0.0853773363177317
PPG100C                     51387171        -0.672607421875        -0.6341552734375     -33551446.154785156  -0.6529148326687444
Rate, OXY100E               51387171        508.248756045387       508.497546968006     26123345506.30136    508.36317699414434
EGG100C                     51387171        0.011138916015625      0.023956298828125    959572.4333190918    0.018673385100711065
EMG100C                     51387171        -0.04486083984375      0.048065185546875    90746.98333740234    0.0017659462774746316
EMG100C                     51387171        -1.07818603515625      1.0882568359375      144419.16244506836   0.0028104127865896406
EMG100C                     51387171        -0.6146240234375       0.621795654296875    120961.93649291992   0.002353932589379554
ECG100C_lead_1              51387171        -0.014190673828125     0.013275146484375    -20395.555572509766  -0.00039689975485340036
ECG100C_lead_2              51387171        -0.02105712890625      0.020599365234375    55009.0348815918     0.0010704818695232668
DI_synchronization_1        51387171        0.0                    5.0                  64017665.0           1.2457908025331847
DI_synchronization_2        51387171        0.0                    5.0                  63964270.0           1.2447517299599933
ECG100C_lead_3_calculation  51387171        -0.021514892578125     0.02593994140625     75404.59045410156    0.0014673816243766671

Output on Windows:

Channel Name                Length of Data  Data Min              Data Max              Data Sum             Data Mean
TSD115                      51387171        0.0                   0.0                   0.0                  0.0
SpO2, OXY100E               51387171        0.0                   0.0                   0.0                  0.0
SKT100C_room                51387171        32.2222222            32.2222222            1655808842.1915748   32.22222220000348
SKT100C_sub                 51387171        32.2222222            32.2222222            1655808842.1915748   32.22222220000348
RSP100C                     51387171        0.0                   0.0                   0.0                  0.0
EDA100C                     51387171        -0.07019150072480329  -0.07019150072480329  -3606942.6504927017  -0.07019150072481518
PPG100C                     51387171        0.0                   0.0                   0.0                  0.0
PPG100C                     51387171        0.0                   0.0                   0.0                  0.0
Rate, OXY100E               51387171        0.0                   0.0                   0.0                  0.0
EGG100C                     51387171        0.0                   0.0                   0.0                  0.0
EMG100C                     51387171        0.0                   0.0                   0.0                  0.0
EMG100C                     51387171        0.0                   0.0                   0.0                  0.0
EMG100C                     51387171        0.0                   0.0                   0.0                  0.0
ECG100C_lead_1              51387171        0.0                   0.0                   0.0                  0.0
ECG100C_lead_2              51387171        0.0                   0.0                   0.0                  0.0
DI_synchronization_1        51387171        0.0                   0.0                   0.0                  0.0
DI_synchronization_2        51387171        0.0                   0.0                   0.0                  0.0
ECG100C_lead_3_calculation  51387171        0.0                   0.0                   0.0                  0.0

I also tested it with even longer measurements. It always produces the above issue.

Error

No error is shown; it seems to just drop the data silently.

Env

Bioread version 3.0.1. Tested with Python 3.8, 3.9, 3.10, and 3.11 on 3 different Windows machines (all Windows 10 with 16 GB RAM).

Notes

I noticed that the RAM usage rapidly goes up to 100% and then down again. Maybe this is the issue.
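As a rough sanity check on the RAM observation, here is a back-of-the-envelope estimate of how much memory the scaled channel data alone would need. It assumes 8 bytes per sample (float64), which is an assumption on my part and not something I verified in bioread; the helper name estimate_bytes is made up for this sketch. The channel and sample counts are taken from the tables above.

```python
def estimate_bytes(n_channels, n_samples, bytes_per_sample=8):
    """Bytes needed to hold every sample of every channel in memory.

    bytes_per_sample=8 assumes float64 storage (an assumption, not verified).
    """
    return n_channels * n_samples * bytes_per_sample


# Values from the 7h.acq output above: 18 channels, 51387171 samples each.
n_channels = 18
n_samples = 51_387_171

total = estimate_bytes(n_channels, n_samples)
print(f"{total} bytes = {total / 2**30:.2f} GiB")
```

Under that assumption the data comes to roughly 6.9 GiB before any temporary copies, which is a large share of a 16 GB machine.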

Script used

import bioread

def main(args):

    # get filepath ...

    data = bioread.read_file(args[0])
    assert data is not None

    lines = [["Channel Name", "Length of Data", "Data Min", "Data Max", "Data Sum", "Data Mean"]]

    for channel in data.channels:
        name = str(channel.name)
        # Use a separate name so the file object in `data` is not shadowed.
        samples = channel.data

        lines.append([name, len(samples), samples.min(), samples.max(), samples.sum(), samples.mean()]) # type: ignore

    # print ...

Full test script can be found here: https://github.com/expensne/bioread_test/

And .acq test files here: https://owncloud.fraunhofer.de/index.php/s/ukLl0x34UkYm3Or

1h.acq is working fine. 7h.acq is producing the above output.
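Since no error is raised, one way to notice the problem programmatically is to flag channels whose samples never change, which is what the Windows output above looks like (min equals max on every channel). This is only a minimal sketch: find_flat_channels is a hypothetical helper, not part of bioread, and a flat channel is a symptom, not proof of a bad read.

```python
import numpy as np


def find_flat_channels(named_channels):
    """Return the names of channels whose samples are all identical.

    named_channels: iterable of (name, samples) pairs, where samples is
    array-like. Empty channels are skipped.
    """
    flat = []
    for name, samples in named_channels:
        samples = np.asarray(samples)
        if samples.size > 0 and samples.min() == samples.max():
            flat.append(name)
    return flat


# Synthetic stand-ins for channel data, not real measurements.
example = [
    ("ok_channel", np.array([0.1, 0.2, 0.15])),
    ("all_zero", np.zeros(5)),
    ("stuck_value", np.full(4, 32.2222222)),
]
print(find_flat_channels(example))
```

With the script above, the call would be something like find_flat_channels((ch.name, ch.data) for ch in data.channels).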

njvack commented 11 months ago

Interesting! I don't have time to look into this right now, but! Is the "length of data" field probably correct?

And! If you want to use bioread with very big files, I think you'll want to use the streaming API for this -- see Reader.stream(). See

https://github.com/uwmadison-chm/bioread/blob/main/bioread/runners/acq2hdf5.py#L163

for an example of its usage.

Really, though, I might convert the files to HDF5 (assuming that works properly 🤞) and use an HDF5 library (which is probably going to be better than bioread in a lot of ways) for reading the data in your code.

expensne commented 11 months ago

> Is the "length of data" field probably correct?

Yes.

> Really, though, I might convert the files to HDF5 (assuming that works properly 🤞) and use an HDF5 library (which is probably going to be better than bioread in a lot of ways) for reading the data in your code.

Right, converting it to HDF5 first sounds like a good idea. I'll do that!

njvack commented 11 months ago

Let me know if it works; it may not! This code is, um, not well-tested on large inputs. But my guess is that something is going horribly wrong when trying to read the whole thing into memory, and maybe streaming it into another data structure will help.
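For the statistics in the original script, streaming into another data structure could look roughly like the sketch below: the same min/max/sum/mean are accumulated one chunk at a time, so only a single chunk has to be in memory. RunningStats and summarize are hypothetical names, not bioread API, and the chunks here are plain NumPy arrays. How to pull per-channel chunks out of Reader.stream() is not shown; the linked acq2hdf5.py is the reference for that.

```python
import numpy as np


class RunningStats:
    """Accumulates count, sum, min and max over a sequence of chunks."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, chunk):
        """Fold one chunk of samples into the running statistics."""
        chunk = np.asarray(chunk, dtype=np.float64)
        if chunk.size == 0:
            return
        self.count += chunk.size
        self.total += float(chunk.sum())
        self.minimum = min(self.minimum, float(chunk.min()))
        self.maximum = max(self.maximum, float(chunk.max()))

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


def summarize(chunks):
    """Consume an iterable of chunks and return the final RunningStats."""
    stats = RunningStats()
    for chunk in chunks:
        stats.update(chunk)
    return stats


# Synthetic chunks standing in for one channel's streamed data.
result = summarize([np.array([1.0, 2.0]), np.array([3.0])])
print(result.count, result.minimum, result.maximum, result.total, result.mean)
```

One such accumulator per channel would reproduce the table from the original script without ever holding a full channel in memory.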