ratal / mdfreader

Read Measurement Data Format (MDF) versions 3.x and 4.x file formats in python

Incorrect record length #189

Closed ludwig-nc closed 3 years ago

ludwig-nc commented 3 years ago

Python version

Python 3.7.4

Platform information

OSX and Linux

Numpy version

'1.16.2'

mdfreader version

4.1

Description

Hi Aymeric,

sorry for bothering you again. I think I narrowed down my problem but I need your help to investigate further.

I am reading an MDF4 file, but the output has an incorrect length. It doesn't matter whether I use the dataRead module or whether I pass a channel_list; the resulting length is the same:

I changed the channel names, since I am not allowed to share the data:

dat = mdfreader.Mdf(fname, channel_list=['data_channel'])
print(len(dat['time_channel']['data']))
print(len(dat['data_channel']['data']))

>>> 137646 
>>> 137646

I am pretty certain this is wrong, because I get a record length of 137501 when using the asammdf package, and it is exactly the last 145 entries of the array that contain incorrect information. These last 145 entries are either 0 (without the dataRead module) or, when dataRead is used, contain random data, i.e. if I repeat the read I get different values for those entries. This suggests that the array is allocated in C with a length of 137646 but only the first 137501 elements get filled.

I know this is not enough information to identify exactly what is going wrong, but I hope you can tell me again which part of the code would be best to debug.

ratal commented 3 years ago

Hi, thanks for persevering :) Could you cross-check the values MDFValidator shows for your file in the CGBlock linked to your faulty channel? The cycle_count is the number of samples; I guess it should be 137501 and not 137646. Are there invalid_bytes? It is still hard to point at a specific part of the code, but there is a chance of a bug in mdf4reader.py, in the Record class, load_info() method. Is Record.numberOfRecords = 137501? If that is correct, maybe there is a bug in the way data is read chunk by chunk. Is it a "big" file? (Could your channel be bigger than 100 MB? Then maybe something is wrong in the way chunks are handled in the Record class, generate_chunks() method.)

ludwig-nc commented 3 years ago

Thanks for always answering so fast. I am sorry, I should have checked with MDFValidator earlier, but I do not usually work on a Windows machine.

MDFValidator issues a warning that is related to the problem:

"calculated length from cycle count of child CG blocks does not match length of data block, possibly wrong cycle counters (count too high). Calculated length of data: 26978616, actual length: 26950196, expected cycles: 137501"

The cycle count of the CGBlock is indeed 137646, and so is numberOfRecords. Do I understand this correctly: the data is split into several chunks, and MDFValidator is complaining that the sum of the chunks does not correspond to the total size listed in the DataGroup header? I will try to find out how this could have happened.

This would also explain why there is a part of the data array that no data is written to, leaving random values. I assume you allocate an array of size 137646 (since this is the cycle count in the DG header), but only the first 137501 entries are filled when the chunks of data are read in one after the other.

This is definitely not an issue with mdfreader, but do you see a way to handle this with the current implementation? Even if the read simply failed in such a case, that would be helpful. Again, this is not a bug on your side, but would it be possible to use "PyMem_Calloc" instead of "PyMem_Malloc" in dataRead? That would at least consistently return 0 in such a case, which would be easier to deal with.
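
As a rough illustration of that allocation difference (a numpy analogy only; mdfreader's dataRead module allocates in C, so this just shows the observable behaviour):

import numpy as np

n_allocated, n_filled = 137646, 137501   # counts from this issue
malloc_like = np.empty(n_allocated)      # uninitialized, like PyMem_Malloc
calloc_like = np.zeros(n_allocated)      # zero-filled, like PyMem_Calloc
malloc_like[:n_filled] = 1.0             # only the first 137501 records get written
calloc_like[:n_filled] = 1.0
print(malloc_like[n_filled:])            # whatever was left in memory
print(calloc_like[n_filled:])            # consistently 0.0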

ratal commented 3 years ago

The cycle count of the CGBlock is indeed 137646, and so is numberOfRecords. Do I understand this correctly: the data is split into several chunks, and MDFValidator is complaining that the sum of the chunks does not correspond to the total size listed in the DataGroup header? I will try to find out how this could have happened.

A chunk error could have come from mdfreader handling chunks wrongly at read time, but as numberOfRecords is as expected, there is no issue there. MDFValidator is telling you that something went wrong during recording: your recorder did not finish recording as it expected. There is a high chance there are invalid_bytes, did you check? If there are, you can take them into account via a masked array with the apply_all_invalid_bit() method. In the IDBlock, could you tell me the value of id_unfi_flags? If some of its bits are set, specific recovery code could be written to update the cycle count (numberOfRecords) based on block and record size rather than on the CGBlock value (more expected behaviour than the PyMem change).
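
For reference, the recovery described here (deducing the cycle count from block and record size) can be checked against the numbers in the MDFValidator warning above; a minimal sketch with illustrative names, not mdfreader internals:

def cycle_count_from_block_size(cg_cycle_count, data_block_byte_length, record_byte_size):
    # the data block can only hold this many complete records
    records_in_block = data_block_byte_length // record_byte_size
    return min(cg_cycle_count, records_in_block)

record_size = 26978616 // 137646                                     # 196 bytes per record
print(cycle_count_from_block_size(137646, 26950196, record_size))    # -> 137501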

ludwig-nc commented 3 years ago

Thanks again for dealing with this, since it is not really a problem of mdfreader. 'id_unfi_flags': 0, so I guess that means no such recovery is possible.

The invalid channel and invalid bit fields are set for the data channel (not for the time channel, but I think that is normal). I did not implement invalid-bit handling in our data pipeline, but when I go through that code, the bit in question does not seem to be set for the range between 137501 and 137646. We also only use it to mask the data values (not the time values) as NaN, which wouldn't help in my case either. The problem came up because I am looking for the maximum of the time channel (which sometimes results in values > 1e9 and messes things up).

danielhrisca commented 3 years ago

Hello guys,

I think this happens when the writing application pre-allocates the space on disk for the data block, so when the measurement is stopped the final block may not be completely filled. In this case you should consider the channel group cycle count when loading the data from the file.

Of course the nice thing would be if the writing application would also update the block_len field of the last data block when finishing the measurement.

ratal commented 3 years ago

Hi, did you try the apply_all_invalid_bit() method? It should convert your channel into a masked array; then you can get the mask with .mask and check the last elements. Thanks @danielhrisca for the comment; mdfreader considers the CGBlock cycle count by default, so I would rather suspect invalid bits that are not handled well.
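
A short sketch of that check (the exact call signature of apply_all_invalid_bit() is an assumption here; the dict-style access and the anonymized channel name are taken from the comments above):

import numpy as np
import mdfreader

dat = mdfreader.Mdf(fname)       # fname: path to the faulty MDF4 file
dat.apply_all_invalid_bit()      # channels with invalid bits should become masked arrays

data = dat['data_channel']['data']
if np.ma.isMaskedArray(data):
    # if the logger flagged the trailing records as invalid, the mask shows it here
    print(data.mask[-150:])
else:
    print('no invalid bits applied for this channel')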

ludwig-nc commented 3 years ago

I'll be on vacation till August 11th. I'll try then. Thanks a lot


danielhrisca commented 3 years ago

Do you also consider the cycle count when you read the raw data blocks?

ratal commented 3 years ago

Only the cycle count from the CG block, cg_cycle_count; there is no deduction from block size or cross-check of sizes from block start positions (conceivable for advanced recovery functions, but not there).

danielhrisca commented 3 years ago

This is what I was trying to explain: the raw block size is not consistent with the cycle count, so you end up reading extra bytes compared to the expected array size.

ratal commented 3 years ago

But after checking the numbers in the comments above, the cycle count matches the block size and record size. The wrong data is inside the block; that is why I rather suspect wrong handling of invalid bits, either by mdfreader or by the logger.

danielhrisca commented 3 years ago

MDFValidator issues a warning that is related to the problem:

"calculated length from cycle count of child CG blocks does not match length of data block, possibly wrong cycle counters (count too high). Calculated length of data: 26978616, actual length: 26950196, expected cycles: 137501"

This MDFValidator output makes me think that the problem is the one I've stated

ratal commented 3 years ago

Yes, correct Daniel. I understand now that the cycle count is higher than what the data block contains, so at the end we read into the next following block, not random data. Another block could have been written by the logger on top of the data block by mistake, or the logger did not reflect the actual cycle count after some corrections... there could be a lot of reasons, but most probably it is a bug in the MDF writer. MDFValidator is able to catch this error, most probably by registering all block pointers and analysing the gaps in between from the block contents; it is made for that. But I am afraid that implementing this kind of check (we would have to flatten all blocks and sort them by their pointers, then check the gaps, or refactor the mdfinfo classes, complicated) would impact performance for the default use case. It could be triggered by an optional parameter though and kept apart for occasional cases. So @ludwig-nc, how often do you encounter this issue? Do you have a lot of problematic files that you absolutely want to read? Do you get support from your MDF writer/logger vendor to report this bug? In a nutshell, is it worth having a fix for you? @danielhrisca, do you perform this kind of check in asammdf?
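
The block-pointer scan described above could look roughly like the following (a minimal sketch, not mdfreader's or asammdf's actual implementation; it relies on the 24-byte MDF4 block header, where the block length is a little-endian uint64 at offset 8):

import re
import struct

def scan_block_positions(path):
    # collect (start address, block id, declared length) for every '##xx' block header
    with open(path, 'rb') as fid:
        data = fid.read()
    blocks = []
    for match in re.finditer(rb'##[A-Z]{2}', data):
        start = match.start()
        if start % 8:                  # MDF4 blocks are 8-byte aligned
            continue
        if start + 24 > len(data):
            continue
        length, = struct.unpack_from('<Q', data, start + 8)
        blocks.append((start, match.group().decode(), length))
    return blocks

# a gap between one block's declared end (start + length) and the next block's start
# hints at an over-allocated data block whose cycle count is too high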

danielhrisca commented 3 years ago

@danielhrisca, do you perform this kind of check in asammdf?

I have such a function for detecting the start addresses of all the blocks https://github.com/danielhrisca/asammdf/blob/master/asammdf/blocks/utils.py#L1442 but I only use it for the finalization steps.

Yes, correct Daniel. I understand now that the cycle count is higher than what the data block contains, so at the end we read into the next following block, not random data. Another block could have been written by the logger on top of the data block by mistake, or the logger did not reflect the actual cycle count after some corrections... there could be a lot of reasons, but most probably it is a bug in the MDF writer.

I did not catch this problem at first glance. I thought the cycle count was actually lower than the block size.

ratal commented 3 years ago

@danielhrisca, thanks for the tip. @ludwig-nc, can you check with MDF Validator whether any unfinalised flag is set in the IDBlock?

ratal commented 3 years ago

In the dev branch, I added a new function to finalise mdf files and eventually correct cycle counts. Even if the file is not flagged as unfinalised, you can force the scan with the argument force_file_integrity_check=True -> could you give it a try with your file?

ludwig-nc commented 3 years ago

Sorry for being inactive for a while, I was swamped with other stuff. I get the following error when using the flag in mdfreader.Mdf(file, force_file_integrity_check=True):

~/src/github/mdfreader/target/lib/mdfreader/mdf4reader.py in file_finalization(version, info, fid, finalization_writing_to_file, force_file_integrity_check)
   2842                                 dl.read_dl(fid, dl['link_count'])
   2843                                 for pointer in dl['list_data'][0]:
-> 2844                                     dz.read_dz(fid, pointer)
   2845                                     data_block_size_in_file = dz['dz_org_data_length']
   2846                                     data_size_in_file += data_block_size_in_file

TypeError: read_dz() takes 2 positional arguments but 3 were given

When I check the function definition of read_dz in mdfinfo4.py, it only expects one argument, fid. Maybe the changes to mdfinfo4.py weren't pushed to the dev branch?

ratal commented 3 years ago

Hi @ludwig-nc, I pushed modifications that have been tested a bit more; this error should not happen anymore. Can you try again and check if it improves your situation?

ludwig-nc commented 3 years ago

Hi @ratal,

The errors are gone, but unfortunately it doesn't solve the problem. Maybe the files are more broken than I thought. When using force_file_integrity_check=True, the length of the channel goes down from 137646 to 999, not to the expected 137501.

There is one warning:

mdfreader/target/lib/mdfreader/mdf4reader.py:2693: UserWarning: Checking last DL block
  warn('Checking last DL block')

There are multiple warnings like this:

mdfreader/target/lib/mdfreader/mdf4reader.py:2894: UserWarning: Data Blocks have a length of 246000 while it is expected to be 33985146
  warn(f'Data Blocks have a length of {data_size_in_file} '

Since the latter appears multiple times, I assume there are multiple data blocks that aren't read in, leading to the far smaller data channel?

ratal commented 3 years ago

Hi @ludwig-nc, sorry for the late reply. Regarding the warnings: force_file_integrity_check=True forces the cycle count to be recalculated for each data group based on the surrounding block positions. This is actually a bit tricky if the writer does not write the blocks compactly, because reading the dead space in between can surface random data. Anyway, here it means that a cycle count mismatch is detected for every data group, which is probably wrong. 999 is indeed too small, not the expected 137501 if MDFValidator is correct. There is probably an issue in the function file_finalization() at the end of mdf4reader.py; I wrote it without any reference or faulty file to fix against, just an estimate of what could be right. I guess the code from line 2777 has an issue. You could compare what MDFValidator gives for the block positions against the values of data_block_ending_position, data_size_in_file and expected_data_size for the faulty block.

ludwig-nc commented 3 years ago

Hi @ratal, sorry that I can't just provide the data; I guess that would make this whole process a lot easier. The code in line 2777 is actually not called in my case. The block_type always seems to be of type '##HL'.

It seems that when I use force_file_integrity_check=True, the expected_data_size differs from data_size_in_file by quite a bit for all 13 data groups in the file. MDF-Validator, on the other hand, is only complaining about one specific data group (11). I posted the error message above:

"calculated length from cycle count of child CG blocks does not match length of data block, possibly wrong cycle counters (count too high). Calculated length of data: 26978616, actual length: 26950196, expected cycles: 137501"

The calculated length reported by MDF-Validator (26978616) corresponds to the expected_data_size of this specific data group. The data_size_in_file seems to be far off from the actual length though (data_size_in_file=196000).

Here is also the output of print("#", expected_data_size, data_size_in_file, dz_ending_position, hex(data_block_starting_position)) for all 13 data groups. Like I mentioned, expected_data_size always differs from data_size_in_file, even though it should only fail for group 11.

######
b'##HL'
# 33985146 246000 275496 0x252a0
b'##HL'
# 4697117 34000 301272 0x45560
b'##HL'
# 19202989 139004 412960 0x53c90
b'##HL'
# 1796015 13004 462008 0x6f240
b'##HL'
# 243677789 1763804 3116976 0x2219b0
b'##HL'
# 22104008 160000 3198512 0x2f94a0
b'##HL'
# 4420808 32000 3216040 0x30d318
b'##HL'
# 2210408 16000 3225576 0x311798
b'##HL'
# 1022347 12363 3239560 0x315550
b'##HL'
# 386834 12298 3250064 0x317e78
b'##HL'
# 26978616 196000 3390176 0x323b00
b'##HL'
# 6056391 44000 3414672 0x33c390
b'##HL'
# 115621842 840000 3841416 0x3433d8

I hope this is helpful in any way.

danielhrisca commented 3 years ago

Sorry that I can't just provide the data; I guess that would make this whole process a lot easier.

@ludwig-nc what about if you could scramble all the text blocks so that the channel names and comments are just random letters? This way you can anonymize the data

ratal commented 3 years ago

Hi @ludwig-nc ,

HL blocks, the most complicated... I did a commit on the dev branch that should improve the situation.

ludwig-nc commented 3 years ago

Hi @ratal, it did indeed improve the situation: the expected and actual sizes are now quite similar, but unfortunately force_file_integrity_check is still trying to correct all the data groups, not just group 11 (see expected and actual group sizes below). The group in question now has a length of 131499, which is pretty close to the size reported by MDF-Validator (137501). Any idea what debug statements I should add that might be helpful?

print(expected_data_size, data_size_in_file)
33985146 32472000
4697117 4488000
19202989 18348004
1796015 1716004
243677789 232821604
22104008 21120000
4420808 4224000
2210408 2112000
1022347 976287
386834 368766
26978616 25774000
6056391 5786000
115621842 110460000

@danielhrisca I wouldn't know how to do that, I'm afraid :) other than reading the file with mdfreader and then writing it again.

danielhrisca commented 3 years ago

try this


import re
import struct
from pathlib import Path
from random import randint
from shutil import copy

def randomized_string(size):
    """get a \0 terminated string of size length

    Parameters
    ----------
    size : int
        target string length

    Returns
    -------
    string : bytes
        randomized string

    """
    return bytes(randint(65, 90) for _ in range(size - 1)) + b"\0"

def scramble_mf4(name):
    """scramble text blocks and keep original file structure

    Parameters
    ----------
    name : str | pathlib.Path
        file name

    Returns
    -------
    name : pathlib.Path
        scrambled file name

    """

    name = Path(name)

    # match the block IDs of text (##TX) and metadata (##MD) blocks
    pattern = re.compile(
        rb"(?P<block>##(TX|MD))",
        re.DOTALL | re.MULTILINE,
    )

    texts = {}

    with open(name, 'rb') as stream:

        stream.seek(0, 2)
        file_limit = stream.tell()
        stream.seek(0)

        for match in re.finditer(pattern, stream.read()):
            start = match.start()

            if file_limit - start >= 24:
                # block length is a little-endian uint64 at offset 8 of the 24 byte header
                stream.seek(start + 8)
                size = struct.unpack('<Q', stream.read(8))[0]

                if start + size <= file_limit:
                    # overwrite only the payload that follows the 24 byte block header
                    texts[start + 24] = randomized_string(size - 24)

    dst = name.with_suffix(".scrambled.mf4")

    copy(name, dst)

    with open(dst, "rb+") as mdf:
        for addr, bts in texts.items():
            mdf.seek(addr)
            mdf.write(bts)

    return dst

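A possible call, using a hypothetical file name; the scrambled copy keeps the original block layout, so the length problem should still be reproducible in it:

scrambled = scramble_mf4('recording.mf4')   # writes recording.scrambled.mf4
print(scrambled)
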
ratal commented 3 years ago

Hi @ludwig-nc, I tried to go in depth on what could happen with several kinds of files close to yours (with HL blocks), but I could not find any issues. Your file might have a specificity I did not consider or could not test. Of course, all the (good) files I tested have equal expected_data_size and data_size_in_file. I tried to fake unfinalised files, but I still cannot reproduce your issue... I am a bit clueless here.

ludwig-nc commented 3 years ago

Hi @ratal, thanks a lot for all your efforts! I understand that this is next to impossible to debug without the data files at hand, but thanks for giving it a try. If this develops into a more common issue on our end, I will discuss the possibility of sharing the data with you. I am closing this issue for now.