uwmadison-chm / bioread

Utilities to work with files from BIOPAC's AcqKnowledge software
MIT License

UnicodeDecodeError with Python 2.7 #27

Closed benoitvalery closed 3 years ago

benoitvalery commented 4 years ago

Hi, when I try to read an .acq file, it raises a UnicodeDecodeError. Following what you suggested on issue #24, I'm using Python 2.7. The AcqKnowledge version is at least 4.x.

#!/usr/bin/env python
#-*- coding:utf-8 -*-

import bioread
data = bioread.read_file('test_github.acq')

gives me...

Traceback (most recent call last):
  File "convert_acq_to_txt.py", line 15, in <module>
    data = bioread.read_file('test_github.acq')
  File "/home/bvaler01/.local/lib/python2.7/site-packages/bioread/__init__.py", line 26, in read
    return reader.Reader.read(filelike, channel_indexes).datafile
  File "/home/bvaler01/.local/lib/python2.7/site-packages/bioread/reader.py", line 74, in read
    reader._read_headers()
  File "/home/bvaler01/.local/lib/python2.7/site-packages/bioread/reader.py", line 134, in _read_headers
    samples_per_second=self.samples_per_second)
  File "/home/bvaler01/.local/lib/python2.7/site-packages/bioread/biopac.py", line 40, in __init__
    self.channels = self.__build_channels()
  File "/home/bvaler01/.local/lib/python2.7/site-packages/bioread/biopac.py", line 92, in __build_channels
    self.channel_headers, self.channel_dtype_headers)
  File "/home/bvaler01/.local/lib/python2.7/site-packages/bioread/biopac.py", line 150, in from_headers
    name=chan_hdr.name,
  File "/home/bvaler01/.local/lib/python2.7/site-packages/bioread/headers.py", line 293, in name
    return self.data['szCommentText'].decode('utf-8').strip('\0')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 3: invalid start byte

Thanks for your help, and your work.

njvack commented 4 years ago

We're about to release version 2.0, which will only support python 3.6+. A reasonable amount of that is because handling unicode in both py2 and py3 suuuuuuuuuuucks.

So, this should be fixed and hopefully py3 is "everywhere" enough that this works for folks.

benoitvalery commented 3 years ago

Just updated bioread this morning to 2.0

pip install bioread==2.0

and tried to run acq2txt test_github.acq, but the error is much the same

Traceback (most recent call last):
  File "/home/bvaler01/.local/bin/acq2txt", line 8, in <module>
    sys.exit(main())
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/runners/acq2txt.py", line 49, in main
    amr.run()
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/runners/acq2txt.py", line 68, in run
    data = bioread.read(infile, channel_indexes=channel_indexes)
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/__init__.py", line 27, in read
    encoding=encoding).datafile
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/reader.py", line 79, in read
    reader._read_headers()
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/reader.py", line 139, in _read_headers
    samples_per_second=self.samples_per_second)
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/biopac.py", line 43, in __init__
    self.channels = self.__build_channels()
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/biopac.py", line 113, in __build_channels
    self.channel_headers, self.channel_dtype_headers)
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/biopac.py", line 112, in <listcomp>
    for ch, cdh in zip(
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/biopac.py", line 171, in from_headers
    name=chan_hdr.name,
  File "/home/bvaler01/.local/lib/python3.7/site-packages/bioread/headers.py", line 306, in name
    return self.data['szCommentText'].decode(self.encoding).strip('\0')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 3: invalid start byte

except it is now python3.7

njvack commented 3 years ago

Can you send a copy of the file that doesn't work?

benoitvalery commented 3 years ago

Sure, here it is! It's probably an old one (2.x), but I still have to convert it to .txt.

njvack commented 3 years ago

Hi again! Thank you for your patience — I'm actually poking around in this code today. I didn't save a copy of that file, and it's gone off that link. Can you post it again?

I suspect the code I'll commit this morning will fix it — I'm betting you have a file from an old-ish version of acqknowledge with a µ in the units or channel name?
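For context on that guess: byte 0xb5 is the micro sign (µ) in Latin-1 and Windows-1252, but in UTF-8 it is a continuation byte and cannot start a character, which is what "invalid start byte" means. A minimal sketch follows; the sample bytes are made up, and which legacy encoding old AcqKnowledge versions actually wrote is an assumption here. It shows only how such a byte behaves, not what was wrong with this particular file.

```python
# Hypothetical units field: "µV" stored in a single-byte legacy encoding,
# NUL-padded like the fixed-width header fields in the traceback.
raw = b"\xb5V\x00\x00\x00\x00"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    # 0xb5 is a UTF-8 continuation byte, so it cannot begin a character.
    utf8_ok = False
    print("utf-8 failed:", err.reason)

# Latin-1 maps every possible byte to a code point, so this cannot raise.
units = raw.decode("latin-1").strip("\0")
print(units)
```

Note that Latin-1 never fails, so a successful decode says nothing about whether the bytes were really text in the first place.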

njvack commented 3 years ago

I've released 2.1.0 which should, hopefully, fix this for real. Note that we're only officially supporting Python 3.6+ from here on out. Not saying things won't work with Python 2, but there's a good shot that they won't.

benoitvalery commented 3 years ago

Thanks for your update. Importing again raises an error, but the encoding error has gone.

Traceback (most recent call last):
  File "main.py", line 35, in <module>
    bioread.read(str(f))
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/__init__.py", line 26, in read
    return reader.Reader.read(filelike, channel_indexes).datafile
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/reader.py", line 79, in read
    reader._read_headers()
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/reader.py", line 142, in _read_headers
    samples_per_second=self.samples_per_second)
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/biopac.py", line 43, in __init__
    self.channels = self.__build_channels()
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/biopac.py", line 113, in __build_channels
    self.channel_headers, self.channel_dtype_headers)
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/biopac.py", line 112, in <listcomp>
    for ch, cdh in zip(
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/biopac.py", line 173, in from_headers
    fmt_str=dtype_hdr.numpy_dtype,
  File "/home/benoit/.local/lib/python3.7/site-packages/bioread/headers.py", line 447, in numpy_dtype
    return self.byte_order_char + self.CODE_MAP[self.type_code]
KeyError: -8192

Here is the file I'm trying to read. It's the same as previously. Thanks for your time.

njvack commented 3 years ago

Well. I'm completely baffled. It's definitely not reading the file correctly (no surprise there). I'm not at all sure why, though. It's like the first channel's header data is completely weird, and the second is okay-ish? That's weird, because if anything I would expect the opposite, but... hm

And then the data type headers are weird too, except neither looks totally plausible. But if we're reading the data out of the wrong part of the file that would happen. They're not totally bizarre though, which is even more surprising. It's not like the wrong byte order would cause this...

I'm not going to have the time and energy to debug this one, unfortunately. If I had a day or two and a copy of acqknowledge on-hand I could probably figure it out, but I can't devote that kind of time to this issue, and my copy of acqknowledge isn't somewhere I can get at it easily during the pandemic.

If you want to take a stab at fixing this yourself, let me know, but this is probably going to be pretty hard to fix. First step would be 100% confirming that this really, really, really does open in acqknowledge. It doesn't look like a totally horked-up file but the header stuff is also pretty weird.

For what it's worth, I believe this is acq 4.4 on macOS.

njvack commented 3 years ago

For an example of the weirdness:

Channel header 0: {'lChanHeaderLen': 40, 'nNum': 16363, 'szCommentText': b'T(\x04\xb5B\x80A"F+\x14\x99\x02\xf1@\xaeH\xdf\xa1\xd6\xcd\xf7@\x11\x9f\xf9\xedg\x0b\x93\x00\x01\x00\x00\x00\x00\x07"\x00\x02', 'notColor': (69, 67, 71, 0), 'nDispChan': 0, 'dVoltOffset': 0.0, 'dVoltScale': 0.0, 'szUnitsText': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'lBufLength': 2162690, 'dAmplScale': -0.004119873046875, 'dAmplOffset': 0.125, 'nChanOrder': 22127, 'nDispSize': 27764, 'unknown': b's\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\xb2\xe7?4\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x05\x8d', 'nVarSampleDivider': 0}
Channel header 1: {'lChanHeaderLen': 1826, 'nNum': 2, 'szCommentText': b'ECG\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'notColor': (0, 0, 0, 33), 'nDispChan': 2, 'dVoltOffset': -0.004119873046875, 'dVoltScale': 0.125, 'szUnitsText': b'Volts\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'lBufLength': 307943, 'dAmplScale': 0.00030517578125, 'dAmplOffset': 0.0, 'nChanOrder': 2, 'nDispSize': 1421, 'unknown': b'\x00\x00\x00\x00\xff\xff\x00\x00\xbf\xf7\x00\x00\x87\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x10\x00\x00\x00\x00\x00\x00', 'nVarSampleDivider': 1}

njvack commented 3 years ago

It's like channel 0 is totally wrong but channel 1 is normal? I don't even

benoitvalery commented 3 years ago

I'm actually processing .acq files for someone else, so I don't have AcqKnowledge installed on my machine (plus I'm on Linux). I will ask which version was used. What's fun is that I have a series of files to process, and only some of them work... Just in case you want to take a look: here is an example of a file that works and one that doesn't. If I don't need AcqKnowledge to investigate, I may give it a try if you give me some guidelines. Thanks

njvack commented 3 years ago

Well, I don't know if it's any consolation, but both of the non-working files look like they're not working for similar reasons: both of them have okay data for the second channel (ECG) but really weird stuff for the first channel. If all three files are fundamentally similar, then it looks like the first channel is supposed to be Respiration. But it definitely isn't in datafile.acq. There, it looks like ECG is the first channel, and Resp is the second one, and that somehow it's reading the... hm. It's like there's an extra 40-byte header after the graph header in one of the files but not the other. That's... very strange. It's mentioned in the file_structure pdf but I haven't actually seen it before.

You don't need acqknowledge to do this work, but it's very useful to be able to make sure the files really do open without error (maybe something bad happened to them?) and to be able to save the files as a different version to have things to compare.

Anyhow, if you want to investigate this yourself, you'll want a hex editor, a copy of the main branch of bioread (acq_info -d will print a ton of debugging information including file offsets where it's trying to read headers) and a lot of patience. This file format document will possibly help too, though that's as reverse-engineered as everything else here, so it may have errors of its own.

One other thing that might help is that AcqKnowledge can save files as previous versions (3.8.1, maybe?), so re-saving the files as that version might either let you compare and debug, or just hm

njvack commented 3 years ago

huh there's already a special handling for a 40-byte unknown header, this non-working file just... has... two of them??

njvack commented 3 years ago

Hm.......... okay. main is a little closer to being able to read these files. Turns out there's a 40-byte extra header after the graph header for some file versions; I'd just been skipping 40 extra bytes but that header can be repeated sometimes, it seems.

This fixes 02021, but datafile is still broken, it looks like the foreign data header for that one is wrong. Gaaahhhh.

Anyhow this is not what I should be working on today

benoitvalery commented 3 years ago

After saving the non-working file with AcqKnowledge 5.0, it works like a charm. Actually, both the working and non-working files seemed to have the same structure: Channel 2 was ECG, Channel 3 was Respiration. So it's not intellectually satisfying, since I can't see why two files that look very similar don't behave the same way in your program. Many thanks for your extra work, but I think I will go this way for now... just save the files in a more recent version. It's faster and will save you the effort. Thanks.

njvack commented 3 years ago

That's super weird — I would have expected whatever oddness was in the file to have propagated up to the new version.

Anyhow, long story short: There's obviously something in the structure of the files that the software doesn't handle yet. (Very likely, there's more than one something.) Probably with another day of screwing around, I'll be able to figure that out. This has already led to one special case of "just skip forward 40 bytes!" getting excised out of the code, though, so I'll take that as a win.

It's probably worth fixing the weird-and-obviously-wrong handling of the foreign data header (though I don't think that'd fix the weird file yet); if I get around to it, I'll add that little change and call it a day.

Thanks for sharing these files! If you can upload the version 5 files somewhere (with similar names to the other ones!) it might be helpful in matching things up to figure out what's what.

For now I'm gonna close this issue.

njvack commented 3 years ago

Okay, I know I keep saying I'm gonna stop messing. But try 2.1.1; this handles all the files you've sent, I believe.

benoitvalery commented 3 years ago

Many thanks. I must admit I do not understand the whole thing, but 2.1.1 fixes the problem. It now works on a dataset that actually contains 183 files, among which were the two files I uploaded. Here are the two files, in their version 5 (file_1, file_2). So! I'm now going to do some data processing!! Thanks again

njvack commented 3 years ago

I'm super glad it works! tl;dr: It's fixed but it's still a hack.

In case you care about how this got fixed: The files are binary data, with a bunch of header sections followed by a big old block of data. The headers have information about how long it is, how the channels are named, and what the format of the data is. The data starts right after the last header; there's no explicit "here's where to find the data!" information in the headers.

Anyhow, the structure of these headers isn't documented. It was, once, but things have changed since then. So we've had to figure out things by trial and error. Normally it's pretty obvious how long a header is (the first thing in a header is normally how long it is, and you can just find the next header right after) but sometimes that doesn't seem to be the case, and we figure it out by looking at the file and saying something like "well it looks like this header is 40 bytes longer than it says for this file version, so let's just read an extra 40 bytes" and that works for all the test files we have.
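As a rough illustration of the "first thing in a header is how long it is" convention, here is a minimal sketch. The layout is a toy assumption (a 4-byte little-endian length that includes itself), not the real .acq structure, and `walk_headers` is a hypothetical helper, not a bioread function.

```python
import io
import struct

def walk_headers(stream, count):
    """Yield (offset, length) for `count` length-prefixed headers.

    Assumed toy layout: each header begins with a 4-byte little-endian
    integer holding the header's total length, including that field.
    """
    for _ in range(count):
        offset = stream.tell()
        (length,) = struct.unpack("<i", stream.read(4))
        if length < 4:
            raise ValueError(f"implausible header length {length} at offset {offset}")
        yield offset, length
        # The next header is assumed to start right after this one.
        stream.seek(offset + length)

# Two fake headers: 12 bytes long and 8 bytes long.
blob = struct.pack("<i", 12) + b"A" * 8 + struct.pack("<i", 8) + b"B" * 4
headers = list(walk_headers(io.BytesIO(blob), 2))
print(headers)
```

If a header is really longer than its length field claims, every later offset in this walk is wrong, which is the situation described above.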

Then sometime later we get a different file and figure out "hey look there is actually a 40-byte header there and another field in the previous header that says how many of those to expect" or something like that. At least, we've always been able to figure it out so far. It's a lot of comparing headers from different files and seeing where a change looks likely to be. It's fun, actually, but it takes a fair bit of time.

In this case, there were three 40-byte headers where before I'd only seen one (I don't even know what this header is for at all), and the first time I'd seen one of the other headers with any data in it, in one of these files. That other header says how long it is, but it's 80 bytes short, for some reason, and I don't have a super clear idea of why. There might be something in one of the other headers that has it but I'm not sure where it is...

benoitvalery commented 3 years ago

Very interesting, you made it clearer to me. Just to be sure I understand, do you mean the header is impossible to decode? Why can't you just read and decode line by line, and stop treating what you read as header once you decode something that looks like data? Does this parameter (header length) have to be hard-coded?

njvack commented 3 years ago

It's definitely not impossible to decode; the acqknowledge software certainly does it just fine 😄 We'd probably never have figured it out, except that Biopac published the format of a very old version of the files and we've been able to look at changes over time and keep up with it.

But it isn't like a text file where there are "lines," it's just... a big stream of bytes. Some of the bytes you interpret as navigation information, some of it as data, interpretable based on the information you read in some of the other headers, and then there's other metadata (channel markers and journal in particular) after that. The data doesn't look special, it's just... more bytes. Some of the headers look "distinctive": in the data file example you shared, the "how are the data channels formatted on disk" header looks like 00 02 00 02, which is easy to spot by eye, but that's unique to every arrangement of channels.

Reading these files is kind of like navigating a city with directions like "go straight 400 meters. Turn left, go 200 meters, read the sign next to you, it has three numbers that tell the direction you need to turn and the speed you'll be driving at next, and how long you'll be traveling at that speed. Turn that direction and travel at that speed for that amount of time. Read the sign next to you..."

And there are signs with numbers at literally every meter of every road, so if you read the wrong sign at any point (sometimes the listed speed will be something like 12849817457) things go Very Wrong immediately.

Here's the code to read the headers (https://github.com/uwmadison-chm/bioread/blob/main/bioread/reader.py#L111) if you want to see how it works in practice.
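Separately from the real reader code, here is a toy illustration of the "wrong sign" problem. The record layout is made up for the example (a 2-byte number followed by an 8-byte float); the point is only that the same bytes unpack without complaint at the wrong offset and give plausible-looking nonsense.

```python
import struct

# Toy file: 3 bytes of padding, then a record holding a 2-byte channel
# number and an 8-byte floating-point scale (layout is an assumption).
blob = b"\x00\x00\x00" + struct.pack("<hd", 2, 0.125)

# Reading at the correct offset gives the values that were written.
good = struct.unpack_from("<hd", blob, 3)

# Reading one byte too early still "works": no error, just wrong numbers.
bad = struct.unpack_from("<hd", blob, 2)

print("correct offset:", good)
print("off by one:    ", bad)
```

Nothing in the bytes themselves marks the second result as invalid, which is why a reader has to rely on plausibility checks rather than on decode errors.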

dgfitch commented 3 years ago

+1 to this analogy, also new versions of the format are like reorganizing the city, as well as the signs that point around it, without documenting what changed.

njvack commented 3 years ago

... and one of the specific changes in this version was to go from "well, it seems like the next signpost is 40 meters ahead" to "it looks like we want to go 120 meters ahead in this file; it turns out that one of the signposts a few meters back that said 1 in every other file says 3 in that file, and the signs at meters 0, 40, and 80 each say 40, so... maybe that previously-ignored signpost is telling us how many of these other signposts to look for, and the other ones are telling us where to look for the next one?"

...which seems to work, though I still don't have any idea what's in those blocks of data or why there are three of them in this one file, or why converting up to version 5 seems to have made those extra blocks go away.
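A minimal sketch of the difference between the two approaches, under a made-up layout: each block starts with a 4-byte little-endian length of 40, and the block count (3 here) is assumed to have been read from an earlier header. None of these names come from bioread.

```python
import io
import struct

BLOCK_LEN = 40

def make_block():
    # Toy block: a 4-byte little-endian length (40), then zero padding.
    return struct.pack("<i", BLOCK_LEN) + b"\x00" * (BLOCK_LEN - 4)

def skip_blocks(stream, block_count):
    """Skip `block_count` blocks, using each block's own length field."""
    for _ in range(block_count):
        start = stream.tell()
        (length,) = struct.unpack("<i", stream.read(4))
        stream.seek(start + length)
    return stream.tell()

# A file with three blocks followed by the content we actually want.
blob = make_block() * 3 + b"NEXT"

# Old approach: always skip 40 bytes. This lands on the second block.
hard_coded = blob[BLOCK_LEN:BLOCK_LEN + 4]

# Count-driven approach: skip as many blocks as the earlier field says.
stream = io.BytesIO(blob)
end = skip_blocks(stream, 3)
count_driven = stream.read(4)

print(hard_coded, end, count_driven)
```

With a single block both approaches agree, which would explain why the fixed skip held up until a file with more blocks appeared.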

Woo!

At least they seem to not be changing things too much now, I don't think there are any specific version 5 things in the code at all.

For added fun, check out the code to read uncompressed data which Biopac's own documentation is all "don't bother trying to understand this, it's too complicated."

njvack commented 3 years ago

Okay. I am really, truly, going to be done with this! But I had a realization about this and it seems to have worked out!

The headers we're looking for in this case are a very specific format, so we can know if they're valid or not (they're two numbers; one of them has three possible values, one of them depends on the other) and there are exactly as many of them as there are channels — so the newest version of bioread works like this:

  1. Start right after the foreign data header
  2. Try to decode all the channel data type headers
  3. If they're all good, yay!
  4. Otherwise, skip forward one byte
  5. Goto 2, up to 4096 times.

This works on all the files I've seen so far (I had one other failing file from another person), and eliminates any of the "welp let's just go forward an extra 80 bytes!" hacks.
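The steps above can be sketched roughly as follows. This is not the bioread implementation: the validity table is hypothetical (the thread only says one number has three possible values and the other depends on it), the 4-byte big-endian pair layout is an assumption, and `find_dtype_headers` is an illustrative name.

```python
import struct

# Hypothetical validity table: type code -> required size in bytes.
# The real codes and sizes are not given in this thread.
SIZE_FOR_TYPE = {1: 8, 2: 2, 3: 4}
MAX_SKIP = 4096

def try_decode(buf, offset, channel_count):
    """Return a list of (size, type_code) pairs, or None if any is invalid."""
    pairs = []
    for i in range(channel_count):
        pos = offset + 4 * i
        if pos + 4 > len(buf):
            return None
        size, type_code = struct.unpack_from(">hh", buf, pos)
        if SIZE_FOR_TYPE.get(type_code) != size:
            return None
        pairs.append((size, type_code))
    return pairs

def find_dtype_headers(buf, start, channel_count):
    """Scan forward from `start`, one byte at a time, until all headers validate."""
    for skip in range(MAX_SKIP + 1):
        pairs = try_decode(buf, start + skip, channel_count)
        if pairs is not None:
            return start + skip, pairs
    raise ValueError("no valid channel data type headers found")

# 80 bytes of junk, then two headers that look like 00 02 00 02 on disk.
buf = b"\xff" * 80 + struct.pack(">hhhh", 2, 2, 2, 2) + b"\x00" * 16
offset, pairs = find_dtype_headers(buf, 0, 2)
print(offset, pairs)
```

The scan is only as reliable as the validity check: with few channels, junk bytes could validate by coincidence before the real headers are reached, so requiring every channel's header to pass is what keeps false matches unlikely.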