syntalos / edlio

Experiment Directory Layout
https://edl.readthedocs.io
GNU Lesser General Public License v3.0

Error in reading Tsync file #1

Closed alejoe91 closed 4 years ago

alejoe91 commented 4 years ago

Hi @ximion

I'm having trouble reading the Tsync file in the intan-signals folder. I'm trying to use edlio for the entire parsing, but I haven't found documentation on how to use the package, so here is my best-guess code:

import edlio

syntalos_folder = 'path_to_syntalos_folder'

# load edl file
edlfile = edlio.load(syntalos_folder)

# get intan-signals dataset
intan_signals = edlfile.dataset_by_name('intan-signals')

# retrieve tsync file
tsync = [t for t in intan_signals.read_aux_data()]

When doing this, I get the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-29-9ff7a2c3b3a9> in <module>
----> 1 for a in aux:
      2     print(a)

~/Documents/Codes/catalyst/lab_conversions/mease-lab-to-nwb/edlio/edlio/dataio/tsyncfile.py in load_data(part_paths, aux_data)
    183     ''' Entry point for automatic dataset loading '''
    184     for fname in part_paths:
--> 185         tsync = TsyncFile(fname)
    186         yield tsync

~/Documents/Codes/catalyst/lab_conversions/mease-lab-to-nwb/edlio/edlio/dataio/tsyncfile.py in __init__(self, fname)
     67         self._times = np.empty((0,3))
     68         if fname:
---> 69             self.open(fname)
     70 
     71     @property

~/Documents/Codes/catalyst/lab_conversions/mease-lab-to-nwb/edlio/edlio/dataio/tsyncfile.py in open(self, fname)
    140             self._time_created = datetime.utcfromtimestamp(ts)
    141             self._tolerance_us, = struct.unpack('<I', f.read(4))
--> 142             self._generator_name = read_utf8_bin_string(f)
    143 
    144             json_raw = read_utf8_bin_string(f)

~/Documents/Codes/catalyst/lab_conversions/mease-lab-to-nwb/edlio/edlio/dataio/tsyncfile.py in read_utf8_bin_string(f)
     47         raise Exception('String length in binary too long ({}).'.format(length))
     48 
---> 49     return str(f.read(length), 'utf-8')
     50 
     51 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 138: invalid continuation byte

I also tried commenting out this line, but read_utf8_bin_string seems to fail every time it is called. I tried a different encoding as well, with no luck so far. Any idea what's going wrong? Instantiating the TsyncFile object directly gives the same error.

Second question: am I using the API as it's intended? Or is there another way to access the different datasets/groups?

Thanks! Alessio

ximion commented 4 years ago

First, you are using the API as intended, although in most cases you can just use read_data() and the auxiliary information will be applied automatically. Your bug, though, smells like either data corruption or, much more likely, a dataset that was recorded with a really old version of Syntalos. The tsync format was changed once, just two weeks after it was introduced in an experimental version of Syntalos, and your issue really looks like you are trying to read the old format with the new parser. Could you attach the tsync file here?
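To illustrate why a layout mismatch tends to surface as a UnicodeDecodeError, here is a minimal sketch of a length-prefixed string reader. It is not edlio's read_utf8_bin_string: the 4-byte little-endian length prefix, the function name read_length_prefixed_utf8 and the MAX_STRING_BYTES limit are all assumptions made for illustration.

```python
import io
import struct

MAX_STRING_BYTES = 4096  # hypothetical sanity limit, not taken from edlio


def read_length_prefixed_utf8(f):
    """Read a string stored as <uint32 LE byte length><UTF-8 bytes> (assumed layout)."""
    header = f.read(4)
    if len(header) != 4:
        raise ValueError('Unexpected end of file while reading string length.')
    (length,) = struct.unpack('<I', header)
    if length > MAX_STRING_BYTES:
        raise ValueError('String length in binary too long ({}).'.format(length))
    raw = f.read(length)
    if len(raw) != length:
        raise ValueError('Unexpected end of file while reading string data.')
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError as e:
        # If the parser expects a different header layout than the file has,
        # it ends up decoding arbitrary binary fields as text.
        raise ValueError('Bytes are not valid UTF-8; file layout mismatch?') from e


# A well-formed field round-trips.
good = io.BytesIO(struct.pack('<I', 8) + b'Syntalos')
print(read_length_prefixed_utf8(good))  # prints "Syntalos"

# Binary data where a string was expected fails with a clearer message.
bad = io.BytesIO(struct.pack('<I', 2) + b'\xe0\x28')
try:
    read_length_prefixed_utf8(bad)
except ValueError as e:
    print(e)
```

Wrapping the decode error this way does not make an old file readable; it only makes the likely cause easier to spot.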

alejoe91 commented 4 years ago

Hi @ximion

Thanks for the quick reply. I suspected it might be a compatibility issue. Attached is the .tsync file: e3ea_data.tsync.zip (I had to zip it because the extension is not supported by GitHub...).

I'd like to skip read_data() because it reads everything into memory. We use SpikeInterface to lazily read the .rhd file, so I would only be interested in the timestamps :)

Thanks again Alessio

alejoe91 commented 4 years ago

When I use the read_data() the timestamps are correctly loaded.

data = [d for d in intan_signals.read_data()][0]

print(data['t_aux_input'])

prints:

[0.00000000e+00 1.33333333e-04 2.66666667e-04 ... 9.15999600e+02
 9.15999733e+02 9.15999867e+02]

Are these timestamps in sync with the events table already?

ximion commented 4 years ago

Attached is the .tsync file

Yes, this is indeed an ancient file - I would not use it because at the time, synchronization also wasn't that great - so unless there are a lot of these files and sync issues exist, I wouldn't make the effort to use it.

I'd like to skip read_data() because it reads everything into memory. We use SpikeInterface to lazily read the .rhd file, so I would only be interested in the timestamps :)

Lazy-loading is on the todo list for edlio as well (actually, almost everything is already loaded on demand except for the RHD files).

Are these timestamps in sync with the events table already?

I think there is some misconception about what a tsync file actually is. These files operate in two modes, continuous and syncpoints. In the former case, you get a 1:1 mapping of timestamps, while in the latter case you only get the specific points at which the time synchronization detected a divergence and set a point to resynchronize the two clocks. Older versions of the tsync format are always in syncpoints mode, and files written by the Intan synchronizer use that mode even with the newer file revision. So you will have to use the synchronization points to do the synchronization yourself in postprocessing (ideally this would have been done in the Intan file format, but doing so would break every tool reading these modified RHD files, as the change can't be made in a backwards-compatible way).

I do have code for all of this; maybe I'll find the time to clean it up a bit and push it to edlio (in that case you would automagically just get the right timestamps when using read_data() on the Intan signals dataset).

alejoe91 commented 4 years ago

@ximion thanks for the clarification.

Also, I'm trying to interpret the tsync info. When I print tsync.times, I get:

array([[        0,  15000000,  14999708],
       [        1, 538914000, 538911829],
       [        2, 749378000, 749375267],
       [        3, 863042000, 863038872],
       [        4, 863670000, 863666749],
       [        5, 864298000, 864294734]])

I guess the first element is the index of the sync event. Which of the second and third elements is the resynchronization point? To be more specific, for the first sync point ([0, 15000000, 14999708]), does it mean that the 14999708 clock frame should be resynchronized to 15000000, or vice versa?

Could you share with me the code to perform the synchronization from the tsync data?

ximion commented 4 years ago

To be more specific, for the first sync point ([0, 15000000, 14999708]), does it mean that the 14999708 clock frame should be resynchronized to 15000000, or vice versa?

Fortunately, TSync files contain quite a few annotations to help with "what was that data again?" questions (I have to look these things up myself quite often). For example, you can run print(ts.time_labels) to get a label for the first and second time column. The first time is the secondary clock time, while the second time is the master clock time. The secondary clock timepoint should be aligned with the master timepoint, and all subsequent times should be shifted by the current time offset until another sync point is hit. This is implemented in edlio now as well. My implementation is still a bit crude (I extracted it straight from one of my analysis scripts), but it is at least a lot faster now, even for very large datasets (4+ hours of recordings). The Python module abstracts all of the time synchronization away, so if you want to, you can completely ignore the tsync files and just call dset.read_data() to obtain data blocks matching the individual Intan file slices, and then get the timestamps via the times_amplifier_ms key. The sync implementation is in read_rhd.py.
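For anyone who wants to apply the sync points by hand, the rule described above can be sketched roughly as follows. This is not the code from read_rhd.py. The function name apply_sync_points is hypothetical, and the sketch assumes that the rows look like the tsync.times output quoted earlier (index, secondary clock time, master clock time), that the sync-point times refer to the uncorrected secondary clock, and that timestamps before the first sync point are left unshifted.

```python
import numpy as np


def apply_sync_points(timestamps, sync_points):
    """Shift secondary-clock timestamps onto the master clock.

    sync_points: rows of (index, secondary_time, master_time), sorted by
    secondary_time. Timestamps use the same unit as the sync points.
    """
    timestamps = np.asarray(timestamps, dtype=np.int64)
    sync_points = np.asarray(sync_points, dtype=np.int64)
    secondary = sync_points[:, 1]
    master = sync_points[:, 2]

    # Offset that becomes active at each sync point; a leading 0 covers
    # timestamps that occur before the first sync point (an assumption).
    offsets = np.concatenate(([0], master - secondary))

    # Number of sync points at or before each timestamp selects the offset
    # that is "current" for that timestamp.
    active = np.searchsorted(secondary, timestamps, side='right')
    return timestamps + offsets[active]


# First two sync points from the tsync.times output earlier in this thread.
sync_points = [
    [0, 15000000, 14999708],
    [1, 538914000, 538911829],
]
print(apply_sync_points([0, 15000000, 20000000, 538914000], sync_points))
# prints [        0  14999708  19999708 538911829]
```

The offset is piecewise constant: each timestamp is shifted by the offset of the most recent sync point, which matches the "aligned by the current time offset until another sync point is hit" description but ignores any drift between sync points.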

I'll need to run a few more tests before I can update the PyPI module.

alejoe91 commented 4 years ago

Thank you so much @ximion!!!