harp-tech / protocol

Description of the Harp protocol.
https://harp-tech.org/protocol/BinaryProtocol-8bit.html
MIT License

Define a data logging/ingestion format and spec #41

Open bruno-f-cruz opened 8 months ago

bruno-f-cruz commented 8 months ago

Summary

One of the goals of the Harp ecosystem is to define a data format and specification that allows users to log their data in a stable and shareable format.

Current Implementations

At the Allen

The current implementation at the Allen follows this pattern: https://allenneuraldynamics.github.io/Bonsai.AllenNeuralDynamics/articles/core-logging.html#harp-data

Essentially, all messages from a single device are grouped by register and saved in their respective binary files. The binary files are currently named as in the following example:

├───Behavior.harp
│       Register__AnalogData.bin
│       Register__AssemblyVersion.bin
│       Register__Camera0Frame.bin
│       Register__Camera0Frequency.bin
│       Register__Camera1Frame.bin
│       Register__Camera1Frequency.bin
│       Register__ClockConfiguration.bin
│       Register__CoreVersionHigh.bin
│       Register__CoreVersionLow.bin
│       Register__DeviceName.bin
.....
├───ClockGenerator.harp
│       Register__AssemblyVersion.bin
│       Register__Battery.bin
│       Register__BatteryCalibration0.bin
│       Register__BatteryCalibration1.bin
│       Register__BatteryRate.bin
│       Register__BatteryThresholdHigh.bin
│       Register__BatteryThresholdLow.bin
│       Register__ClockConfiguration.bin
│       Register__Config.bin
│       Register__CoreVersionHigh.bin
│       Register__CoreVersionLow.bin
....
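
As a rough illustration of the layout above, the following Python sketch groups raw messages by register and appends each group to its own file inside a `.harp` folder. The `Message` tuple, the `log_by_register` helper and the address-to-name mapping are hypothetical and are not part of Bonsai.AllenNeuralDynamics or harp-python; only the `Register__<name>.bin` file names are taken from the listing.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, NamedTuple


class Message(NamedTuple):
    """Hypothetical container: register address plus the raw message bytes."""
    address: int
    raw: bytes


def log_by_register(
    messages: Iterable[Message], names: Dict[int, str], root: Path
) -> Dict[str, int]:
    """Append each message to <root>/Register__<name>.bin.

    Returns the number of bytes written per file name.
    """
    root.mkdir(parents=True, exist_ok=True)
    grouped = defaultdict(bytearray)
    for msg in messages:
        # Assumption: fall back to the register number when no name is known.
        name = names.get(msg.address, str(msg.address))
        grouped[f"Register__{name}.bin"] += msg.raw
    written = {}
    for filename, payload in grouped.items():
        # Append mode keeps earlier data if the file already exists.
        with open(root / filename, "ab") as f:
            f.write(payload)
        written[filename] = len(payload)
    return written
```

Because every message type for a register lands in the same file, this sketch has the same limitation as the first problem listed below.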

This has a few problems:

  1. It does not split by event/read/write, which might be a problem given the latest discussions in #37.
  2. It does not work with the current spec of the harp-python package.
  3. It does not include the yml metadata file, making it difficult to recover the metadata associated with the device offline.

Possible solutions

bruno-f-cruz commented 8 months ago

One thing that came to mind is why we use <DeviceName> in <UserGivenName>.harp / <DeviceName>_<RegisterNumber>.bin at all. It seems to introduce an extra dependency that is not necessary. Maybe a more general name, like Register, would be better? @glopesdev

glopesdev commented 8 months ago

@bruno-f-cruz This makes it easier to search for chunks of the same device across epoch folders, as happens in the Aeon data formats. I want to keep pushing for this, as I think it is an important use case to keep compatibility for, even though it may not be used in 90% of cases.

bruno-f-cruz commented 8 months ago

I guess my question is whether it should be part of the spec or not. From the Python interface point of view it doesn't appear to add much. I wonder if we can find a way for the interface to work as long as the pattern is '*_', or whether there is an advantage in introducing this dependency and locking the spec to it. To be clear: I am not against folding it in, I just wonder if we really need to add it!
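
One way to keep the interface independent of the file prefix is to match on the suffix only. The sketch below assumes, hypothetically, that each file name ends in `_<RegisterNumber>.bin`; the `find_register_file` helper is illustrative and is not the harp-python API.

```python
from pathlib import Path
from typing import Optional


def find_register_file(folder: Path, address: int) -> Optional[Path]:
    """Return the file in `folder` whose name ends in _<address>.bin, if any.

    The prefix before the underscore is ignored, so device names and generic
    prefixes such as "Register" are treated alike.
    """
    matches = sorted(folder.glob(f"*_{address}.bin"))
    if len(matches) > 1:
        # Two prefixes for the same register number: refuse to guess.
        raise ValueError(f"ambiguous files for register {address}: {matches}")
    return matches[0] if matches else None
```

With a lookup like this, the prefix becomes a convention that projects can choose rather than something the spec has to lock in.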

glopesdev commented 3 weeks ago

@bruno-f-cruz Picking the outstanding issues from this spec:

  1. It does not split by event/read/write, which might be a problem given the latest discussions about https://github.com/harp-tech/protocol/issues/37

Do we still need this now that harp-python explicitly exposes a message type column (https://github.com/harp-tech/harp-python/pull/11)?
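
For context, the message type travels in the first byte of every Harp message, so read, write and event messages can be separated after the fact even when they share one file. The sketch below assumes the framing in the Harp binary protocol (a message type byte followed by a length byte that counts the remaining bytes) and the type codes 1 = Read, 2 = Write, 3 = Event with 0x08 as the error flag. It ignores any extended-length encoding, does not verify checksums, and is not how harp-python builds its message type column; `split_by_type` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed message type codes and error flag (see lead-in).
TYPE_NAMES = {1: "Read", 2: "Write", 3: "Event"}
ERROR_FLAG = 0x08


def split_by_type(data: bytes) -> Dict[str, List[bytes]]:
    """Split a concatenated stream of Harp messages by message type."""
    groups = defaultdict(list)
    offset = 0
    while offset + 2 <= len(data):
        msg_type = data[offset]
        length = data[offset + 1]  # bytes that follow the length byte
        end = offset + 2 + length
        if end > len(data):
            break  # truncated trailing message; stop rather than guess
        name = TYPE_NAMES.get(msg_type & ~ERROR_FLAG, "Unknown")
        if msg_type & ERROR_FLAG:
            name += "Error"
        groups[name].append(data[offset:end])
        offset = end
    return dict(groups)
```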

  2. It does not work with the current spec of the harp-python package

If we agree to change the spec to use only register numbers, then it should be fully compatible.

  3. It does not include the yml metadata file making it difficult to recover the metadata associated with the device offline

If we adopt the proposal in https://github.com/harp-tech/device.behavior/pull/21 then we will have a trivial way to store the metadata at acquisition time.

Proposed solutions

I think this last default is fine. There are questions of compatibility for projects like Aeon that want multi-chunking of data and may have slightly different naming conventions for file layouts. This is fine because the standard folder structure is optional, i.e. it is always possible for projects and APIs to pass the data file path directly, so I don't think we need to worry too much about this as long as there is a reasonable way forward.

Assuming there is nothing else missing, I think we are close to having a complete proposal for the data logging spec that we could port into the issue description above and discuss in the next Harp club meeting.