microsoft / yardl

Tooling for streaming instrument data
https://microsoft.github.io/yardl/
MIT License

Reading subsets of streams from files without iterating over everything #172

Open fdellekart opened 1 month ago

fdellekart commented 1 month ago

Hello,

I am investigating different toolboxes for PET listmode reconstruction and was trying out PyTomography, which supports PETSIRD as its input format, which in turn uses yardl for the model definition.

Specifically, I am looking into dynamic (frame-by-frame) reconstruction of the listmode data. To test PyTomography, I tried to limit the input to a subset covering a few seconds of acquisition time, so that I would not waste time loading and processing my full dataset (~10 GB, more than 1 h of acquisition time) before knowing that things actually work.

PyTomography does not currently support selecting concrete time intervals from listmode data with a longer acquisition time. It allows specifying timeblock IDs, but it still iterates over all the timeblocks and filters out the ones with the specified IDs, which is very inefficient (see here).

Therefore, I dug a bit deeper and tried to figure out whether I can adapt the toolbox so that it reads only certain timespans from the PETSIRD file. After seeing that the protocol readers use streams, I analyzed the binary structure of the protocols, hoping to find a way to calculate the size of the data and then seek to the position in the file where I'd like to start reading.

I found out that vector and array types store their length as the first part of the data in binary form. Therefore, AFAICT what I want to achieve isn't possible and I would have to iterate over all the timeblocks because I can't know the length of the event vectors stored inside them upfront.
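To make the constraint concrete, here is a minimal sketch of a length-prefixed layout. It is not yardl's actual binary encoding: the `uint32` count prefix, the fixed 8-byte event, and the helper names `write_blocks`, `offset_of_block` and `read_block` are all assumptions chosen for illustration. It shows that the offset of block `k` is only known after reading the length prefix of every block before it.

```python
import io
import struct

EVENT_SIZE = 8               # hypothetical: each event is two little-endian uint32 values
COUNT = struct.Struct("<I")  # hypothetical: uint32 event count in front of each block


def write_blocks(stream, blocks):
    """Write each block as an event count followed by its events."""
    for events in blocks:
        stream.write(COUNT.pack(len(events)))
        for a, b in events:
            stream.write(struct.pack("<II", a, b))


def offset_of_block(stream, index):
    """Return the byte offset of block `index`, counted from the stream start.

    Every preceding length prefix has to be read, because the size of a block
    is only known from its own prefix. The payloads are skipped with seek(),
    which only works here because events have a fixed size.
    """
    stream.seek(0)
    for _ in range(index):
        header = stream.read(COUNT.size)
        if len(header) < COUNT.size:
            raise IndexError("stream has fewer blocks than requested")
        (count,) = COUNT.unpack(header)
        stream.seek(count * EVENT_SIZE, io.SEEK_CUR)
    return stream.tell()


def read_block(stream, index):
    """Read block `index` as a list of (a, b) tuples."""
    stream.seek(offset_of_block(stream, index))
    (count,) = COUNT.unpack(stream.read(COUNT.size))
    payload = stream.read(count * EVENT_SIZE)
    return [struct.unpack_from("<II", payload, i * EVENT_SIZE) for i in range(count)]
```

In this toy layout the payloads can at least be skipped without decoding them. If the elements themselves were variable-length, even that shortcut would disappear and every element would have to be parsed.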

Is there another way of achieving what I am trying to do? Maybe I am missing or misunderstanding something; please let me know if that is the case :slightly_smiling_face:

If it really is not possible, IMO it would be a useful feature to consider. Being forced to read the full file when only part of it is of interest is cumbersome. It could also benefit dynamic reconstruction, as it would no longer be necessary to wait for all frames to load before the first one can be processed.

However, I am not yet very familiar with the scope of this project. For transmitting the same stream of bytes over a network, this would not make any sense, I guess, so let me know if it is simply not intended. :+1:

PETSIRD protocol definition can be found here.

Thanks in advance and best regards, Florian

naegelejd commented 2 weeks ago

Hi Florian,

You are correct that this is not yet possible in Yardl. However, it is one of the next features we plan to add.

Also, perhaps I misunderstand your meaning of "dynamic reconstruction", but you certainly don't have to wait to load a full dataset before processing it. Yardl enables processing your data as you read it from the stream (file, network, etc.). You can find examples of this for the MRD format here.
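As a rough illustration of processing data while it is being read, the sketch below groups time-ordered blocks into fixed-length frames with a generator. It does not use the Yardl or PETSIRD APIs: the `TimeBlock` tuple shape, the `frames` function and its `frame_ms` parameter are hypothetical, and any iterable of blocks stands in for the real reader.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

TimeBlock = Tuple[int, List[int]]  # hypothetical: (start time in ms, event ids)


def frames(blocks: Iterable[TimeBlock], frame_ms: int) -> Iterator[Tuple[int, List[int]]]:
    """Group time-ordered blocks into fixed-length frames, lazily.

    A frame is yielded as soon as a block belonging to a later frame arrives,
    so the caller can process frame 0 before frame 1 has been read.
    Frames that contain no blocks are not emitted.
    """
    current: Optional[int] = None
    events: List[int] = []
    for start_ms, block_events in blocks:
        index = start_ms // frame_ms
        if current is None:
            current = index
        if index != current:
            yield current, events
            current, events = index, []
        events.extend(block_events)
    if current is not None:
        yield current, events  # flush the last, possibly partial, frame
```

Calling `next()` on the generator returns the first frame after consuming only the blocks up to the first frame boundary. The stream is still read from its beginning, though, so this does not by itself allow jumping into the middle of an acquisition.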

fdellekart commented 3 days ago

Thank you for your reply.

As mentioned above, by "dynamic" I am referring to a frame-by-frame reconstruction. I am currently part of a neuroscience group which is using fPET, so we reconstruct consecutive frames of a few seconds of acquisition time each to investigate whether certain stimuli cause a response in the PET signal.

I was looking for a quick solution to get PyTomography working because I think it has an interesting implementation approach for reconstruction which I didn't want to disregard only because of some minor problems.

I am aware that I don't have to load the full dataset. However, in my case there is no activity in the first frames, so I wanted to jump somewhere into the middle of the acquisition. Continuing the stream from where the last frame ended would help, but it would require major changes inside PyTomography, which does not currently support this kind of reconstruction. Since I was looking for a quick fix, I have left it out of my analysis for the time being.

For this kind of dynamic reconstruction, it would be beneficial to jump somewhere into the stream to reconstruct single frames for debugging and evaluation, but it is surely an edge case that does not come up in many applications.

BR, Florian