sigmf / sigmf-python

Easily interact with Signal Metadata Format (SigMF) recordings.
https://sigmf.org
GNU Lesser General Public License v3.0

Make memory mapped behavior match read_samples #60

Open Teque5 opened 3 months ago

Teque5 commented 3 months ago

When reading samples from a signal, the current implementation is a bit quirky: memory-mapped access returns different values than read_samples IF the samples need to be scaled.

Consider the case where we read the sigmf logo from the main repository. This is a 2-channel real-valued audio file with samples stored as 16-bit integers.

>>> logo = sigmf.sigmffile.fromfile('sigmf_logo')

>>> logo.read_samples(count=3)
array([[-3.0517578e-05,  0.0000000e+00],
       [ 6.1035156e-05,  0.0000000e+00],
       [-6.1035156e-05,  0.0000000e+00]], dtype=float32)

>>> logo[0:3]
memmap([[-1,  0],
        [ 2,  0],
        [-2,  0]], dtype=int16)

This happens because read_samples applies the scale factor, but the memory map does not.
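The discrepancy can be reproduced without any file. This is a minimal sketch, assuming signed 16-bit samples are normalized by 2**15 (an assumption, though it matches the numbers printed above); the variable names are illustrative only.

```python
import numpy as np

# Raw 16-bit samples exactly as the memory map returns them above.
raw = np.array([[-1, 0], [2, 0], [-2, 0]], dtype=np.int16)

# Assumption: full scale for signed 16-bit integers is 2**15.
scale = 2 ** 15

# Convert to float32 first, then divide, so the result stays float32.
scaled = raw.astype(np.float32) / np.float32(scale)

print(raw)     # unscaled integers, like logo[0:3]
print(scaled)  # scaled floats, like logo.read_samples(count=3)
```

The two arrays hold the same samples; only the dtype and the scaling differ.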

I'm not sure what the best solution is, but I think we should fix #15 at the same time, since it will require tinkering with the same code.

Solutions I propose:
1) Leave as-is.
2) When accessing the memory map of a file that requires scaling, return a copy of the data instead (probably by using read_samples).
3) When accessing a memory map, return a scale parameter along with the data, or maybe emit a warning.
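Option 2 could look roughly like the following. This is only a sketch: ScaledAccessor is a hypothetical wrapper, not part of sigmf-python, and dividing by the full-scale magnitude of the integer type is an assumption about how scaling is done.

```python
import numpy as np


class ScaledAccessor:
    """Hypothetical wrapper that scales integer samples on item access."""

    def __init__(self, samples):
        # `samples` may be an np.memmap or any other ndarray.
        self._samples = samples

    def __getitem__(self, key):
        chunk = self._samples[key]
        if np.issubdtype(self._samples.dtype, np.signedinteger):
            # Assumption: normalize by the full-scale magnitude, e.g. 32768 for int16.
            full_scale = -float(np.iinfo(self._samples.dtype).min)
            # The dtype conversion makes a copy, so the caller no longer
            # holds a view into the underlying file.
            return np.asarray(chunk, dtype=np.float32) / np.float32(full_scale)
        # Float data needs no scaling; unsigned types are not handled in this sketch.
        return chunk


raw = np.array([[-1, 0], [2, 0], [-2, 0]], dtype=np.int16)
view = ScaledAccessor(raw)
first_three = view[0:3]
print(first_three)
```

The trade-off is that the returned array is a copy, so writing to it would not modify the file the way writing to a writable memory map does.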

Fixing #15 I believe requires using the offset kwarg of np.memmap.
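For reference, this is what the offset kwarg of np.memmap does: it skips a number of bytes at the start of the file before mapping. The sketch below is generic and makes no claim about what #15 actually needs; the 12-byte header and the file name are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up leading bytes that are not sample data.
header = b"HEADERBYTES!"
samples = np.array([[-1, 0], [2, 0], [-2, 0]], dtype=np.int16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    with open(path, "wb") as handle:
        handle.write(header)
        handle.write(samples.tobytes())

    # Map only the sample region by skipping len(header) bytes.
    mapped = np.memmap(path, dtype=np.int16, mode="r",
                       offset=len(header), shape=(3, 2))
    recovered = np.array(mapped)  # copy out before the file goes away
    del mapped                    # release the mapping

print(recovered)
```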

liambeguin commented 3 months ago

Hi @Teque5, I've run into the same kind of problem with sigmf archives... I was hoping #42 would fix this, but it did not.

On my end, the problem is that functions like read_samples_in_capture() assume we have a data_file to access so they can call things like os.path.getsize(). IMO it would be really nice to rework/consolidate SigMFFile.__init__, set_data_file(), and _read_datafile() so that they process user input (a file, a buffer, or any other type) into a single internal representation of the data (maybe _memmap?). Each accessor can then use that single "representation" and return whatever is needed.
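A rough sketch of that idea, to make it concrete. Both to_sample_array and sample_count are hypothetical names and not part of sigmf-python; the point is only that once every input is normalized to one array type, accessors can ask the array for its size instead of calling os.path.getsize().

```python
import os
import tempfile

import numpy as np


def to_sample_array(source, dtype=np.int16):
    """Hypothetical helper: turn a path, a buffer, or an array into one ndarray."""
    if isinstance(source, np.ndarray):
        return source
    if isinstance(source, (bytes, bytearray, memoryview)):
        return np.frombuffer(source, dtype=dtype)
    if isinstance(source, (str, os.PathLike)):
        return np.memmap(source, dtype=dtype, mode="r")
    raise TypeError(f"unsupported data source: {type(source)!r}")


def sample_count(samples):
    """Works for any source, with no file size lookup."""
    return samples.shape[0]


expected = np.array([-1, 2, -2], dtype=np.int16)

# Same data supplied as an in-memory buffer and as a file on disk.
from_bytes = to_sample_array(expected.tobytes())
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    expected.tofile(path)
    mapped = to_sample_array(path)
    from_file = np.array(mapped)  # copy out before the file goes away
    del mapped                    # release the mapping

print(sample_count(from_bytes), sample_count(from_file))
```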

This might also help support loading a non-conforming dataset? Let me know what you think, I don't have a lot of time to spare on this, but I could try to help out.