int-brain-lab / ONE

Open Neurophysiology Environment
MIT License

Provide an option to return a local file without computing md5 sum #99

Open · oliche opened this issue 11 months ago

oliche commented 11 months ago

For the spike sorting loader, we are interested in locating the raw binary file associated with the insertion if it is present on the disk, and returning None if it isn't.

Right now we've implemented it this way:

    @property
    def ap_file(self):
        """Gets path to AP .cbin file for this spike sorting session"""
        dsets = self.one.list_datasets(
            self.eid, filename='_spikeglx_*.ap.cbin', collection=f'raw_ephys_data/{self.pname}', details=True)
        files = self.one._check_filesystem(dsets, offline=True)
        if len(files) > 0:
            return files[0]
        else:
            return None

The issue is that this method computes the md5 sum of the file, which takes a long time. It also attempts to download the file in some cases.

Is there a better way to get the local file, if it exists, using ONE syntax? Maybe an optional compute_md5=True flag?

The relevant code is in the private _check_filesystem method: https://github.com/int-brain-lab/ONE/blob/50b1747d82d8900e1a061e3e88100b66f0ced601/one/api.py#L526
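
As a stop-gap, here is a minimal hash-free sketch that sidesteps _check_filesystem entirely, assuming the local session folder can be resolved with one.eid2path and the file located with a plain glob (the compute_md5 flag itself does not exist yet):

    from pathlib import Path

    @property
    def ap_file(self):
        """Path to the local AP .cbin file for this insertion, or None, without any md5 check."""
        # Resolve the local session folder from the cache (may be None if the session is unknown)
        session_path = self.one.eid2path(self.eid)
        if session_path is None:
            return None
        # Glob the raw ephys collection directly instead of going through _check_filesystem
        files = sorted(Path(session_path, 'raw_ephys_data', self.pname).glob('_spikeglx_*.ap.cbin'))
        return files[0] if files else None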

chris-langfield commented 11 months ago

Given the changes in ibl-neuropixel (the option to provide cbin, meta, and ch files individually to the Reader for compatibility on SDSC), maybe we want a function that will return all three? Here's an example of what I've been using:

def get_spikeglx_files(pid, one, band):
    """Return the local cbin, meta and ch files for a probe insertion and band ('ap' or 'lf')."""
    eid, probe = one.pid2eid(pid)
    files = []
    for suffix in ["cbin", "meta", "ch"]:
        # list the matching dataset records, then resolve them to local paths
        dsets = one.list_datasets(eid=eid, collection=f"raw_ephys_data/{probe}", filename=f"*.{band}.{suffix}", details=True)
        files.append(one._check_filesystem(dsets, offline=True)[0])
    return files

k1o0 commented 11 months ago

https://github.com/int-brain-lab/ibllib/commit/d92afb8659c422ebc419cf0f62f9f98f6279969d returns the AP file without a hash check. This could also have been solved by adding dsets['hash'] = None (list_datasets always returns a copy); however, the chosen solution avoids the use of private methods altogether.
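
For reference, a minimal sketch of that alternative (untested; it still relies on the private _check_filesystem method, and eid / pname stand in for the self.eid / self.pname of the snippet above):

# Blank out the hashes so _check_filesystem skips the md5 comparison;
# dsets is a copy, so the cache tables are untouched
dsets = one.list_datasets(eid, filename='_spikeglx_*.ap.cbin',
                          collection=f'raw_ephys_data/{pname}', details=True)
dsets['hash'] = None
files = one._check_filesystem(dsets, offline=True)
ap_file = files[0] if len(files) > 0 else None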

@chris-langfield likewise you can use record2path to get the file names for all spikeglx files:

def get_spikeglx_files(pid, one, band):
    eid, probe = one.pid2eid(pid)
    dsets = one.list_datasets(eid, filename=f'_spikeglx_*.{band}.*', collection=f'raw_ephys_data/{probe}', details=True)
    files = list(filter(Path.exists, map(one.record2path, map(lambda x: x[1], dsets.iterrows()))))
    # Sort the files and replace missing ones with None
    return [next((f for f in files if f.suffix == x), None) for x in ('.cbin', '.meta', '.ch')]

Make sure you pull the latest changes on the ONE uuidFilenames branch.
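
Hypothetical usage of the function above, where pid is a probe insertion UUID registered in Alyx; missing files come back as None:

# files are returned in (.cbin, .meta, .ch) order
cbin_file, meta_file, ch_file = get_spikeglx_files(pid, one, 'ap')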

chris-langfield commented 11 months ago

thanks @k1o0!