matthew-brett / xibabel

Piloting a new image object for neuroimaging based on XArray
BSD 2-Clause "Simplified" License
6 stars 0 forks source link

Consider netCDF and maybe nibabel reads from URL #9

Closed matthew-brett closed 5 months ago

matthew-brett commented 7 months ago

As part of the work on the https://github.com/matthew-brett/xibabel/tree/refactor-json-netcdf branch, I was experimenting with different ways to read netCDF files from URLs.

The first option I considered was using the new mode=bytes option. This proved horrible for reading, as in:

[ins] In [1]: import xibabel as xib

[ins] In [2]: import xibabel.testing as xit

[ins] In [3]: ximg = xib.load(xit.JC_EG_ANAT)

[ins] In [4]: xib.save(ximg, 'jc_eg_func.nc')

[ins] In [7]: from netCDF4 import Dataset

[ins] In [18]: %timeit np.array(Dataset('jc_eg_func.nc')['sub-07_T1w'])
58.5 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[ins] In [19]: %timeit np.array(Dataset('file://./jc_eg_func.nc#mode=bytes')['sub-07_T1w'])
(no return after about 15 minutes).

I also observed seemingly random corruption of the read files using mode=bytes - but given the speed, and the unpredictability of that corruption, I won't try and nail that down here.

A possibly-related issue is this netCDF report.

Another option I considered as using fsspec. Investigations follow below.

matthew-brett commented 7 months ago

OK - fsspec looks much more plausible, along with h5netcdf.

First run a basic web server with Tornado:

from pathlib import Path
import asyncio

from tornado import web

def make_app():
    return web.Application([
        (r"/(.*)", web.StaticFileHandler, {"path": Path()})
    ])

async def main():
    app = make_app()
    app.listen(8899)
    shutdown_event = asyncio.Event()
    await shutdown_event.wait()

if __name__ == "__main__":
    asyncio.run(main())

Next:

import numpy as np

import xibabel as xib

import xibabel.testing as xit

ximg = xib.load(xit.JC_EG_ANAT)

xib.save(ximg, 'jc_eg_func.nc')

import h5netcdf

import xarray as xr

import fsspec

with fsspec.open('jc_eg_func.nc') as in_file:
    %timeit xr.open_dataset(in_file, engine='h5netcdf').compute()

with fsspec.open('file://./jc_eg_func.nc') as in_file:
    %timeit xr.open_dataset(in_file, engine='h5netcdf').compute()

port = 8899
with fsspec.open(f'http://localhost:{port}/jc_eg_func.nc') as in_file:
    %timeit xr.open_dataset(in_file, engine='h5netcdf').compute()

This gives:

32.5 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
32.3 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
182 ms ± 7.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That seems within the bounds of reason.

matthew-brett commented 5 months ago

This code now merged into #8