intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License
76 stars 36 forks source link

Logic for remote versus local filepaths for RasterIOSource + GitHub Actions CI #82

Closed scottyhq closed 4 years ago

scottyhq commented 4 years ago

Working towards a fix for #81.

@martindurant and @d70-t , I'm pretty new to setting up test servers, so feel free to take over this PR. But I felt like I made a bit of progress today and wanted to open this up.

Currently my test_read_remote_tif function fails with E rasterio.errors.RasterioIOError: Range downloading not supported by this server!, which I'm not sure how to deal with. This seems to be a local setup issue because reading from github works fine:

with rasterio.open('https://raw.githubusercontent.com/intake/intake-xarray/master/intake_xarray/tests/data/RGB.byte.tif') as src:
    print(src.profile)
# {'driver': 'GTiff', 'dtype': 'uint8', 'nodata': 0.0, 'width': 791, 'height': 718, 'count': 3, 'crs': CRS.from_epsg(32618), 'transform': Affine(300.0379266750948, 0.0, 101985.0,
#       0.0, -300.041782729805, 2826915.0), 'tiled': False, 'interleave': 'pixel'}
d70-t commented 4 years ago

Thats most likely because the standard libraries http.server.SimpleHTTPRequestHandler doesn't support HTTP Range Requests which of course are extremely useful if you like to access only a subset of a remote file. There are two possible solutions: teach the server to handle range-requests or teach rasterio not to require them. I think both options are reasonable.

There are a couple of implementations for range requests on pythons SimpleHttpServer e.g in this gist or as a papckage. I did not test any of the two but had a quick look over the source, the implementation seems to be reasonable for a simple test-server. One thing which is missing seems to be the support for multipart ranges which would be a reasonable choice for requesting twodimensional subsets but I don't know if rasterio makes use of this feature.

On the other hand, maybe rasterio should fall back to use non-range-requests if the server doesn't support them. While this would potentially result in larger transfer sizes, that may still be better than failing for such servers.

scottyhq commented 4 years ago

Thanks for the info @d70-t . It looks like the test code in fsspec that @martindurant linked to also addesses the Range Request support (https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/tests/test_http.py#L39).

Ultimately, rasterio is using GDAL behind the scenes for reading and many environment variables exist to control behavior. https://gdal.org/user/virtual_file_systems.html, https://trac.osgeo.org/gdal/wiki/ConfigOptions, more info here https://github.com/OSGeo/gdal/issues/1295.

As an aside, it looks like CI is not running on this PR. If it appeals I could add the GitHub Actions setup we're using in Intake-stac to run pytest on linux? https://github.com/intake/intake-stac/blob/5d3af3abffcfbe409361da7e105aa69838cef2ff/.github/workflows/main.yaml

martindurant commented 4 years ago

Happy to have Actions run the tests, I have no idea why Travis has ceased. I tried to reactivate it now, and it says "There was an error while trying to activate the repository." (very useful).

martindurant commented 4 years ago

Note that can_be_local is in fsspec/master - it has not been released yet. Do you need a release to continue here?

scottyhq commented 4 years ago

That's ok, just going off the 37-upstream environment: https://github.com/intake/intake-xarray/pull/82/checks?check_run_id=1233000515 .

Not sure why test_discover[netcdf][zarr] are failing with E KeyError: 'datashape' ... those pass for me locally

I'm about to add a fix for allowing Range requests on intake_xarray/tests/test_remote.py::test_read_remote_tif

martindurant commented 4 years ago

datashape has been discontinued from Intake, because it was never used for anything - just remove it from the test.

scottyhq commented 4 years ago

OK! @martindurant - I couldn't figure out the context-manager approach for this data server setup (tests kept hanging and/or failing, seems like the server wasn't staying open or wasn't handling requests)

but I combined the fixture already in place for testing intake-server with @d70-t's suggestion of the RangeHTTPServer package on pypi. So the upstream test is now working!

Any additional changes and/or tests needed? One thing I can think of is to ensure the memory footprint is less than the local file size to verify lazy loading / just reading metadata.

I'd prefer to do any necessary changes to netcdf.py in a separate PR.

After merging this it would be great to release intake-xarray 0.3.3 and fsspec 0.8.4 since the raster.py now uses can_be_local and setup.py has fsspec>0.8.3

martindurant commented 4 years ago

verify lazy loading / just reading metadata

If you can think of a way, sure. I'm not sure how... You could read a file which has no real data and would error?

fsspec 0.8.4

Be sure to ping me early next week

scottyhq commented 4 years ago

Be sure to ping me early next week

Ping! A release would be great :)

I'll think about adding one additional test and making sure CI passes for this PR. I'm wondering if it's worth modifying the existing CI environments too? Specifically, instead of '36-build, 36-defaults-build, 37-nodefaults-build' how about:

- py36
- py37
- py38
- upstream 

with all dependencies unpinned?

martindurant commented 4 years ago

^ yes totally agree with that test matrix

martindurant commented 4 years ago

fsspec release is on pypi, v0.8.4

scottyhq commented 4 years ago

thanks @martindurant ! since i'm not a repo maintainer, i can't re-trigger the CI here https://github.com/intake/intake-xarray/actions/runs/303467233. Currently the CI environments (except for upstream) install fsspec from conda-forge, so we'll have to wait for it to become available there.

martindurant commented 4 years ago

conda-forge PR is merged, package should be available soon.

martindurant commented 4 years ago

@scottyhq , would you like to be a maintainer?

scottyhq commented 4 years ago

@scottyhq , would you like to be a maintainer?

If you need more hands on deck, feel free to add me, but I can't promise a lot of time commitment. My hands are pretty full as it is ;)

martindurant commented 4 years ago

The same for us all

martindurant commented 4 years ago

The package is ready and you have rights. You can merge this once things are passing.

scottyhq commented 4 years ago

OK, I think this is ready to go! Is is simple merge preferred here? or squash and merge?

It would be nice to do a minor release after merging. @martindurant - if you want I can add the pypi publish workflow as well, but that would require you to add pypi username info under the repository secrets: https://github.com/intake/intake-stac/blob/master/.github/workflows/pypipublish.yaml

martindurant commented 4 years ago

I can do the release. I think simple merge is fine for this repo.

scottyhq commented 4 years ago

I can do the release. I think simple merge is fine for this repo.

thanks again @martindurant and @d70-t for the guidance. Once intake-xarray 0.3.3 is released I'll be able to merge https://github.com/intake/intake-stac/pull/61 and get a new version of intake-stac out that takes advantage of these changes.

martindurant commented 4 years ago

intake-xarray release merged and building on conda-forge now

martindurant commented 4 years ago

I seem to have called it 0.4.0