Speedup for large grids - mod gdf_pixels in create_raster_polgons

kerriegeil commented 2 years ago

In create_raster_polygons, the for loop that assigns individual polygons to gdf_pixels essentially renders xagg unusable for larger high res grids because it goes so slow. Here I propose elimination of the for loop and replacement with a lambda apply. Big improvement for large grids!

kerriegeil commented 2 years ago

It looks like the test failed because I've changed from Polygon to Polygon.from_bounds maybe. Let me change it back and see what happens

kerriegeil commented 2 years ago

Ok, I would think the tests should complete without error now. Are these something I can run?

ks905383 commented 2 years ago

Hi - yes, once I merge these and you're a contributor, you should be able to run these yourself. About to combine a bunch of different performance improvements (yours, and a few others on the aggregation side), so will block off a time to make sure everything interacts smoothly. (That's of course the point of the tests, but I want to be extra careful).

ks905383 commented 2 years ago

(actually you should be able to run them yourself now that you've contributed through the other PR)

ks905383 commented 2 years ago

Uh hm, this may actually be a problem with the test that's failing (test_create_raster_polygons_basic() in test_core.py)- specifically that the ds below has integer lat/lons and poly_test has float lat/lons. Could you run the tests locally on this branch, but replacing in the definition of ds the lat:(['lat'],np.array([0,1])) etc. calls with lat:(['lat'],np.array([0.,1.])) etc. ? If they pass, then I'll change that test up in the repo too and the PR should go through without a hitch.

# build raster polygons from a simple 2x2 grid of lat/lon pixels
ds = xr.Dataset({'lat_bnds':(['lat','bnds'],np.array([[-0.5,0.5],[0.5,1.5]])),
                 'lon_bnds':(['lon','bnds'],np.array([[-0.5,0.5],[0.5,1.5]]))},
                coords={'lat':(['lat'],np.array([0,1])),
                        'lon':(['lon'],np.array([0,1])),
                        'bnds':(['bnds'],np.array([0,1]))})
pix_agg = create_raster_polygons(ds)

# Create comparison geodataframe (what it should come out to)
poly_test = {'lat':[np.array(v) for v in [0.,0.,1.,1.]],'lon':[np.array(v) for v in [0.,1.,0.,1.]],
                    'geometry':[Polygon([(-0.5,-0.5),(-0.5,0.5),(0.5,0.5),(0.5,-0.5),(-0.5,-0.5)]),
                                  Polygon([(0.5, -0.5), (0.5, 0.5), (1.5, 0.5), (1.5, -0.5), (0.5, -0.5)]),
                                  Polygon([(-0.5, 0.5), (-0.5, 1.5), (0.5, 1.5), (0.5, 0.5), (-0.5, 0.5)]),
                                  Polygon([(0.5, 0.5), (0.5, 1.5), (1.5, 1.5), (1.5, 0.5), (0.5, 0.5)])],
                    'pix_idx':[0,1,2,3]}
poly_test = gpd.GeoDataFrame(poly_test,crs="EPSG:4326")

    # Use gpd assert (there's a check_less_precise option for "almost equal", 
    # but that doesn't seem necessary for this simple example)
    gpdt.assert_geodataframe_equal(poly_test,pix_agg['gdf_pixels'])

kerriegeil commented 2 years ago

Is there documentation somewhere on how to execute the tests? If so, I'd be more than happy to figure out how to run the tests locally with the mods to the test script.

ks905383 commented 2 years ago

Yes, absolutely.

You may have to install pytest into your current environment.

Once you've done that, you should be able to run the tests, since they were included when you forked the repository. To do so, just go to the top directory in your fork, and run python -m pytest tests/ in the terminal.

(the offending test from above is in tests/test_core.py)

Let me know if you run into any issues!

kerriegeil commented 2 years ago

Great, thanks! For some reason I assumed it was much more complicated than that. Easy enough.

Here are the potential fixes. Both pass all tests:

what you suggested for changing the int lat/lon coords to float in the ds in test_core.py AND ensuring that lat/lon in poly_test in test_core.py are dtype float (they are currently dtype object because lat/lon are sent as lists to gpd.GeoDataFrame) ds = xr.Dataset({... coords={'lat':(['lat'],np.array([0.,1.])), 'lon':(['lon'],np.array([0.,1.])), 'bnds':(['bnds'],np.array([0.,1.]))})

poly_test = {'lat':np.array([0.,0.,1.,1.]),'lon':np.array([0.,1.,0.,1.]), ...

or

leaving test_core.py unmodified and changing lat/lon in gdf_pixels in core.py to dtype object gdf_pixels['lat']=ds_bnds.lat.values.astype('object') gdf_pixels['lon']=ds_bnds.lon.values.astype('object')

2 feels a bit messy to me but it's up to you. I'll commit whichever you prefer

ks905383 commented 2 years ago

That's great! Thanks so much for tweaking that - honestly I'm a bit surprised that test hadn't come up as a problem before in some of my commits.

I think 1 may be more robust, so let's go with that one.

Thanks for your work on this! Super excited for the speed boost that's coming with v0.3.0

kerriegeil commented 2 years ago

No prob, will send in option 1 shortly!

ks905383 / xagg

Speedup for large grids - mod gdf_pixels in create_raster_polgons #30