Wrong flow accumulation with efficiency grids in test_grid.py

mdbartos / pysheds

:earth_americas: Simple and fast watershed delineation in python.

GNU General Public License v3.0

708 stars 191 forks source link

Wrong flow accumulation with efficiency grids in test_grid.py #168

Closed itati01 closed 2 years ago

itati01 commented 2 years ago

Hi @mdbartos ,

I installed pysheds 0.3 and tested the flow accumulation with the two simple effiency grids: one cell is 0.25 and all others nodata. The grid extents differ. I used a modified version of test_grid.py to save the flow accumulation at the end of the function, and to test the clipping.

The efficiency value for D8 is located outside the catchment used for the clipping. Nonetheless, the D8 flow accumulation with efficiency is a) not clipped and b) contains many nodata and negative values. I glimpsed at major streams. They look like acc1 -acc1 acc2 -acc2. The results should be identical to the flow accumulation without the efficiency.

I suggest moving grid.clip_to(catch) before acc_d8_eff = grid.accumulation(fdir, dirmap=dirmap, efficiency=eff, routing='d8') to fix the first but not the second issue. By the way, acc_d8 is computed twice.

The clipped Dinf outputs are visually plausible. However, without the clipping, the pattern outside the extent of the efficiency raster is also wrong, and values can be extremely high. What about rejecting inconsistent extents, or assuming no weighting for grid cells outside the efficiency extent? Or do I miss something?

Best Andreas

mdbartos commented 2 years ago

Hi Andreas,

Thank you for checking into this. In v0.3, the behavior is to automatically 'view' the fdir, weights and efficiency rasters at the same spatial extent:

https://github.com/mdbartos/pysheds/blob/9d8896080fdf9d9af88a57f203177ba1a365946c/pysheds/sgrid.py#L786-L794

Where _input_handler essentially just calls self.view.

Could you post the code you used to produce the bug?

itati01 commented 2 years ago

Sure. In line 9, I changed __file__ to a string because I used Spyder, i.e. current_dir = os.path.dirname(os.path.realpath("repos/pysheds_devel/pysheds/tests/test"))

I only slightly modified test_accumulation (without my instead of the original comments here):

def test_accumulation(use_clip=True):
    fdir = d.fdir
    eff = d.eff
    # condition to turn off clipping
    if use_clip:
        catch = d.catch
    fdir_d8 = d.fdir_d8
    fdir_dinf = d.fdir_dinf
    # condition to turn off clipping
    if use_clip:
        grid.clip_to(fdir)
    acc = grid.accumulation(fdir, dirmap=dirmap, routing='d8')
    assert(acc.max() == acc_in_frame)
    eff = grid.view(eff)
    eff[eff == eff.nodata] = 1
    # should be below grid.clip_to(catch)
    acc_d8_eff = grid.accumulation(fdir, dirmap=dirmap,
                                   efficiency=eff, routing='d8')
    # condition to turn off clipping
    if use_clip:    
        grid.clip_to(catch)
    c, r = grid.nearest_cell(x, y)
    acc_d8 = grid.accumulation(fdir, dirmap=dirmap, routing='d8')
    assert(acc_d8[r, c] == cells_in_catch)
    # what is the reason to repeat the calculation?
    acc_d8 = grid.accumulation(fdir_d8, dirmap=dirmap, routing='d8')
    assert(acc_d8.max() > 11300)
    acc_dinf = grid.accumulation(fdir_dinf, dirmap=dirmap, routing='dinf')
    assert(acc_dinf.max() > 11300)
    eff = grid.view(dinf_eff)
    eff[eff==dinf_eff.nodata] = 1
    acc_dinf_eff = grid.accumulation(fdir_dinf, dirmap=dirmap,
                                     routing='dinf', efficiency=eff)
    d.acc = acc
    # save files for comparison in QGIS
    d.acc_d8 = acc_d8
    d.acc_dinf = acc_dinf
    d.acc_d8_eff = acc_d8_eff
    d.acc_dinf_eff = acc_dinf_eff
    Grid.from_raster(d.acc_d8).to_raster(d.acc_d8, os.path.join(data_dir, 'test_d8_acc__.tif'))
    Grid.from_raster(d.acc_dinf).to_raster(d.acc_dinf, os.path.join(data_dir, 'test_dinf_acc__.tif'))
    Grid.from_raster(d.acc_d8_eff).to_raster(d.acc_d8_eff, os.path.join(data_dir, 'test_d8_acc_eff__.tif'))
    Grid.from_raster(d.acc_dinf_eff).to_raster(d.acc_dinf_eff, os.path.join(data_dir, 'test_dinf_acc_eff__.tif'))

Finally, I simply call the test functions to calculate the flow direction and accumulation.

# change to True to turn on the clipping
use_clip = False
if use_clip:
    test_catchment()
    test_clip()
test_fill_depressions()
test_resolve_flats()
test_flowdir()
test_dinf_flowdir()
if use_clip:
    test_clip_pad()
test_accumulation(use_clip=use_clip)

The problem with Dinf and use_clip = False is that the extent of diff_eff.tif is too small (for the view): d.dinf_eff.shape is (161, 171) and d.eff.shape is (359, 367) which is identical to grid.shape.

mdbartos commented 2 years ago

On the numba branch I'm finishing up the process of converting recursive functions to their iterative counterparts. Right now, accumulation + efficiency is throwing segfaults, which probably indicates that it's trying to access memory that lies out-of-bounds.

I have to take care of some other stuff over the next couple of days. If you figure out the problem, let me know 👍

itati01 commented 2 years ago

No solution for the D8 issue but a suspicion. d.eff.nodata and d.dinf_eff.nodata are both -1. However, for some reason the flow accumulation seems to treat efficiency = -1 literally not as nodata = 1 (seemingly unlike Dinf). That's why we get sequences of +acc1 -acc2... and many -1's, 0's and 1's in between. For an example, try print(acc_d8_eff[300:320,60:80]) (or plt.imshow(acc_d8_eff[300:320,60:80], extent=grid.extent, zorder=2, cmap='cubehelix', norm=colors.LogNorm(1, acc_d8_eff.max()), interpolation='nearest'). The negative values are in white because of the normalisation.).

acc_d8_eff_subset

Skimming over the code I suspect that test_accumulation() is the problem. eff.nodata is 255 not -1, so nodata cells are not set to 1 but remain -1:

print(eff.nodata)    # -1.0
eff = grid.view(eff)
eff[eff == eff.nodata] = 1  # eff.nodata == 255

The grid.view(eff) seems to be the culprit. Initially, grid.view(d.eff).nodata is -32768. After test_flowdir(), it is 255, and after test_clip_pad() False. However, why is it again 255 in test_accumulation()?

print(grid.view(d.eff).nodata) # -32768
test_constructors()
test_dtype()
test_nearest_cell()
use_clip = True
print(grid.view(d.eff).nodata)
if use_clip:
    test_catchment()
    test_clip()
    print(grid.view(d.eff).nodata)
test_fill_depressions()
print(grid.view(d.eff).nodata)
test_resolve_flats()
print(grid.view(d.eff).nodata)
test_flowdir()
print(grid.view(d.eff).nodata)  # 255
test_dinf_flowdir()
print(grid.view(d.eff).nodata)
if use_clip:
    test_clip_pad()
    print(grid.view(d.eff).nodata)  # False
test_accumulation(use_clip=use_clip)
print(grid.view(d.eff).nodata)

So, probably the nodata value should be taken from the raster, not the view? By the way, I would suggest to handle nodata values in the accumulation function, similar to fdir[nodata_cells] = 0 in _d8_accumulation().

mdbartos commented 2 years ago

A couple notes:

For d-infinity routing, nodata cells get handled when angles are converted to D8 directions under _angle_to_d8_numba:

https://github.com/mdbartos/pysheds/blob/9d8896080fdf9d9af88a57f203177ba1a365946c/pysheds/_sgrid.py#L220-L226

The reason that the nodata value is being converted to 255 is because that is the nodata value of the target viewfinder---which in this case is the viewfinder of the grid. See the View.view method:

https://github.com/mdbartos/pysheds/blob/9d8896080fdf9d9af88a57f203177ba1a365946c/pysheds/sview.py#L445

You can override the nodata value in the grid.view method by passing it in as a keyword argument (this also applies to other parameters of the viewfinder as well):

eff_view = grid.view(eff, nodata=-1)

grid.accumulation performs this override automatically, so you do not need to do it outside the function:

https://github.com/mdbartos/pysheds/blob/9d8896080fdf9d9af88a57f203177ba1a365946c/pysheds/sgrid.py#L791-L794

You can fix part of the problem by removing the call to grid.view(eff) before the call to accumulation---this will get done anyway inside accumulation.
The fact that grid.nodata was set to the nodata value of fdir after calling grid.clip_to(fdir) was a surprise to me. In some ways, it makes sense, but I am not sure this is really the desired behavior.

mdbartos commented 2 years ago

Note that in v0.3, the max accumulation values are slightly higher, because the algorithm traverses all the way to the edges of the dataset (whereas previously the edges were not considered).

For the efficiency test, I'm now getting:

>>> acc_d8_eff.max()
Raster(77257.)

>>> acc_d8_eff[acc == acc.max()]
Raster([19318.25])

The latter of which is roughly 1/4 of the max accumulation.

Is there a way to get the efficiency test values from first principles? For instance, for the accumulation test, I now test that the accumulation at a given cell is equal to the number of nonzero cells in the catchment delineated at the same cell. Can a similar test be done for efficiency?

mdbartos commented 2 years ago

@itati01 : Based on your feedback, I've changed the default behavior of view. Using the new inherit_nodata keyword argument, view will now inherit the nodata value from the input data by default: https://github.com/mdbartos/pysheds/pull/170/commits/8eb61678b0535d1ddda7c5fc3f818cc0978edefb

Let me know if you can devise a new test for the transport efficiency. It would be best if the test can be derived from some known relation (as mentioned in my previous post).

itati01 commented 2 years ago

I will test your code change these days. I have two ideas for a test but did not test them.

set efficiency for all grid cells to eff>0, so acc_d8_eff = eff ** (cells_upstream) * acc_d8 (for D8 cells_upstream = acc_d8 - 1)

set efficiency for start cells to 0, the flow accumulation with efficiency at the outlet should be flow acc - number of start cells:

# flow accumulation within catchments is >= 1, outside catchment == 0
# number of start cells
start_cells = acc_d8[acc_d8==1].sum()
# efficiency is 0 for start cells within catchment and cells outside catchment, and 1 for all others
eff_ = (acc_d8>1].astype(np.uint8)
acc_d8_eff = grid.accumulation(fdir_d8, dirmap=dirmap, routing='d8', efficiency=eff_)
assert(acc_d8.max() - acc_d8_eff.max() == start_cells)

Both approaches should work with D8 but likely not Dinf where we have to consider the splitting of flow routes. e^n in the first case will quickly become pretty low, so I am not sure about an assertion for the current catchments. What do you think?

itati01 commented 2 years ago

Thanks for fixing this issue. All test images look plausible now (as long as the nodata in the efficiency grids is explicitly set to 1 in test_accumulation()). However, I suggest some changes in test_accumulation (the use_clip argument could be removed).

# use_clip == True is the default behaviour
def test_accumulation(use_clip=True):
    fdir = d.fdir
    eff = d.eff
    acc = grid.accumulation(fdir, dirmap=dirmap, routing='d8')
    # CHECK: I removed the assertion as it is not working after the code change
    # assert(acc.max() == acc_in_frame)

    # new: the clipping should be done here
    # to ensure that the D8 accumulation with efficiency is clipped to the catchment boundaries
    if use_clip:
        # probably not needed
        grid.clip_to(fdir)
        catch = d.catch
        grid.clip_to(catch)
    # I moved this below the clipping and
    # changed d.xxx to grid.view(d.xxx) to get the clipped flow directions
    fdir_d8 = grid.view(d.fdir_d8)
    fdir_dinf = grid.view(d.fdir_dinf)
    # nodata has to be set to 1
    # should be done in grid.accumulation()
    eff = grid.view(eff)
    eff[eff == eff.nodata] = 1
    # new assertion: identical shapes of the arrays
    assert(fdir_d8.shape==fdir_dinf.shape==eff.shape)
    # new: changed fdir to (clipped) fdir_d8
    acc_d8_eff = grid.accumulation(fdir_d8, dirmap=dirmap,
                                   efficiency=eff, routing='d8')
    # original code without comments
    c, r = grid.nearest_cell(x, y)
    acc_d8 = grid.accumulation(fdir, dirmap=dirmap, routing='d8')
    assert(acc_d8[r, c] == cells_in_catch)
    acc_d8 = grid.accumulation(fdir_d8, dirmap=dirmap, routing='d8')
    assert(acc_d8.max() > 11300)
    acc_dinf = grid.accumulation(fdir_dinf, dirmap=dirmap, routing='dinf')
    assert(acc_dinf.max() > 11300)
    # again, we need to explicitly set nodata to 1
    eff = grid.view(dinf_eff)
    eff[eff==dinf_eff.nodata] = 1
    acc_dinf_eff = grid.accumulation(fdir_dinf, dirmap=dirmap,
                                     routing='dinf', efficiency=eff)
    d.acc = acc
    # store rasters in d...

The D8 efficiency == 0.25 is outside the test catchment, so with use_clip==True, acc_d8 == acc_d8_eff. The latter is correctly clipped to the catchment. With use_clip==False, the efficiency is correctly reducing the flow accumulation. Note: The proposed code change affects the flow direction. As a consequence, the D8 efficiency == 0.25 (d.eff[38, 365] is just off the main stream (at [37, 365]), so its effect on the flow accumulation is very small.

itati01 commented 2 years ago

I had a look into the relative shift of one grid cell between the efficiency and the flow direction grids in the test case. My proposed code above uses d.fdir_d8 for the flow accumulation instead of d.fdir. The flow directions are shifted which explains the unexpected relative shift of the efficiency grid to the flow direction. While d.fdir is read from an external file, d.fdir_d8 (and d.fdir_dinf) are derived from the DEM. I expected that the D8 files are identical but they are not. This seems to be an issue in the calculation of the flow direction. Or is it a wrong return value (the 0's are probably needed to avoid index errors) or is it an issue in the test script? Note that d.fdir_dinf has a border of nan's.

At a first glance, fdir_d8 is like fdir but with surrounded 0's. The first rows and columns are shifted by +1 (e.g. first column 218... below corresponds to the second column 021... above). However, the last ones show interesting deviations. The first rows in the second to last column above (012) correspond to the last column below (112). So, the original second to last column vanished (242...). Sorry for the incorrectly formatted arrays. It would be good to check for e.g. deviating streams.

# to be executed within test_accumulation()
print(fdir_d8.shape, d.fdir_d8.shape, fdir_d8, d.fdir)
(359, 367) (359, 367)

 [[  0    0    0 ...    0     0   0]
  [  0    2    2 ...    4     1   0]
  [  0    1    2 ...    4     2   0]
  ...
 [  0  64  32 ...    8     1   0]
 [  0  64  32 ...  16 128   0]
 [  0    0    0 ...    0     0   0]] 

 [[   2    2    4 ...    4   2     1]
 [   1    2    2 ...    4   4     1]
 [   8    1    2 ...    4   2     2]
 ...
 [  64  64  32 ...    8   1 128]
 [  64  64  32 ...  16   2     1]
 [128  64  64 ...  16   4     8]]

mdbartos commented 2 years ago

The zeros along the edges are there because there is no way to know the flow direction for certain in these cases (same goes for the nans in dinf). In older versions, I allowed the flow directions for these edge cells to be computed based on the known neighbor elevations. However, I eventually decided that leaving them unknown is probably the 'correct' behavior. I believe ArcGIS has an option to force edge cells to point outwards, which I could add as an option.

Note that the D8 flow directions for a given DEM are not unique. Say that a cell is surrounded by lower cells that are all at the same elevation. Pysheds will effectively select the 'first' of these lowest cells in determining the flow direction:

https://github.com/mdbartos/pysheds/blob/9d8896080fdf9d9af88a57f203177ba1a365946c/pysheds/_sgrid.py#L23-L32

I am not totally sure about the provenance of the dir.asc file. I do not know whether it was generated from the DEM, or whether it was taken from the HydroSHEDS flow direction collection. The main purpose is to test reading from an ascii text formatted grid. It would be better to single-source these tests though (e.g. have all the tests based on the DEM file).

itati01 commented 2 years ago

Thanks for the explanation. If fdir.asc was calculated differently, then deviations are understandable. The zeros along the edges are clear from the linked code. So, it could be just a coincidence with the "shift" I suspected. Maybe it would be good to modify the efficiency grid, i.e. apply an efficiency to a grid cell with a large flow accumulation? Just to make its impact stronger.

mdbartos commented 2 years ago

Hi @itati01, let me know if there is anything you need from me on this issue. Feel free to submit a PR if you want to improve/fix the testing of accumulation efficiency---you probably understand it much better than I do.

Note that I just released v0.3.1 yesterday, so take care to update if submitting a PR.

itati01 commented 2 years ago

Hi @mdbartos I just submitted a PR. By the way, pytest gave a few deprecation warnings for which I will open an issue. Shall we close this comment?

mdbartos commented 2 years ago

Will do.