pysal / momepy

Urban Morphology Measuring Toolkit
https://docs.momepy.org
BSD 3-Clause "New" or "Revised" License

BUG: Tessellation/buildings mismatch, MultiPolygon elements, and related spatial_weights inconsistency #130

Closed thomleysens closed 4 years ago

thomleysens commented 4 years ago
Python 3.7
Anaconda virtual env
momepy v0.1 (*via conda install*)
Ubuntu 18.04

Hi,

Thanks for your great library and its illustrated documentation.

I get these UserWarnings when I run momepy.Tessellation on a GeoDataFrame with buildings from OpenStreetMap:

path/to/momepy/elements.py:424: UserWarning: Tessellation does not fully match buildings. 1 element(s) collapsed during generation - unique_id: {57688}
  "during generation - unique_id: {i}".format(len=len(diff), i=diff)
path/to/momepy/elements.py:434: UserWarning: Tessellation contains MultiPolygon elements. Initial objects should be edited. unique_id of affected elements: [18549, 23745, 23862, 27326, 27938, 30778, 39740, 44778, 51021, 51645, 55799, 56419, 56887, 57246, 57325, 57332, 57333, 58512, 58629, 58970, 59236, 59993]
  "unique_id of affected elements: {}".format(list(uids))

I understand what's wrong but I don't understand what I must change on the initial objects. What are the initial objects? The buildings within the input GeoDataFrame? It doesn't seem to contain MultiPolygons when I check the types of the geometries.

For now, I found a workaround by adding a try/except in NeighborDistance in momepy.distribution, in order to get momepy.NeighborDistance working.

    def __init__(self, gdf, spatial_weights, unique_id):
        self.gdf = gdf
        self.sw = spatial_weights
        self.id = gdf[unique_id]
        # define empty list for results
        results_list = []

        data = gdf.set_index(unique_id)

        # iterating over rows one by one
        ##TODO: REMOVE THE ADDED TRY EXCEPT TO AVOID ERRORS
        for index, row in tqdm(data.iterrows(), total=data.shape[0]):
            try:
                neighbours = spatial_weights.neighbors[index]
                building_neighbours = data.loc[neighbours]
                if len(building_neighbours) > 0:
                    results_list.append(
                        np.mean(building_neighbours.geometry.distance(row["geometry"]))
                    )
                else:
                    results_list.append(0)
            except:
                results_list.append(0)

        self.series = pd.Series(results_list, index=gdf.index)

Do you have any idea what I could change/modify to get NeighborDistance working without this workaround?

martinfleis commented 4 years ago

Hi @thomleysens,

The first UserWarning raised during Tessellation says that 1 element(s) collapsed during generation, which means that the building with unique_id 57688 is so small that during the inward offset it collapses into itself, generating an empty geometry.
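That collapse can be reproduced with plain shapely; a minimal sketch with a made-up footprint and offset (momepy's actual inward offset value may differ):

```python
from shapely.geometry import Polygon

# a made-up 0.3 m x 0.3 m footprint in projected coordinates
tiny = Polygon([(0, 0), (0.3, 0), (0.3, 0.3), (0, 0.3)])

# an inward offset larger than half the footprint's width erodes it completely
shrunk = tiny.buffer(-0.25)
print(shrunk.is_empty)  # True
```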

The second UserWarning tells you that the tessellation (not the buildings) contains MultiPolygons. That is very often caused by overlapping building polygons or other kinds of imprecision, which is quite common in OSM. The list of IDs matches building unique_id, so you can check which buildings are causing it and edit them manually.
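Overlapping footprints can be spotted before running Tessellation with plain shapely; a minimal brute-force sketch on made-up geometries (fine for small samples, a spatial index would be faster at scale):

```python
from shapely.geometry import Polygon

# made-up footprints: a and b partially overlap, c is detached
a = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
b = Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])
c = Polygon([(10, 0), (11, 0), (11, 1), (10, 1)])

footprints = [a, b, c]
overlapping_pairs = [
    (i, j)
    for i in range(len(footprints))
    for j in range(i + 1, len(footprints))
    if footprints[i].overlaps(footprints[j])
]
print(overlapping_pairs)  # [(0, 1)]
```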

What are the initial objects ? The buildings within the input GeoDataFrame?

Yes, buildings. The most convenient way of checking it is saving gdf with unique_id and opening it in QGIS (or similar). You generally need to edit the geometry manually.
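A sketch of pulling out just the affected buildings for inspection (plain pandas stand-in; the uID values are taken from the warning above, the table itself is made up; with geopandas, .to_file("affected.gpkg") on the subset would export it for QGIS):

```python
import pandas as pd

# first few IDs from the MultiPolygon warning
affected = [18549, 23745, 23862]

# stand-in for the buildings attribute table keyed by uID
bldgs = pd.DataFrame({"uID": [18549, 20000, 23745, 30000]})
subset = bldgs[bldgs["uID"].isin(affected)]
print(list(subset["uID"]))  # [18549, 23745]
```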

Can you tell me the error momepy.NeighborDistance is raising?

Ideally, if you can provide a reproducible example so I can debug it, that would be great.

martinfleis commented 4 years ago

@thomleysens I think I know where the issue comes from in momepy.NeighborDistance. Can you show how you generated spatial_weights?

thomleysens commented 4 years ago

Thanks for your explanations.

This is how I proceed to get the buildings, clean them, and write them. Here is the input geojson file: study_field.zip

import geopandas as gpd
import osmnx as ox
from shapely.geometry import Polygon, MultiPolygon

study_area = gpd.read_file("./data/study_field.geojson").to_crs(epsg=4326)
bldgs_osm = ox.footprints.footprints_from_polygon(
    study_area["geometry"][0]
)

#Keep only a few columns
bldgs_osm = bldgs_osm[["name", "geometry"]]

#Delete non valid geometries
bldgs_osm["valid"] = bldgs_osm["geometry"].is_valid
bldgs_osm = bldgs_osm.loc[bldgs_osm["valid"] == True]
bldgs_osm["empty"] = bldgs_osm["geometry"].is_empty
bldgs_osm = bldgs_osm.loc[bldgs_osm["empty"] == False]

#Drop the helper "valid" and "empty" columns
bldgs_osm.drop(["valid","empty"], inplace=True, axis=1)
bldgs_osm.to_file("./data/buildings_osm.gpkg", layer="buildings", driver="GPKG")
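As a side note, invalid geometries can sometimes be repaired instead of dropped, e.g. with shapely's buffer(0) trick; a sketch on a made-up self-intersecting polygon (note the repair may discard part of the shape in degenerate cases):

```python
from shapely.geometry import Polygon

# a self-intersecting "bowtie", invalid by OGC rules
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])
print(bowtie.is_valid)  # False

# buffer(0) rebuilds the geometry into a valid one
repaired = bowtie.buffer(0)
print(repaired.is_valid)  # True
```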

Then I run the preprocess

import momepy

bldgs_osm = gpd.read_file("./data/buildings_osm.gpkg", layer="buildings")
bldgs_osm_projected = ox.project_gdf(bldgs_osm)  # seems to be needed to avoid errors in the tessellation part (see doc)
bldgs = momepy.preprocess(
    bldgs_osm_projected, 
    size=30,
    compactness=True, 
    islands=True
)

Then the tessellation

bldgs['uID'] = momepy.unique_id(bldgs)
limit = momepy.buffered_limit(bldgs)
tessellation = momepy.Tessellation(
    bldgs, 
    unique_id='uID', 
    limit=limit
).tessellation

And finally the spatial weights

sw1 = momepy.sw_high(k=1, gdf=tessellation, ids='uID')
bldgs['neighbour_dist'] = momepy.NeighborDistance(bldgs, sw1, 'uID').series

@martinfleis Feel free to ask for more explanations if this is not clear enough. Maybe I don't correctly understand part of the process.

martinfleis commented 4 years ago

Thanks! It is all clear. Regarding tessellation, I was right. Warnings are caused by incorrect geometry (overlaps), see the image below.

(screenshot from 2019-12-13 showing overlapping building polygons)

But that should not affect momepy.NeighborDistance. Your dataset is big, so it is not easy to quickly check the error. Can you paste the error thrown by NeighborDistance? (btw, I just realised that in the case of no neighbours, the distance should be NaN rather than 0)

martinfleis commented 4 years ago

@thomleysens I think I figured it out. Because sw1 is generated from the tessellation, which does not match the buildings (due to the errors above), some buildings are not included in sw1, and that throws an error. Your try/except approach is probably good in this case and it is worth a fix. The same situation will happen for a couple of other functions as well. But the actual error thrown would help anyway.
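The failure mode can be sketched with a plain dict standing in for spatial_weights.neighbors (the IDs are made up):

```python
# hypothetical stand-in for spatial_weights.neighbors, a dict keyed by uID
neighbors = {1: [2, 3], 2: [1], 3: [1]}

# uID 4 collapsed during tessellation, so it never made it into the weights
building_ids = [1, 2, 3, 4]
missing = [i for i in building_ids if i not in neighbors]
print(missing)  # [4]
# looking up neighbors[4] would raise KeyError, matching the reported failure
```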

martinfleis commented 4 years ago

@thomleysens try this, I'll make a bug fix today or during the weekend if it solves the issue.

    def __init__(self, gdf, spatial_weights, unique_id):
        self.gdf = gdf
        self.sw = spatial_weights
        self.id = gdf[unique_id]
        # define empty list for results
        results_list = []

        data = gdf.set_index(unique_id)

        # iterating over rows one by one
        for index, row in tqdm(data.iterrows(), total=data.shape[0]):
            if index in spatial_weights.neighbors.keys():
                neighbours = spatial_weights.neighbors[index]
                building_neighbours = data.loc[neighbours]
                if len(building_neighbours) > 0:
                    results_list.append(
                        np.mean(building_neighbours.geometry.distance(row["geometry"]))
                    )
                else:
                    results_list.append(np.nan)
            else:
                results_list.append(np.nan)

        self.series = pd.Series(results_list, index=gdf.index)

thomleysens commented 4 years ago

Sorry for the big dataset. I should have thought about checking the geometries with QGIS, sorry for that. Unfortunately I didn't keep the error message, but it was a DataFrame KeyError, which corresponds to your analysis: "Because sw1 is generated from tessellation which does not match buildings (due to errors above), it happens that some buildings are not included in sw1 and that throws an error".

I'm gonna try your fix, thanks a lot. It could take about 20-30 minutes because my computer is pretty slow.

thomleysens commented 4 years ago

Hi,

I ran NeighborDistance with my fix, then with your fix, and compared the two buildings dataframes updated with ["neighbour_dist"]. They are identical.

But when I want to plot the buildings dataframe from your fix (with code below) ...

f, ax = plt.subplots(figsize=(15, 15))
bldgs.plot(
    ax=ax, 
    column='neighbour_dist', 
    scheme='naturalbreaks', 
    k=5, 
    legend=True, 
    cmap='OrRd_r'
)

I had to drop the NaN values in bldgs["neighbour_dist"] to avoid the error below (it doesn't bother me and it is correct behavior, but I think it's worth mentioning, especially for your great illustrated examples):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-3bcd232fce97> in <module>
     10     k=5,
     11     legend=True,
---> 12     cmap='OrRd_r'
     13 )
     14 ax.set_facecolor("black")

~/anaconda3/envs/space/lib/python3.7/site-packages/geopandas/geodataframe.py in plot(self, *args, **kwargs)
    604         from there.
    605         """
--> 606         return plot_dataframe(self, *args, **kwargs)
    607 
    608     plot.__doc__ = plot_dataframe.__doc__

~/anaconda3/envs/space/lib/python3.7/site-packages/geopandas/plotting.py in plot_dataframe(df, column, cmap, color, ax, cax, categorical, legend, scheme, k, vmin, vmax, markersize, figsize, legend_kwds, classification_kwds, **style_kwds)
    532             classification_kwds["k"] = k
    533 
--> 534         binning = _mapclassify_choro(values, scheme, **classification_kwds)
    535         # set categorical to True for creating the legend
    536         categorical = True

~/anaconda3/envs/space/lib/python3.7/site-packages/geopandas/plotting.py in _mapclassify_choro(values, scheme, **classification_kwds)
    719             del classification_kwds["k"]
    720     try:
--> 721         binning = scheme_class(values, **classification_kwds)
    722     except TypeError:
    723         raise TypeError("Invalid keyword argument for %r " % scheme)

~/anaconda3/envs/space/lib/python3.7/site-packages/mapclassify/classifiers.py in __init__(self, y, k, initial)
   1652         self.k = k
   1653         self.init = initial
-> 1654         MapClassifier.__init__(self, y)
   1655         self.name = "NaturalBreaks"
   1656 

~/anaconda3/envs/space/lib/python3.7/site-packages/mapclassify/classifiers.py in __init__(self, y)
    539         self.name = "Map Classifier"
    540         self.y = y
--> 541         self._classify()
    542         self._summary()
    543 

~/anaconda3/envs/space/lib/python3.7/site-packages/mapclassify/classifiers.py in _classify(self)
    550 
    551     def _classify(self):
--> 552         self._set_bins()
    553         self.yb, self.counts = bin1d(self.y, self.bins)
    554 

~/anaconda3/envs/space/lib/python3.7/site-packages/mapclassify/classifiers.py in _set_bins(self)
   1673             self.k = k
   1674         else:
-> 1675             res0 = natural_breaks(x, k, init=self.init)
   1676             fit = res0[2]
   1677             self.bins = np.array(res0[-1])

~/anaconda3/envs/space/lib/python3.7/site-packages/mapclassify/classifiers.py in natural_breaks(values, k, init)
    399         Warn("Warning: setting k to %d" % uvk, UserWarning)
    400         k = uvk
--> 401     kres = _kmeans(values, k, n_init=init)
    402     sids = kres[-1]  # centroids
    403     fit = kres[-2]

~/anaconda3/envs/space/lib/python3.7/site-packages/mapclassify/classifiers.py in _kmeans(y, k, n_init)
    350     y = y * 1.0  # KMEANS needs float or double dtype
    351     y.shape = (-1, 1)
--> 352     result = KMEANS(n_clusters=k, init="k-means++", n_init=n_init).fit(y)
    353     class_ids = result.labels_
    354     centroids = result.cluster_centers_

~/anaconda3/envs/space/lib/python3.7/site-packages/sklearn/cluster/_k_means.py in fit(self, X, y, sample_weight)
    856         order = "C" if self.copy_x else None
    857         X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32],
--> 858                         order=order, copy=self.copy_x)
    859         # verify that the number of samples given is larger than k
    860         if _num_samples(X) < self.n_clusters:

~/anaconda3/envs/space/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    560         if force_all_finite:
    561             _assert_all_finite(array,
--> 562                                allow_nan=force_all_finite == 'allow-nan')
    563 
    564     if ensure_min_samples > 0:

~/anaconda3/envs/space/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     58                     msg_err.format
     59                     (type_err,
---> 60                      msg_dtype if msg_dtype is not None else X.dtype)
     61             )
     62     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
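
A sketch of filtering the NaN values before classification (the values are made up; with the real GeoDataFrame one would plot bldgs[bldgs["neighbour_dist"].notna()]):

```python
import numpy as np
import pandas as pd

# made-up neighbour_dist values; NaN marks buildings without neighbours
vals = pd.Series([12.5, np.nan, 3.1, np.nan, 7.0], name="neighbour_dist")

# mapclassify's natural breaks feeds sklearn KMeans, which rejects NaN,
# so drop them before classifying/plotting
clean = vals.dropna()
print(len(clean))  # 3
```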

thomleysens commented 4 years ago

Should I close this issue ?

martinfleis commented 4 years ago

Great! Thanks for the traceback. This is a known issue to be resolved in geopandas/mapclassify; we'll take care of it.

This issue will be closed by #131 automatically.

Thanks again!

thomleysens commented 4 years ago

Ok, great !

Thanks a lot for your quick fix and your explanations. And again, thanks for this great library.