plotly / plotly_express

Plotly Express - Simple syntax for complex charts. Now integrated into plotly.py!
https://plot.ly/python/plotly-express/
MIT License

Add Geopandas support #29

Closed mazzma12 closed 3 years ago

mazzma12 commented 5 years ago

Hi,

Thanks for your amazing work: many custom functions can now be deprecated and lots of keystrokes are saved. If I may make a feature request, it would be to support the geopandas API for the geo plots.

If you are not familiar with this library, it subclasses pd.DataFrame and embeds a custom geometry column that stores the geo objects (Point, Polygon, LineString ...). It would be great if the plots could be drawn from the geometry automatically, without casting points or specifying that you want polygons...

Tell me if you want more details about this.

nicolaskruchten commented 5 years ago

Sounds like an interesting idea! Can you tell me more about the API you would envision and the resulting output?

mazzma12 commented 5 years ago

Glad you like it!

Basically, I think px could set rational defaults for different geom_types of the GeoDataFrame by using the API provided by geopandas.

Here is an example in pseudo-code just to showcase the API if you are not familiar with it


import geopandas

# df is whatever data frame was passed to px
if isinstance(df, geopandas.GeoDataFrame):
    gdf = df  # I know I am geo
    geom_types = set(gdf.geom_type.unique())

    if geom_types == {'Polygon'}:
        # Treat as polygons: hand the GeoJSON-like mapping to plotly
        geojson = gdf.__geo_interface__
    elif geom_types == {'Point'}:
        # Set rational defaults
        lon = gdf.geometry.x
        lat = gdf.geometry.y  # Mind the x, y order
        bounds = gdf.total_bounds  # (minx, miny, maxx, maxy), might be useful for zoom
    else:
        raise NotImplementedError("Only Point and Polygon supported atm")

If I remember correctly, plotly uses the GeoJSON format for its API. In that case, calling gdf.__geo_interface__ might be more advisable than accessing the geometry property (at least for polygons).

More about the "geometry" column here
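
To make the "mind the x, y" remark concrete, here is a minimal sketch of reading points through the geo interface. It uses a plain dict as a stand-in for gdf.__geo_interface__, and both the sample coordinates and the helper name lon_lat_from_features are made up for illustration. The only fact it relies on is that GeoJSON positions are ordered [longitude, latitude].

```python
# Stand-in for gdf.__geo_interface__ on a GeoDataFrame of points
# (hypothetical sample data; a real mapping would come from geopandas).
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "0", "properties": {"name": "A"},
         "geometry": {"type": "Point", "coordinates": [2.35, 48.85]}},
        {"type": "Feature", "id": "1", "properties": {"name": "B"},
         "geometry": {"type": "Point", "coordinates": [-73.57, 45.50]}},
    ],
}


def lon_lat_from_features(collection):
    """Return (lon, lat) lists from a FeatureCollection of Point features."""
    lon, lat = [], []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Point":
            raise NotImplementedError("Only Point geometries are handled here")
        x, y = geometry["coordinates"][:2]  # GeoJSON order: x = lon, y = lat
        lon.append(x)
        lat.append(y)
    return lon, lat


lon, lat = lon_lat_from_features(feature_collection)
print(lon, lat)
```

The resulting lists are the kind of values that lat= and lon= expect, which is why swapping x and y is an easy mistake to make.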

nicolaskruchten commented 5 years ago

OK cool, so what would you envision as a px API here? px.scatter_geo(line="geometry_column") or something like that? I'm not sure I see how geometry columns map onto the px or Plotly primitives at the moment...

mazzma12 commented 5 years ago

I will try to give more details about the API using an example from the gallery:

px.scatter_mapbox(carshare, lat="centroid_lat", lon="centroid_lon", color="peak_hour", size="car_hours", 
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)

I would expect that if carshare is an instance of geopandas.GeoDataFrame with Point geometries, the lat and lon columns would be discovered by the method automatically, so you would just have to call:

px.set_mapbox_access_token(open(".mapbox_token").read())
px.scatter_mapbox(carshare, color="peak_hour", size="car_hours", 
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)

The same thing could be done if your geometry is of type Polygon.

nicolaskruchten commented 5 years ago

I see. I think that just grabbing a Point column is a bit too automatic for my tastes... what about when there are two Point columns? I could see a case for px.scatter_mapbox(point="point_col") for sure though.

mazzma12 commented 5 years ago

Actually, the point of geopandas is to bring some structure to a DataFrame, with only one active geometry at a time. Hence it's rational to display this geometry by default, according to its geom_type.

There is no reason to have a GeoDataFrame with one geometry active while wanting to display another one. If that occurs, one can just use the GeoDataFrame.set_geometry() method, which acts much like DataFrame.set_index(), or override the lon, lat kwargs in the plot method.
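
As an illustration of the "one active geometry at a time" idea, here is a small sketch with a made-up data frame holding two Point columns. The column names pickup and dropoff are hypothetical; set_geometry() and the geometry accessor are the geopandas calls being shown.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical data: two Point columns, only one of which is active.
gdf = gpd.GeoDataFrame(
    {
        "name": ["a", "b"],
        "pickup": [Point(2.35, 48.85), Point(2.29, 48.86)],
        "dropoff": [Point(2.40, 48.80), Point(2.25, 48.90)],
    },
    geometry="pickup",
)

print(gdf.geometry.name)  # name of the active geometry column

# Switch the active geometry; by default this returns a new GeoDataFrame.
by_dropoff = gdf.set_geometry("dropoff")
print(by_dropoff.geometry.name)
print(list(by_dropoff.geometry.x))  # x coordinates of the new active geometry
```

A plotting function that defaults to the active geometry would therefore pick up whichever column was selected last, without any extra keyword.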

nicolaskruchten commented 5 years ago

Ah I see, thanks for that extra bit of context :)

Seems like a reasonable and reasonably small thing to add... Any obvious downsides?

As a sidenote: right now px doesn't look at the index of a data frame at all, and I couldn't think of a good default behaviour for it in most cases. Any opinions?

mazzma12 commented 5 years ago

> Seems like a reasonable and reasonably small thing to add... Any obvious downsides?

I can't think of any

> As a sidenote: right now px doesn't look at the index of a data frame at all, and I couldn't think of a good default behaviour for it in most cases. Any opinions?

I have run into that problem several times in other circumstances, and I haven't found a nice solution either. I usually ignore the index and assume that whoever wants to use it should reset it first.

But I reckon it's annoying to pass a column name and then realize it is actually the index. You could try resetting the index at the start (it's a copy anyway), but then you'll have to deal with other problems, such as potential duplicates in column names...

mazzma12 commented 5 years ago

> As a sidenote: right now px doesn't look at the index of a data frame at all, and I couldn't think of a good default behaviour for it in most cases. Any opinions?

I just came across a use case where it might be useful: when you pass a Series to the scatter plot, you would like the default to assume that x is the index and y is the values. At the moment the only way I have found to do it is a bit tedious:

  1. Reset the index (will cast the Series into a DataFrame)
  2. Pass the index name to x
  3. Pass the column name to y

nicolaskruchten commented 5 years ago

OK, thanks for the input! Basically in certain cases (2d-cartesian plots) you would like the default value of x to be the data frame index? This basically precludes the notion of having multiple data points at the same x value, as index values must be unique, right? Also I don't think we can easily support multi-level indices just yet. (plotly.js supports 2-level axes for 2d cartesian plots but this isn't exposed in px at the moment).

At this point I don't think we're going to support passing in Series rather than DataFrames directly, as we need the column names all over the place for labelling.

mazzma12 commented 5 years ago

Hi, I'll try not to digress too much, as this is not related to this issue.

> OK, thanks for the input! Basically in certain cases (2d-cartesian plots) you would like the default value of x to be the data frame index?

Yes.

> This basically precludes the notion of having multiple data points at the same x value, as index values must be unique, right?

It's not intuitive at first glance, but indices are not necessarily unique in pandas.

> Also I don't think we can easily support multi-level indices just yet. (plotly.js supports 2-level axes for 2d cartesian plots but this isn't exposed in px at the moment).

Of course, I only have the simple case of a 1D index in mind at the moment; just raise a NotImplementedError otherwise.

> At this point I don't think we're going to support passing in Series rather than DataFrames directly, as we need the column names all over the place for labelling.

It's actually quite easy to grab the x and y column names from the series ts by doing x=ts.index.name and y=ts.name. Then when you call ts.reset_index() it will return a new dataframe object with columns [x, y].
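
A minimal sketch of that idea, using a made-up series (the names day and sales are only for illustration). It also shows that a pandas index may hold duplicate values:

```python
import pandas as pd

# Hypothetical series; note that the index contains "mon" twice.
ts = pd.Series(
    [1.0, 2.5, 4.0],
    index=pd.Index(["mon", "mon", "tue"], name="day"),
    name="sales",
)

x, y = ts.index.name, ts.name  # names to use for the two axes
df = ts.reset_index()          # new DataFrame with columns [x, y]

print(list(df.columns))
print(ts.index.is_unique)
# df could then be handed on, e.g. px.scatter(df, x=x, y=y)
```

This only works as written when both the index and the series are named; an unnamed index comes back from reset_index() as a column called "index".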

Happy to detail a bit longer in another post if needed :)

nicolaskruchten commented 5 years ago

OK so re indexes there's another issue here now #37 where I outline a different approach :)

DmitriyG228 commented 5 years ago

It would be great to have geopandas support in order to plot shapely Polygons. IMHO, an even better solution would be to enable this in Plotly first.

andychang-1 commented 5 years ago

I would love to see this. I am currently drawing a map with a zip code overlay as well as individual colored data points in Folium.

Folium can't even support >1000 points without clusters, and I would much prefer to use plotly express for my task because it is far faster.

nicolaskruchten commented 5 years ago

This is something I'm looking into in September! :)

chriddyp commented 5 years ago

(In the interim, check out the new choroplethmapbox chart type! https://plot.ly/python/mapbox-county-choropleth/)

nicolaskruchten commented 4 years ago

We're wrapping up https://github.com/plotly/plotly.py/issues/1767 and then we'll tackle https://github.com/plotly/plotly.py/issues/1780 and add geopandas support :)

mazzma12 commented 4 years ago

Great ! For the implementation, you might want to take this into account for performance (with points geodataframe only) https://github.com/geopandas/geopandas/issues/964

emigre459 commented 4 years ago

As this is still marked as open I wanted to give it a bump - geopandas support via plotly express would be amazing!

ccdatapdx commented 4 years ago

Wanted to give this another bump, as geopandas support would be very helpful!

alsobay commented 4 years ago

Adding another voice to this -- would be very helpful, and happy to help develop if someone can describe a high-level blueprint of what to do.

nicolaskruchten commented 4 years ago

A quick update on this: we have pretty decent support (and no geopandas-specific documentation!) for displaying points and polygons with scatter_mapbox and choropleth_mapbox today, but the big gap is displaying line/multi-line data.

shakasom commented 4 years ago

Adding GeoPandas support will be a great addition to the library. What is possible with choropleth_mapbox now? I can figure out how to plot polygons from GeoPandas.

hydrogeohc commented 4 years ago

> Adding GeoPandas support will be a great addition to the library. What is possible with choropleth_mapbox now? I can figure out how to plot polygons from GeoPandas.

I would like to know the current status of plotting polygons or points from a GeoPandas data frame. Also, I would like to know if there has been any contribution for converting the LineString format from shapely to plotly.

Thanks :)!

ibhalin commented 4 years ago

I'm bumping this too ! It would be so helpful :)

kurt-rhee commented 3 years ago

Adding one more bump for what it is worth

armgilles commented 3 years ago

Bumping too :)

nicolaskruchten commented 3 years ago

Thanks for all the bumps :)

There is pretty decent support for GeoPandas right now, it's mostly a question of adding some examples to the docs really. If you have a geo data frame with point data you can use scattergeo or scattermapbox and manually set latitude and longitude. If you have a geo data frame with polygons, you can use the geojson argument to choropleth or choroplethmapbox.

The one place we don't have good support is if you have a geo data frame with line data.. this one will require more thought.

armgilles commented 3 years ago

Thanks @nicolaskruchten

I'm having trouble building the geojson argument for choropleth from a GeoPandas data frame. I tried my best to follow this example (from the Plotly docs):

import plotly.express as px

df = px.data.election()
geojson = px.data.election_geojson()

fig = px.choropleth(df, geojson=geojson, color="Bergeron",
                    locations="district", featureidkey="properties.district",
                    projection="mercator"
                   )
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

But for this toy example, geojson is already built ;)

nicolaskruchten commented 3 years ago

if gdf is a GeoPandas GeoDataFrame you should be able to just pass geojson=gdf.geometry I believe.

nicolaskruchten commented 3 years ago

This is why I'm saying it's mostly a documentation issue :)

armgilles commented 3 years ago

I've just made a toy example (maybe it could help for documentation):

import geopandas as gpd
import plotly.express as px

#  GeoJson from French Open-Data (french department)
url = "https://www.data.gouv.fr/fr/datasets/r/90b9341a-e1f7-4d75-a73c-bbc010c7feeb"

# Read file with geopandas
geo_df = gpd.read_file(url)
geo_df.head()

[image: output of geo_df.head()]

# Now using choropleth

fig = px.choropleth_mapbox(geo_df, 
                           geojson=geo_df.geometry, 
                           locations="nom", 
                           center={"lat": 48.8534, "lon": 2.3488},
                           zoom=4)
fig.show()

[image: base map with no polygons drawn]

No polygon display

nicolaskruchten commented 3 years ago

Yes, you'll probably need to map color to some data to get things to show up :)

armgilles commented 3 years ago

I'm adding some code:


import numpy as np

# Add a random value to use for the color
geo_df['random_color'] = np.random.randint(1, 6, geo_df.shape[0])

fig = px.choropleth_mapbox(geo_df, 
                           geojson=geo_df.geometry, 
                           locations="nom", 
                           center={"lat": 48.8534, "lon": 2.3488},
                           color="random_color",
                           mapbox_style="carto-positron", 
                           zoom=4)
fig.show()

Result : same as previous but I have a beautiful colormap in legend ;)

empet commented 3 years ago

@armgilles Your code doesn't work because your geojson=geo_df.geometry is not a geojson file. Choroplethmapbox accepts only a geojson file defined as a dict with the following structure:

geojson = {"type": "FeatureCollection",
                  "features": []
        }

That's why you have to convert the geo_df to a geojson file.

Here is working code:

import geopandas as gpd
import pandas as pd
import numpy as np
import plotly.express as px
import json

url = "https://www.data.gouv.fr/fr/datasets/r/90b9341a-e1f7-4d75-a73c-bbc010c7feeb"
geo_df = gpd.read_file(url)
#geo_df.head()

#convert the geo-dataframe to geojson
my_geojson = json.loads(geo_df.to_json())
#define a dataframe with data for choroplethmapbox
df = pd.DataFrame(dict(code=list(geo_df['code']),
                    ))
np.random.seed(123)
df['vals'] = np.random.randint(1, 8, geo_df.shape[0])
#df.head()

fig = px.choropleth_mapbox(df, 
                           geojson=my_geojson, 
                           featureidkey='properties.code',
                           locations="code", 
                           center={"lat": 47.35, "lon": 2.3},
                           color="vals",
                           mapbox_style="carto-positron", 
                           zoom=4)

It isn't recommended to pass geo_df to px.choropleth_mapbox instead of the newly defined data frame, df, because geo_df is a bigger object containing the geometry of all polygons, which is already passed via geojson=my_geojson.

nicolaskruchten commented 3 years ago

I did actually add special handling in PX for the .geometry case where it extracts the geojson internally... not sure why it's not working in this specific case!

nicolaskruchten commented 3 years ago

OK so I just needed to peek under the hood a bit...

passing geojson=geo_df.geometry does work, but locations must be set to geo_df.index in this case.

> It isn't recommended to pass geo_df to px.choropleth_mapbox instead of the newly defined data frame, df, because geo_df is a bigger object containing the geometry of all polygons, which is already passed via geojson=my_geojson.

I'll respectfully disagree here... the size of geo_df doesn't matter, and it is recommended to set data_frame=geo_df for GeoPandas dataframes: PX only extracts the columns it needs (so the number of columns doesn't matter), and is able to extract the geojson from geo_df.geometry as it specifically looks for the __geo_interface__ attribute, here https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/express/_chart_types.py#L1147
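
To illustrate the attribute lookup described above, here is a simplified sketch. It is not PX's actual code: resolve_geojson and FakeGeometryColumn are hypothetical names, and the class merely stands in for an object such as geo_df.geometry.

```python
class FakeGeometryColumn:
    """Hypothetical stand-in for an object such as geo_df.geometry."""

    def __init__(self, features):
        self._features = features

    @property
    def __geo_interface__(self):
        return {"type": "FeatureCollection", "features": self._features}


def resolve_geojson(geojson):
    """Use __geo_interface__ when present, else assume a GeoJSON dict."""
    if hasattr(geojson, "__geo_interface__"):
        return geojson.__geo_interface__
    return geojson


square = {
    "type": "Feature",
    "id": "0",
    "properties": {},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
}

from_object = resolve_geojson(FakeGeometryColumn([square]))
from_dict = resolve_geojson({"type": "FeatureCollection", "features": [square]})
print(from_object == from_dict)
```

Either input ends up as the same FeatureCollection dict, which is why both a ready-made GeoJSON dict and a geometry column can be accepted by the same argument.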

nicolaskruchten commented 3 years ago

In any case: none of this is documented yet under plotly.com/python which is why this issue remains open :)

nicolaskruchten commented 3 years ago

Here's a complete/simple example (edited to remove the unnecessary .__geo_interface__ I'd left in for testing :)

import numpy as np
import geopandas as gpd
import plotly.express as px

#  GeoJson from French Open-Data (french department)
url = "https://www.data.gouv.fr/fr/datasets/r/90b9341a-e1f7-4d75-a73c-bbc010c7feeb"

# Read file with geopandas
geo_df = gpd.read_file(url)
geo_df['random_color'] = np.random.randint(1, 6, geo_df.shape[0])
fig = px.choropleth_mapbox(geo_df, 
                           geojson=geo_df.geometry, 
                           locations=geo_df.index, 
                           color='random_color',
                           center={"lat": 48.8534, "lon": 2.3488},
                           mapbox_style="open-street-map",
                           zoom=4)
fig.show()

[image: resulting choropleth map]

armgilles commented 3 years ago

Thanks @nicolaskruchten & @empet for your example 🥇

Using the GeoPandas data frame and its geometry as the geojson argument is pretty clever. I didn't understand the locations argument before; now it's clear.

A small remark: with the previous code, setting locations=geo_df.code displays the figure but with some holes:

[image: choropleth map with some polygons missing]

I don't understand why (maybe the string type?)

nicolaskruchten commented 3 years ago

The locations key serves to identify which polygon in geojson the values from color should match to. In the base case where the values in color come from the same data frame as the polygons, using geo_df.index is the only thing that makes sense basically. If you set it to some other sequence of numbers you'll get a map but the colors won't match the polygons. If you set it to a string column, in some cases the numeric-string-to-number comparison will work like it did above and you'll get an odd result. In this case it looks like 01 through 09 didn't get matched and everything else did.
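
A toy model of that matching may help. It assumes the feature ids derived from a geometry column are the row positions written as strings, and that matching is plain string equality; plotly.js's real rules may differ, in particular for numeric strings. The helper names feature_id and match_locations are made up.

```python
def feature_id(feature, featureidkey="id"):
    """Follow a dotted path such as 'properties.code' inside a feature."""
    value = feature
    for part in featureidkey.split("."):
        value = value[part]
    return value


def match_locations(geojson, locations, featureidkey="id"):
    """Map each location to the index of its feature, or None if unmatched."""
    ids = {
        str(feature_id(f, featureidkey)): i
        for i, f in enumerate(geojson["features"])
    }
    return {loc: ids.get(str(loc)) for loc in locations}


# Hypothetical features whose ids are row positions (an assumption about
# what a geometry column with a default integer index would produce).
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": str(i), "properties": {"code": code},
         "geometry": None}
        for i, code in enumerate(["01", "02", "03"])
    ],
}

by_index = match_locations(geojson, [0, 1, 2])
by_code = match_locations(geojson, ["01", "02", "03"])
by_code_key = match_locations(geojson, ["01", "02", "03"], "properties.code")
print(by_index, by_code, by_code_key, sep="\n")
```

In this model the index values match every feature, the zero-padded codes match nothing against the default id, and the codes match again once featureidkey points at properties.code.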

I've added a GeoPandas example similar to the one above to each of the following pages btw:

I chose a different dataset because, confusingly, the one you're using is actually a GeoJSON object already... you're loading it via GeoPandas but you could also have just loaded it as a dict ;) The examples I've used above for choropleths load data from a shapefile, which I hope is less likely to be confusing for users.

armgilles commented 3 years ago

thank you for the explanations !

You did well to choose a different dataset. I hope it helps the community :)

nicolaskruchten commented 3 years ago

While I was in there, I added some GeoPandas examples to:

It's not as graceful as the polygon/point support but at least it's in the docs now :)

I'll close this issue in favour of more specific proposals in the Plotly.py repo such as https://github.com/plotly/plotly.py/issues/2601