iliatimofeev / gpdvega

gpdvega is a bridge between GeoPandas and Altair that lets you seamlessly chart geospatial data
https://iliatimofeev.github.io/gpdvega/
BSD 3-Clause "New" or "Revised" License

fix maxrows error #4

Closed afonit closed 6 years ago

afonit commented 6 years ago

The max rows error still occurs with the current configuration. I had to take out alt.limit_rows. After doing that, I can plot large geographic datasets.

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 18


Totals Coverage Status
Change from base Build 17: 0.0%
Covered Lines: 123
Relevant Lines: 124

iliatimofeev commented 6 years ago

Thank you for your contribution. The max rows case is not covered by tests for now; I'll check what I can do. But I'd rather understand why 'pipe' does not work as expected and fix it than remove functionality.

afonit commented 6 years ago

Sure thing, thanks for taking a look at the request.

Here is a small reproducible example based on the readme.

I am just adding some more points to push it over the row limit.

In this example you will still get the max rows error:

import altair as alt
import geopandas as gpd
import gpdvega
import pandas as pd
from shapely.geometry import Point
from gpdvega import gpd_to_values

alt.data_transformers.register(
    'gpd_to_values',
    lambda data: alt.pipe(data, alt.limit_rows, gpd_to_values)
)

alt.data_transformers.enable('gpd_to_values')

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# GeoDataFrame could be passed as usual pd.DataFrame
chart_one = alt.Chart(world[world.continent!='Antarctica']).mark_geoshape(
).project(
).encode(
    color='pop_est', # shorthand infer types as for regular pd.DataFrame
    tooltip='id:Q' # GeoDataFrame.index is accessible as id
).properties(
    width=500,
    height=300
)

# generate some points to push us over the max rows
some = [[-70.05179, 25.10815] for x in range(6000)]

some = pd.DataFrame(some, columns=['x', 'y'])

some['Coordinates'] = list(zip(some.x, some.y))
some['Coordinates'] = some['Coordinates'].apply(Point)
gdfo = gpd.GeoDataFrame(some, geometry='Coordinates')
chart_two = alt.Chart(gdfo).mark_point(color='red').encode(
    longitude='x:Q',
    latitude='y:Q'
)

chart_one + chart_two

But then if we change this line:

 lambda data: alt.pipe(data, alt.limit_rows, gpd_to_values)

to:

 lambda data: alt.pipe(data, gpd_to_values)

we get the plot with the code below:

import altair as alt
import geopandas as gpd
import gpdvega
import pandas as pd
from shapely.geometry import Point
from gpdvega import gpd_to_values

alt.data_transformers.register(
    'gpd_to_values',
    lambda data: alt.pipe(data, gpd_to_values)
)

alt.data_transformers.enable('gpd_to_values')

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# GeoDataFrame could be passed as usual pd.DataFrame
chart_one = alt.Chart(world[world.continent!='Antarctica']).mark_geoshape(
).project(
).encode(
    color='pop_est', # shorthand infer types as for regular pd.DataFrame
    tooltip='id:Q' # GeoDataFrame.index is accessible as id
).properties(
    width=500,
    height=300
)

# generate some points to push us over the max rows
some = [[-70.05179, 25.10815] for x in range(6000)]

some = pd.DataFrame(some, columns=['x', 'y'])

some['Coordinates'] = list(zip(some.x, some.y))
some['Coordinates'] = some['Coordinates'].apply(Point)
gdfo = gpd.GeoDataFrame(some, geometry='Coordinates')
chart_two = alt.Chart(gdfo).mark_point(color='red').encode(
    longitude='x:Q',
    latitude='y:Q'
)

chart_one + chart_two

[image: screenshot of the resulting plot]

afonit commented 6 years ago

@iliatimofeev, OK, after reading through this and the Altair codebase, I think I now understand.

The limit_rows is expecting a max_rows argument.

So this works:

lambda data: alt.pipe(data, alt.limit_rows(max_rows=100000), gpd_to_values)

or in my case since I did not want to limit any rows this works also:

lambda data: alt.pipe(data, gpd_to_values)

but this line as it currently is in the geodata.py file will still cause the max_rows error:

lambda data: alt.pipe(data, alt.limit_rows, gpd_to_values)

So is it safe to say that the gpdvega geodata.py file should either pass the max_rows parameter to alt.limit_rows or leave alt.limit_rows out entirely?

I would love to modify my pull request depending on what you would like to have happen.
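To make the three variants above concrete, here is a minimal sketch that does not use Altair or gpdvega at all. The functions limit_rows, to_values and pipe below are hypothetical stand-ins written only for illustration; the one fact carried over from Altair is that its row limit defaults to 5000. Under those assumptions, the bare limit_rows step rejects 6000 rows, while passing a larger max_rows (or None) lets them through.

```python
from functools import partial, reduce


class MaxRowsError(Exception):
    """Stand-in for the error raised when a dataset has too many rows."""


def limit_rows(data, max_rows=5000):
    """Hypothetical stand-in for alt.limit_rows: return data or raise."""
    if max_rows is not None and len(data) > max_rows:
        raise MaxRowsError(f"{len(data)} rows exceeds max_rows={max_rows}")
    return data


def to_values(data):
    """Hypothetical stand-in for gpd_to_values: wrap the rows in a dict."""
    return {"values": list(data)}


def pipe(data, *funcs):
    """Thread data through funcs from left to right."""
    return reduce(lambda acc, func: func(acc), funcs, data)


# 6000 identical points, as in the reproducible example.
rows = [{"x": -70.05179, "y": 25.10815}] * 6000

# The bare step keeps its default limit of 5000, so 6000 rows are rejected.
try:
    pipe(rows, limit_rows, to_values)
    default_ok = True
except MaxRowsError:
    default_ok = False

# A higher limit, or no limit at all, lets the same data through.
raised = pipe(rows, partial(limit_rows, max_rows=100000), to_values)
unlimited = pipe(rows, partial(limit_rows, max_rows=None), to_values)
```

In this sketch functools.partial plays the role that the curried call alt.limit_rows(max_rows=100000) plays in the real pipeline.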

afonit commented 6 years ago

Alright, so this looks like it was a misunderstanding on my part - based on some earlier errors I was getting that I had posted in another issue. I will think through this a bit more and see if there is a clarification I can make in the documentation, or if this is just a perception issue I had.

iliatimofeev commented 6 years ago

@afonit there is a bug in gpdvega. It is expected to work the way Altair does:

alt.data_transformers.enable('gpd_to_values',max_rows=None)

The transformer should be registered slightly differently to work as expected. It's my mistake; thank you for finding it.

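import altair as alt
from gpdvega import gpd_to_values
from toolz.curried import curry, pipe

# Curried so that options such as max_rows can be supplied before the data.
@curry
def gpd_to_values_data_transformer(data, max_rows=5000):
    # Enforce the row limit first, then convert the GeoDataFrame to values.
    return pipe(data, alt.limit_rows(max_rows=max_rows), gpd_to_values)

alt.data_transformers.register(
    'gpd_to_values',
    gpd_to_values_data_transformer
)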
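As a rough illustration of why the registration matters, here is a simplified sketch. TransformerRegistry is a hypothetical stand-in, not Altair's actual registry; it assumes only that the options given to enable() are forwarded as keyword arguments to the registered function. Under that assumption, a transformer that accepts max_rows honours max_rows=None, while a lambda that takes only data cannot receive the option.

```python
from functools import partial


class TransformerRegistry:
    """Hypothetical, simplified registry; not Altair's implementation."""

    def __init__(self):
        self._plugins = {}
        self._active = None
        self._options = {}

    def register(self, name, func):
        self._plugins[name] = func

    def enable(self, name, **options):
        # Remember which transformer is active and which options to forward.
        self._active = name
        self._options = options

    def get(self):
        func = self._plugins[self._active]
        return partial(func, **self._options) if self._options else func


def limit_rows(data, max_rows=5000):
    """Stand-in row limit: return data or raise if it is too long."""
    if max_rows is not None and len(data) > max_rows:
        raise ValueError(f"{len(data)} rows exceeds max_rows={max_rows}")
    return data


def transformer(data, max_rows=5000):
    """Accepts max_rows, so the option from enable() reaches limit_rows."""
    return {"values": list(limit_rows(data, max_rows=max_rows))}


registry = TransformerRegistry()
registry.register("with_option", transformer)
registry.register("fixed", lambda data: {"values": list(limit_rows(data))})

rows = list(range(6000))

# The keyword-accepting transformer honours max_rows=None.
registry.enable("with_option", max_rows=None)
result = registry.get()(rows)

# The lambda takes only data, so forwarding max_rows fails with TypeError.
registry.enable("fixed", max_rows=None)
try:
    registry.get()(rows)
    fixed_accepts_option = True
except TypeError:
    fixed_accepts_option = False
```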