Read_from_postgis functions fail for chunksize!=None

henrymartin1 commented 2 years ago

The error can be reproduced by setting the chunksize argument in any of the test_read tests in test_postgis, e.g., here.

The problem seems to be that gpd.GeoDataFrame.from_postgis returns a generator instead of a GeoDataFrame when chunksize is set. The documentation of gpd.GeoDataFrame.from_postgis says that one should use gpd.read_postgis instead; maybe this already fixes the problem.

gdf = <generator object _read_postgis.<locals>.<genexpr> at 0x00000210EBC83EB0>
set_names = {'finished_at': 'finished_at', 'started_at': 'started_at', 'user_id': 'user_id'}
geom_col = None, crs = None, tz_cols = ['started_at', 'finished_at'], tz = None

    def _trackintel_model(gdf, set_names=None, geom_col=None, crs=None, tz_cols=None, tz=None):
        """Help function to assure the trackintel model on a GeoDataFrame.

        Parameters
        ----------
        gdf : GeoDataFrame
            Input GeoDataFrame

        set_names : dict, optional
            Renaming dictionary for the columns of the GeoDataFrame.

        set_geometry : str, optional
            Set geometry of GeoDataFrame.

        crs : pyproj.crs or str, optional
            Set coordinate reference system. The value can be anything accepted
            by pyproj.CRS.from_user_input(), such as an authority string
            (eg "EPSG:4326") or a WKT string.

        tz_cols : list, optional
            List of timezone aware datetime columns.

        tz : str, optional
            pytz compatible timezone string. If None UTC will be assumed

        Returns
        -------
        gdf : GeoDataFrame
            The input GeoDataFrame transformed to match the trackintel format.
        """
        if set_names is not None:
>           gdf = gdf.rename(columns=set_names)
E           AttributeError: 'generator' object has no attribute 'rename'

trackintel\io\from_geopandas.py:399: AttributeError

bifbof commented 2 years ago

Uff :D To be honest, I am not quite sure if we can fix this one. Most of our functions depend on having the whole dataset in memory for groupby, sorting, and such. Unless we consume the iterator into one big dataframe, this problem persists, but then the chunksize parameter is not that useful.

What would be your use case?

I would rather add a more useful error message.
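
For reference, "consuming the iterator into a big dataframe" could look roughly like the sketch below. It is only an illustration, not trackintel code: `consume_chunks` is a hypothetical helper, and plain pandas DataFrames stand in for the GeoDataFrame chunks that a chunked read would yield.

```python
import pandas as pd


def consume_chunks(result):
    """Return one DataFrame, whether `result` is a frame or an iterator of frames.

    Hypothetical helper for illustration only; it is not part of trackintel.
    """
    if isinstance(result, pd.DataFrame):
        # chunksize=None case: the reader already returned a single frame.
        return result
    # Chunked case: materialise all chunks, then concatenate them.
    # Note that this raises ValueError if the iterator yields no chunks.
    return pd.concat(list(result), ignore_index=True)


# Stand-in for the generator returned by a chunked read (assumed shape).
chunks = (pd.DataFrame({"user_id": [i, i], "value": [1, 2]}) for i in range(3))
df = consume_chunks(chunks)

# The result is an ordinary frame again, so .rename() works.
df = df.rename(columns={"value": "val"})
```

As noted above, the full dataset still ends up in memory, so this only lowers the peak overhead during the read itself.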

henrymartin1 commented 2 years ago

Hm... I see your point. I am currently working with a large dataset that barely fits into memory. Reading from or writing to PostGIS seems to add some overhead, which is enough to push memory consumption over the limit so that these operations fail. The overhead seems to be lower if chunksize!=None, meaning that I can send/read the data without it failing.

I am not sure how we would change it, to be honest, so it might be best to simply add a better error message or a check that says the chunksize argument is not supported at the moment. At least until we have a proper big data strategy for trackintel ;-)
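
A minimal sketch of such a check, assuming it would be called at the top of each read function before any query is run. The name `reject_chunksize` and the message wording are made up for illustration and are not taken from the trackintel code base.

```python
def reject_chunksize(chunksize=None):
    """Fail early with a clear message if chunked reading is requested.

    Hypothetical helper for illustration only; it is not part of trackintel.
    """
    if chunksize is not None:
        raise NotImplementedError(
            "The chunksize argument is not supported at the moment because "
            "the whole dataset has to be in memory. "
            f"Got chunksize={chunksize!r}; use chunksize=None instead."
        )


# chunksize=None passes silently.
reject_chunksize(None)

# Any other value raises before a generator can reach the model checks.
try:
    reject_chunksize(1000)
except NotImplementedError as err:
    message = str(err)
```

Raising here would replace the AttributeError from `gdf.rename` with a message that names the actual cause.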

mie-lab / trackintel

Read_from_postgis functions fail for chunksize!=None #416