holoviz / spatialpandas

Pandas extension arrays for spatial/geometric operations
BSD 2-Clause "Simplified" License

`pack_partitions_to_parquet` stalling on concat_parts. #29

Closed dharhas closed 4 years ago

dharhas commented 4 years ago

ALL software version info

python 3.8 (I think I tried 3.7 as well, need to check)
spatialpandas 0.3.5
dask 2.11.0
distributed 2.11.0

Description of expected behavior and the observed behavior

Attempting to read in some astronomy data, create a spatial index, and then pack the spatially sorted data to parquet.

`pack_partitions_to_parquet` keeps stalling on `concat_parts`; everything seems to process fine until then.

It does create a directory structure with the correct number of partition folders, and it does create a small parquet file in each, i.e.

test_sorted/part.0.parquet/part.0.parquet
test_sorted/part.1.parquet/part.0.parquet
...

Some folders have multiple parts, but they are all still tiny.
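
A quick way to confirm the parts really are undersized (a minimal sketch; it assumes the packed output lands in test_sorted.parq, as in the repro code below) is to list every part file with its on-disk size:

import pathlib

# walk the packed output and print each part file alongside its size in bytes
for part in sorted(pathlib.Path('test_sorted.parq').rglob('*.parquet')):
    print(part, part.stat().st_size)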

Complete, minimal, self-contained example code that reproduces the issue

Unzip the parquet file below: test.parq.zip

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

from spatialpandas import GeoDataFrame
from spatialpandas.geometry import PointArray

# mainly to get a dask diagnostic dashboard
cluster = LocalCluster()
client = Client(cluster)
client

ddf = dd.read_parquet('test.parq')
# convert each partition to a GeoDataFrame with a point geometry column
ddf2 = ddf.map_partitions(
    lambda df: GeoDataFrame(dict(
        position=PointArray(df[['ra', 'dec']]),
        **{col: df[col] for col in df.columns}
    ))
)
ddf_packed = ddf2.pack_partitions_to_parquet('test_sorted.parq')

Screenshots or screencasts of the bug in action

[screenshot]

jonmmease commented 4 years ago

Thanks for the report @dharhas. Something seems to be going wrong with reading the parquet file using Dask / pyarrow.

Here's an example that doesn't use spatialpandas:

import dask.dataframe as dd
import dask
dask.config.set(scheduler='single-threaded')

ddf = dd.read_parquet('test.parq', engine='pyarrow')
ddf2 = ddf.map_partitions(lambda df: df)
len(ddf2)

When I run this, the kernel dies without an error message. Could you try out this snippet on your end?

jonmmease commented 4 years ago

I think I've narrowed it down to the `label` column. This doesn't crash for me:

import dask.dataframe as dd
import dask
dask.config.set(scheduler='single-threaded')

ddf = dd.read_parquet(
    'test.parq', engine='pyarrow',
    columns=[
        'base_Footprint_nPix', 'Gaussian-PSF_magDiff_mmag', 'CircAper12pix-PSF_magDiff_mmag',
        'Kron-PSF_magDiff_mmag', 'CModel-PSF_magDiff_mmag', 'traceSdss_pixel',
        'traceSdss_fwhm_pixel', 'psfTraceSdssDiff_percent', 'e1ResidsSdss_milli',
        'e2ResidsSdss_milli', 'merge_measurement_N515', 'qaBad_flag', 'patchId', 'dec',
#         'label',
        'psfMag', 'ra', 'filter', 'dataset', 'tractId'
    ]
)
ddf2 = ddf.map_partitions(lambda df: df)
len(ddf2)

Basic reading using pandas seems to work ok. Here's the label column:

import pandas as pd
df = pd.read_parquet('test.parq/part.0.parquet', engine='pyarrow')
df.label
id
42288338190731017      star
42287234384146819      null
42287496377152672      star
42288205046754755      star
42287659585920358    galaxy
                      ...  
42287638111099352    galaxy
42288196456839415    galaxy
42288333895771720    galaxy
42287376118090355    galaxy
42287526441937357    galaxy
Name: label, Length: 93, dtype: object

My hunch is that this has something to do with the null entries in this column.
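
One quick check of that hunch (a minimal sketch against the sample part file) is to count genuine missing values separately from rows that hold the literal string 'null':

import pandas as pd

df = pd.read_parquet('test.parq/part.0.parquet', engine='pyarrow')

# true missing values (NaN/None) vs. rows containing the literal string 'null'
print(df['label'].isna().sum())
print((df['label'] == 'null').sum())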

timothydmorton commented 4 years ago

It would be nice to not have to dump those nulls before writing, but if that's what I have to do for now, that's a fine workaround.
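
For reference, dumping the nulls before writing could look like this minimal sketch (the 'unknown' fill value and the output filename are hypothetical placeholders, not part of the original report):

import pandas as pd

df = pd.read_parquet('test.parq/part.0.parquet', engine='pyarrow')

# drop rows whose label is missing (alternatively, fill them instead:
# df['label'] = df['label'].fillna('unknown'))
df = df.dropna(subset=['label'])
df.to_parquet('test_nonull.parquet', engine='pyarrow')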

dharhas commented 4 years ago

@jonmmease original dataset is here:

https://quansight.nyc3.digitaloceanspaces.com/datasets/lsst/DM-21335-1Perc.tar.gz

Any of the Coadd parquet data files will do. I did a .sample to make them smaller; not sure if that broke anything, but the smaller file worked on my computer... Will try to recreate your read error in a new env.

dharhas commented 4 years ago

@timothydmorton is the data in that column just strings or are they some sort of object? Can they be cast to just strings?

timothydmorton commented 4 years ago

Strings is fine. Dunno why they're not.


dharhas commented 4 years ago

Playing around with Kartothek, I realized that the label column is a categorical type:

coadd_df.label.head()
id
42287221499236322      star
42287771255071859      star
42287496377175835    galaxy
42287775550046995    galaxy
42288192161843167    galaxy
Name: label, dtype: category
Categories (3, object): [star, galaxy, null]
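
Given that, the cast-to-strings idea suggested above could be done per file; a minimal sketch (the output filename is a hypothetical placeholder):

import pandas as pd

df = pd.read_parquet('test.parq/part.0.parquet', engine='pyarrow')

# cast the categorical labels to plain strings before re-writing
df['label'] = df['label'].astype(str)
df.to_parquet('test_strlabel.parquet', engine='pyarrow')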

dharhas commented 4 years ago

This actually seems to be a problem in the data. Dask cannot read the label column with either fastparquet or pyarrow, while pandas seems to do fine. We have a workaround, so this is no longer relevant.