Closed dharhas closed 4 years ago
Thanks for the report @dharhas. Something seems to be going wrong with reading the parquet file using Dask / pyarrow.
Here's an example that doesn't use spatialpandas
import dask.dataframe as dd
import dask
dask.config.set(scheduler='single-threaded')
ddf = dd.read_parquet('test.parq', engine='pyarrow')
ddf2 = ddf.map_partitions(lambda df: df)
len(ddf2)
When I run this, the kernel dies without an error message. Could you try out this snippet on your end?
I think I've narrowed it down to the label column. This doesn't crash for me:
import dask.dataframe as dd
import dask
dask.config.set(scheduler='single-threaded')
ddf = dd.read_parquet(
    'test.parq', engine='pyarrow',
    columns=[
        'base_Footprint_nPix', 'Gaussian-PSF_magDiff_mmag', 'CircAper12pix-PSF_magDiff_mmag',
        'Kron-PSF_magDiff_mmag', 'CModel-PSF_magDiff_mmag', 'traceSdss_pixel',
        'traceSdss_fwhm_pixel', 'psfTraceSdssDiff_percent', 'e1ResidsSdss_milli',
        'e2ResidsSdss_milli', 'merge_measurement_N515', 'qaBad_flag', 'patchId', 'dec',
        # 'label',
        'psfMag', 'ra', 'filter', 'dataset', 'tractId'
    ]
)
ddf2 = ddf.map_partitions(lambda df: df)
len(ddf2)
Basic reading using pandas seems to work OK. Here's the label column:
import pandas as pd
df = pd.read_parquet('test.parq/part.0.parquet', engine='pyarrow')
df.label
id
42288338190731017 star
42287234384146819 null
42287496377152672 star
42288205046754755 star
42287659585920358 galaxy
...
42287638111099352 galaxy
42288196456839415 galaxy
42288333895771720 galaxy
42287376118090355 galaxy
42287526441937357 galaxy
Name: label, Length: 93, dtype: object
My hunch is that this has something to do with the null entries in this column.
It would be nice to not have to dump those nulls before writing, but if that's what I have to do for now, that's a fine workaround.
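A minimal sketch of that workaround, assuming the data is loaded into a plain pandas DataFrame first (the column names here are hypothetical stand-ins; the actual label column also contains the literal string "null", per the output above):

```python
import pandas as pd

# Toy frame standing in for the real data: "label" has both a real missing
# value and a literal "null" string, mirroring the column shown above.
df = pd.DataFrame({
    "label": ["star", None, "galaxy", "null", "star"],
    "psfMag": [21.2, 19.8, 22.5, 20.1, 18.9],
})

# Drop rows whose label is missing or is the placeholder string "null".
clean = df[df["label"].notna() & (df["label"] != "null")]

# Then write the cleaned frame back out before handing it to Dask, e.g.:
# clean.to_parquet("test_clean.parq", engine="pyarrow")
```

This loses the rows with unknown labels, so it is only a stopgap until the underlying reader issue is fixed.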
@jonmmease original dataset is here:
https://quansight.nyc3.digitaloceanspaces.com/datasets/lsst/DM-21335-1Perc.tar.gz
Any of the Coadd parquet data files. I tried running .sample to make them smaller; not sure if that broke anything, but the smaller file worked on my computer... Will try to recreate your read error in a new env.
@timothydmorton is the data in that column just strings or are they some sort of object? Can they be cast to just strings?
Strings is fine. Dunno why they're not.
Playing around with Kartothek, I realized that the label column is a categorical type:
coadd_df.label.head()
id
42287221499236322 star
42287771255071859 star
42287496377175835 galaxy
42287775550046995 galaxy
42288192161843167 galaxy
Name: label, dtype: category
Categories (3, object): [star, galaxy, null]
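Given that the column is a Categorical with a literal "null" category, one hedged way to cast it to plain strings (as suggested above) is a sketch like this, where mapping "null" to a real missing value is an optional extra step, not something the original data requires:

```python
import pandas as pd

# Stand-in for the real column: a Categorical whose categories include the
# literal string "null", mirroring the dtype shown above.
s = pd.Series(
    pd.Categorical(
        ["star", "galaxy", "null", "star"],
        categories=["star", "galaxy", "null"],
    )
)

# Cast the categorical to plain strings (object dtype).
as_str = s.astype(str)

# Optionally turn the "null" placeholder into a genuine missing value.
as_str = as_str.replace("null", pd.NA)
```

Writing the column out as plain strings instead of a categorical may sidestep whatever the Dask/pyarrow reader is tripping over.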
This actually seems to be a problem in the data. Dask cannot read the label column with either fastparquet or pyarrow; pandas seems to do fine. We have a workaround, so this is no longer relevant.
ALL software version info
(this library, plus any other relevant software, e.g. bokeh, python, notebook, OS, browser, etc)
python 3.8 (I think I tried 3.7 as well, need to check)
spatialpandas 0.3.5
dask 2.11.0
distributed 2.11.0
Description of expected behavior and the observed behavior
Attempting to read in some astronomy data, create a spatial index, and then pack the spatially sorted data to parquet.
pack_partitions_to_parquet keeps stalling on concat_parts; everything seems to process fine until then.
It does create a directory structure with the correct number of partition folders, and it does create a small parquet file, i.e. some folders have multiple parts but they are still tiny.
Complete, minimal, self-contained example code that reproduces the issue
Unzip the parquet file below: test.parq.zip
Screenshots or screencasts of the bug in action