scikit-hep / uproot3

ROOT I/O in pure Python and NumPy.
BSD 3-Clause "New" or "Revised" License

daskframe determines wrong dtypes (rel 3.12.0) #523

Closed ArthurBolz closed 3 years ago

ArthurBolz commented 3 years ago

When a file is opened as a Dask DataFrame via uproot.daskframe, the branch data types are not determined correctly:

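import uproot  # Uproot 3.x (release 3.12.0 in this report)

# Read the tree eagerly into a pandas DataFrame: dtypes match the branches.
file = uproot.open("http://scikit-hep.org/uproot/examples/Zmumu.root")
df = file["events"].pandas.df()
print("pandas dataframe dtypes")
print(df.dtypes)

# Read the same tree lazily as a Dask DataFrame: every dtype comes back as float64.
ddf = uproot.daskframe("http://scikit-hep.org/uproot/examples/Zmumu.root", "events")
print("\ndask dataframe dtypes")
print(ddf.dtypes)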

pandas dataframe dtypes
Type      object
Run        int32
Event      int32
E1       float64
px1      float64
py1      float64
pz1      float64
pt1      float64
eta1     float64
phi1     float64
Q1         int32
E2       float64
px2      float64
py2      float64
pz2      float64
pt2      float64
eta2     float64
phi2     float64
Q2         int32
M        float64
dtype: object

dask dataframe dtypes
Type     float64
Run      float64
Event    float64
E1       float64
px1      float64
py1      float64
pz1      float64
pt1      float64
eta1     float64
phi1     float64
Q1       float64
E2       float64
px2      float64
py2      float64
pz2      float64
pt2      float64
eta2     float64
phi2     float64
Q2       float64
M        float64
dtype: object
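
To list the affected columns without reading both outputs by eye, the two dtype listings can be compared programmatically. The following is a minimal sketch, not part of the original report: `dtype_mismatches` is a hypothetical helper, and the two small frames are stand-ins with made-up values that imitate three of the columns above, so the snippet runs without network access or Uproot.

```python
import numpy as np
import pandas as pd


def dtype_mismatches(expected, reported):
    """Return the columns whose dtype names differ between two dtype Series.

    Both arguments are Series such as `df.dtypes`; dtypes are compared by name.
    """
    table = pd.DataFrame({
        "expected": expected.astype(str),
        "reported": reported.astype(str),
    })
    return table[table["expected"] != table["reported"]]


# Stand-in for the eagerly read pandas frame (values are illustrative only).
expected_frame = pd.DataFrame({
    "Type": pd.Series(["GT"], dtype=object),
    "Run": np.array([148031], dtype=np.int32),
    "E1": np.array([82.2], dtype=np.float64),
})

# Stand-in for the lazily read frame, where every column is reported as float64.
reported_frame = pd.DataFrame({
    "Type": np.array([np.nan], dtype=np.float64),
    "Run": np.array([148031.0], dtype=np.float64),
    "E1": np.array([82.2], dtype=np.float64),
})

mismatches = dtype_mismatches(expected_frame.dtypes, reported_frame.dtypes)
print(mismatches)
```

With the real objects from the report, the call would presumably be `dtype_mismatches(df.dtypes, ddf.dtypes)`, assuming `ddf.dtypes` is a pandas Series like `df.dtypes`.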

ArthurBolz commented 3 years ago

Seems to be a problem with dask.array.

jpivarski commented 3 years ago

In Uproot 4, I've removed the Dask array and frame interfaces until I understand them better. They require arrays to have some very NumPy-like features, such as a dtype, which can't be satisfied by Awkward arrays, and we need to use Awkward arrays to implement the virtualness.

Since I recommend that anyone who is not locked into the old API move to the new one (it's documented!), it's best not to rely on getting Dask objects from Uproot.

I think that Dask is important and we should eventually use it, but it might require some development on both sides to loosen some of Dask's expectations about the interface to arrays.
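
As a rough illustration of the dtype requirement raised in this comment, the sketch below uses plain NumPy only (no Dask, Awkward, or Uproot calls), and the values are hypothetical. A flat branch with one number per entry is described by a single numeric dtype. Variable-length data, which is the kind of data Awkward arrays are designed to hold, only fits into NumPy as a generic object array, so it has no single numeric dtype to advertise to a consumer that expects one up front.

```python
import numpy as np

# Flat data: one number per entry, so one numeric dtype describes the whole array.
flat = np.array([82.2, 62.3, 62.3])
print(flat.dtype)

# Variable-length data: a different number of values per entry.
# NumPy can only hold this as an array of Python objects (here, lists),
# so the dtype says nothing about the numbers inside.
jagged = np.array([[1.0, 2.0], [3.0], []], dtype=object)
print(jagged.dtype, jagged.shape)
```

This is only an analogy under the stated assumptions; it does not show how Uproot or Dask handle such data internally.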