Closed gjoseph92 closed 3 years ago
Took a look at xarray tonight and I'm wondering how you think it might change the interface?
Well, it wouldn't change the soundDB interface at all. But it would mean that the return type would change depending on what kind of data you asked for, and how much, which is pretty confusing.
For example:
>>> soundDB.loudevents(ds).combine()
# returns a 4-D xarray
>>> soundDB.loudevents(ds).loc["all", :, :].combine()
# returns a pandas Panel
Or more importantly, it makes experimenting on a single entry and then scaling up a challenge, since the whole data structure changes:
>>> soundDB.loudevents(ds, n=1).combine()
# returns a pandas Panel
>>> soundDB.loudevents(ds).combine()
# returns a 4-D xarray
The reason it's an issue is that Accessor.combine() combines the data from multiple files into a single structure in a logical way, by putting them into the next-higher-dimensional data structure. For example, SRCID is 2D (in a DataFrame), so for multiple SRCIDs (covering the same dates), it might add another dimension for site, resulting in a 3D Panel. When a parser returns a 3D type from a single file, you'd need to combine them into 4D... which puts you in the realm of xarray.
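To make the "next-higher-dimensional structure" idea concrete, here is a minimal sketch of stacking per-site 2D tables into one 3D xarray object. It is not soundDB's actual combine() code; the site names, column names, and values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical per-site tables: same dates and columns, one DataFrame per site.
dates = pd.date_range("2016-01-01", periods=3, name="date")
columns = pd.Index(["a", "b"], name="column")
frames = {
    "SITE1": pd.DataFrame(np.zeros((3, 2)), index=dates, columns=columns),
    "SITE2": pd.DataFrame(np.ones((3, 2)), index=dates, columns=columns),
}

# Each 2-D DataFrame becomes a 2-D DataArray; concatenating them along a
# new "site" dimension gives a single 3-D structure.
combined = xr.concat(
    [xr.DataArray(df) for df in frames.values()],
    dim=pd.Index(list(frames), name="site"),
)

# Selecting one site drops back down to the original 2-D table.
site2 = combined.sel(site="SITE2").to_pandas()
```

The same call works unchanged if each input is already 3D, which is the case where pandas would have needed Panel4D.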
But let's be realistic: this is a non-issue for everything but LoudEvents and Metrics, since they're the only parsers that actually return Panels. (And Metrics sort of doesn't count, because it's so complicated it can't be combined across sites anyway, though it does rely on Panel4D.) Everything else would continue to work as normal with pandas >= 0.19.0.
So, you could drop support for Metrics and LoudEvents and eliminate the issue entirely (since those are the only types that return a >2D structure). Or, use xarray instead of pandas.Panel4D in Accessor.combine() since those are so rarely used anyway, the unexpected return type won't be an issue in normal use.
Or, use xarray instead of pandas.Panel4D in Accessor.combine() since those are so rarely used anyway, the unexpected return type won't be an issue in normal use.
I think this is how I'd prefer to approach the issue. Keep supporting the files, and for Metrics and Loudevents simply deal with the ugly difference in data structure between single and multiple files.
OR would it be possible to - instead of a pandas Panel - have Metrics and Loudevents in their singular form just return a 2D xarray object? And users (for these files) would then turn them into pandas DataFrames as needed.
have Metrics and Loudevents in their singular form just return a 2D xarray object?
Yes, this is the right answer. Having used xarray a bit now, I can say it's pretty good.
For LoudEvents, I think you can just pass paneldata to xarray.DataArray; might need to fix up the dimension names and coordinates.
Could be a little more work for Metrics, but might result in a cleaner implementation anyway.
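As a rough sketch of what "fixing up the dimension names and coordinates" could look like, assuming the parsed data is available as a 3D array: the values are invented, and the dimension name "item" is a placeholder, not something soundDB defines.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for a parsed 3-D block: 3 items x 2 dates x 24 hours (made-up values).
values = np.arange(3 * 2 * 24, dtype=float).reshape(3, 2, 24)

loud = xr.DataArray(
    values,
    dims=("item", "date", "hour"),  # "item" is a hypothetical dimension name
    coords={
        "item": ["above", "all", "percent"],
        "date": pd.date_range("2016-01-01", periods=2),
        "hour": np.arange(24),
    },
)

# Selecting one item leaves a 2-D array, which converts back to a DataFrame.
above = loud.sel(item="above").to_pandas()
```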
Well, finally got around to working on implementing things today. Since pd.Panel is now also deprecated, I skipped it entirely and used xarray.Dataset. I've got loudevents working, if not elegantly.
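# Split the 72 data columns into three 24-column variables
paneldata = xr.Dataset(data_vars={"above": data.iloc[:, 0:24],
                                  "all": data.iloc[:, 24:48],
                                  "percent": data.iloc[:, 48:72]})
# Give the unnamed column dimension a meaningful name
paneldata = paneldata.rename({"date": "date", "dim_1": "hour"})
return paneldata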
After loading the data, syntax like loud.above.to_pandas() seems to work to get back the frame in short order - not too bad when it comes to slicing.
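For anyone who wants to try that round trip without a LoudEvents file, here is a self-contained sketch. The table is synthetic, and relabelling each 24-column slice with hours 0-23 is my assumption about how the three blocks should line up, not something taken from the parser.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up stand-in for the parsed table: 2 dates x 72 columns.
dates = pd.date_range("2016-01-01", periods=2, name="date")
data = pd.DataFrame(np.arange(2 * 72, dtype=float).reshape(2, 72), index=dates)


def hourly(block):
    """Relabel a 24-column slice with hours 0-23 so all variables share one axis."""
    block = block.copy()
    block.columns = pd.Index(range(24), name="hour")
    return block


loud = xr.Dataset({
    "above": hourly(data.iloc[:, 0:24]),
    "all": hourly(data.iloc[:, 24:48]),
    "percent": hourly(data.iloc[:, 48:72]),
})

# Attribute access works for "above" and "percent" ...
above = loud.above.to_pandas()
# ... but "all" collides with the Dataset.all() method, so index by key instead.
everything = loud["all"].to_pandas()
```

One thing to watch for: loud.all resolves to the built-in Dataset.all method rather than the "all" variable, so that one has to be fetched with loud["all"].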
With tables of different sizes, it's not so clear to me that this same idea will work for Metrics. I may need some help understanding what you did. After reading it through a few times today it's still above my comprehension.
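On the question of different-sized tables: an xarray.Dataset does allow each variable to have its own dimensions, so tables of different shapes can sit side by side. A minimal sketch, with variable and dimension names that are purely hypothetical and not the actual Metrics layout:

```python
import numpy as np
import xarray as xr

# Two made-up tables with different shapes; each variable names its own dimensions.
ds = xr.Dataset({
    "hourly": (("date", "hour"), np.zeros((2, 24))),
    "summary": (("date", "stat"), np.ones((2, 3))),
})

# The shared "date" dimension is stored once; "hour" and "stat" stay independent.
hourly_shape = ds["hourly"].shape
summary_shape = ds["summary"].shape
```

Whether this maps cleanly onto Metrics depends on how its tables relate to each other, which this sketch does not try to answer.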
Update: still stuck on implementing Metrics as an xarray.Dataset object.
That said, I'm now able to import soundDB modules (using sys) in a generic Python 3.5 environment (i.e., with no special restrictions on pandas or numpy). It seems like it's no longer possible to install pandas=0.18 via Anaconda, so this has become a necessity for working with NVSPL and such on new computers.
The only thing I needed to change was line 188 in parsers.py: instead of raise_on_error=False the parameter should now be errors='ignore'.
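For reference, a small sketch of what that parameter change looks like, assuming the call in question is astype (the thread doesn't say which method line 188 uses). The sample values are invented.

```python
import pandas as pd

raw = pd.Series(["1.5", "n/a"])

# Older pandas spelling (no longer accepted):
#     raw.astype(float, raise_on_error=False)
# Newer spelling of the same request:
kept = raw.astype(float, errors="ignore")

# When every value converts cleanly, the result is an ordinary float Series.
converted = pd.Series(["1.5", "2"]).astype(float, errors="ignore")
```

With errors="ignore", a failed conversion hands back the original values instead of raising, which is the behaviour raise_on_error=False used to give.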
@gjoseph92 should I submit a pull request for this and the loudevents modification from a few months ago?
pandas 0.19 deprecated Panel4D and PanelND, which was probably a good move, but they were an essential if cumbersome part of some Accessors, especially Metrics.
Metrics, .combine(), and elsewhere need to use xarray instead, or, if it seems too different an interface for users to learn both, figure out a different solution, maybe with MultiIndex.
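A rough sketch of the MultiIndex alternative, staying entirely in pandas: per-site DataFrames are stacked with the site as an outer index level instead of a new dimension. Site names, columns, and values are made up; this is one possible shape for the idea, not a proposal for the actual combine() implementation.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2016-01-01", periods=3, name="date")
frames = {
    "SITE1": pd.DataFrame(np.zeros((3, 2)), index=dates, columns=["a", "b"]),
    "SITE2": pd.DataFrame(np.ones((3, 2)), index=dates, columns=["a", "b"]),
}

# Stack the per-site frames; the dict keys become an outer "site" index level.
combined = pd.concat(frames, names=["site"])

# Pulling one site back out returns a frame shaped like the single-file result.
one_site = combined.xs("SITE2", level="site")
```

The upside is that the return type is always a DataFrame, whether one file or many were loaded; the cost is that anything already 3D per file would need a further index level.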