gjoseph92 / soundDB

Query and load NSNSD acoustic data into Python, minimize pain

Support pandas >= 0.19.0 #5

Closed gjoseph92 closed 3 years ago

gjoseph92 commented 7 years ago

pandas 0.19 deprecated Panel4D and PanelND. That was probably a good move, but they were an essential, if cumbersome, part of some Accessors, especially Metrics.

Metrics, .combine(), and other places need to use xarray instead. Or, if that seems too different an interface for users to learn alongside pandas, we need to figure out a different solution, maybe with MultiIndex.
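For reference, the MultiIndex option could look roughly like the sketch below. It only illustrates the idea with plain pandas. The site names, table shapes, and values are made up, and nothing here is soundDB's actual code.

```python
import numpy as np
import pandas as pd

# Two made-up per-site tables (dates x hours); names and values are illustrative only.
dates = pd.date_range("2016-01-01", periods=3, name="date")
hours = pd.Index(range(24), name="hour")
site_a = pd.DataFrame(np.zeros((3, 24)), index=dates, columns=hours)
site_b = pd.DataFrame(np.ones((3, 24)), index=dates, columns=hours)

# Stack the tables, using the site as an outer index level
# instead of a third dimension.
combined = pd.concat({"SITEA": site_a, "SITEB": site_b}, names=["site"])

# Selecting one site gives back an ordinary 2-D DataFrame.
one_site = combined.loc["SITEB"]
```

The result stays a DataFrame at every size, which avoids a changing return type, at the cost of deeper index levels once the per-file data is already 3-D.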

dbetchkal commented 6 years ago

Took a look at xarray tonight and I'm wondering how you think it might change the interface?

gjoseph92 commented 6 years ago

Well, it wouldn't change the soundDB interface at all. But it would mean that the return type would change depending on what kind of data you asked for, and how much, which is pretty confusing.

For example:

>>> soundDB.loudevents(ds).combine()
    # returns a 4-D xarray
>>> soundDB.loudevents(ds).loc["all", :, :].combine()
    # returns a pandas Panel

Or more importantly, it makes experimenting on a single entry and then scaling up a challenge, since the whole data structure changes:

>>> soundDB.loudevents(ds, n=1).combine()
    # returns a pandas Panel
>>> soundDB.loudevents(ds).combine()
    # returns a 4-D xarray

This is an issue because Accessor.combine() combines the data from multiple files into a single structure in a logical way, by putting them into the next-higher-dimensional data structure. For example, SRCID is 2D (a DataFrame), so for multiple SRCIDs (covering the same dates), it might add another dimension for site, resulting in a 3D Panel. When a parser returns a 3D type from a single file, you'd need to combine them into 4D... which puts you in the realm of xarray.
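A minimal sketch of that "next-higher dimension" step using xarray, assuming two made-up sites whose tables share the same dates and hours. The site names, the dimension names "date", "hour" and "site", and the values are all hypothetical, and this is not how Accessor.combine() is actually written.

```python
import numpy as np
import pandas as pd
import xarray as xr

dates = pd.date_range("2016-01-01", periods=3)
hours = np.arange(24)

# One 2-D table per site (illustrative values only).
per_site = {
    "SITEA": pd.DataFrame(np.zeros((3, 24)), index=dates, columns=hours),
    "SITEB": pd.DataFrame(np.ones((3, 24)), index=dates, columns=hours),
}

# Wrap each DataFrame as a 2-D DataArray with explicit dimension names.
arrays = [
    xr.DataArray(
        df.values,
        dims=("date", "hour"),
        coords={"date": df.index, "hour": df.columns},
    )
    for df in per_site.values()
]

# Concatenating along a new "site" dimension gives a 3-D result.
stacked = xr.concat(arrays, dim=pd.Index(list(per_site), name="site"))

# Pulling one site back out returns to a 2-D pandas DataFrame.
site_b = stacked.sel(site="SITEB").to_pandas()
```

The same concat call applied to per-file arrays that are already 3-D would produce a 4-D array, which is the case Panel4D used to cover.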

But let's be realistic: this is a non-issue for everything but LoudEvents and Metrics, since they're the only parsers that actually return Panels. (And Metrics sort of doesn't count, because it's so complicated it can't be combined across sites anyway, though it does rely on Panel4D.) Everything else would continue to work as normal with pandas >= 0.19.0.

So, you could drop support for Metrics and LoudEvents and eliminate the issue entirely (since those are the only types that return a >2D structure). Or use xarray instead of pandas.Panel4D in Accessor.combine(): since those parsers are so rarely used anyway, the unexpected return type won't be an issue in normal use.

dbetchkal commented 5 years ago

> Or use xarray instead of pandas.Panel4D in Accessor.combine(): since those parsers are so rarely used anyway, the unexpected return type won't be an issue in normal use.

I think this is how I'd prefer to approach the issue. Keep supporting the files, and for Metrics and LoudEvents simply deal with the ugly difference in data structure between single and multiple files.

Or would it be possible, instead of a pandas Panel, to have Metrics and LoudEvents in their singular form just return a 2D xarray object? Users (for these files) would then turn them into pandas DataFrames as needed.

gjoseph92 commented 5 years ago

> have Metrics and LoudEvents in their singular form just return a 2D xarray object?

Yes, this is the right answer. Having used xarray a bit now, I can say it's pretty good.

For LoudEvents, I think you can just pass paneldata to xarray.DataArray; might need to fix up the dimension names and coordinates.
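Roughly what that could look like, as a sketch only. The array below is a stand-in for whatever the parser builds, and the dimension names ("measure", "date", "hour") and the variable name are assumptions, not soundDB's actual names.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for Panel-like data: 3 items x 2 dates x 24 hours (made-up values).
values = np.arange(3 * 2 * 24, dtype=float).reshape(3, 2, 24)

loud = xr.DataArray(
    values,
    dims=("measure", "date", "hour"),  # hypothetical dimension names
    coords={
        "measure": ["above", "all", "percent"],
        "date": pd.date_range("2016-01-01", periods=2),
        "hour": np.arange(24),
    },
    name="loudevents",
)

# Selecting one item drops back to 2-D, which converts to a DataFrame.
above = loud.sel(measure="above").to_pandas()
```

Naming the dimensions and coordinates up front is the "fix up" step: without them, xarray falls back to generic names such as dim_0 and dim_1.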

Could be a little more work for Metrics, but might result in a cleaner implementation anyway.

dbetchkal commented 4 years ago

Well, I finally got around to implementing things today. Since pd.Panel is now also deprecated, I skipped it entirely and used xarray.Dataset. I've got loudevents working, if not elegantly.

paneldata = xr.Dataset(data_vars={"above": data.iloc[:, 0:24],
                                  "all": data.iloc[:, 24:48],
                                  "percent": data.iloc[:, 48:72]})

paneldata = paneldata.rename({"date": "date", "dim_1": "hour"})

return paneldata

After loading the data, syntax like loud.above.to_pandas() seems to work to get back the frame in short order - not too bad when it comes to slicing.
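For anyone following along, here is a self-contained sketch of that round trip. It builds the Dataset from plain arrays with explicit ("date", "hour") dimensions rather than from DataFrame slices, so it is a simplified stand-in for the parser code above, with made-up values.

```python
import numpy as np
import pandas as pd
import xarray as xr

dates = pd.date_range("2016-01-01", periods=2)
hours = np.arange(24)

# Three same-shaped tables stored as named variables (illustrative values).
loud = xr.Dataset(
    data_vars={
        "above": (("date", "hour"), np.zeros((2, 24))),
        "all": (("date", "hour"), np.ones((2, 24))),
        "percent": (("date", "hour"), np.full((2, 24), 50.0)),
    },
    coords={"date": dates, "hour": hours},
)

# Each variable converts back to a regular DataFrame (dates x hours).
above_frame = loud.above.to_pandas()

# Label-based slicing works on the Dataset before converting.
morning = loud["percent"].sel(hour=slice(6, 11)).to_pandas()
```

Note that "all" is used with item access in some cases because attribute access such as loud.all would collide with the Dataset.all method.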

With tables of different sizes, it's not clear to me that this same idea will work for Metrics. I may need some help understanding what you did. After reading it through a few times today, it's still beyond my comprehension.

dbetchkal commented 4 years ago

Update: still stuck on implementing Metrics as an xarray.Dataset object.


That said, I'm now able to import soundDB modules (using sys) in a generic Python 3.5 environment (i.e., with no special restrictions on pandas or numpy). It seems it's no longer possible to install pandas=0.18 via Anaconda, so this has become a necessity for working with NVSPL and such on new computers.

The only thing I needed to change was line 188 in parsers.py: instead of raise_on_error=False the parameter should now be errors='ignore'.
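To illustrate the parameter change, here is a sketch that assumes the call in question is an astype conversion. The thread doesn't say which function line 188 calls, so treat the function and the data as placeholders.

```python
import pandas as pd

raw = pd.Series(["1", "2", "n/a"])

# Older pandas spelled this: raw.astype(float, raise_on_error=False)
# With errors="ignore", a failed conversion returns the data unchanged
# instead of raising.
unchanged = raw.astype(float, errors="ignore")

# When every value converts cleanly, the result is numeric as usual.
clean = pd.Series(["1", "2"]).astype(float, errors="ignore")
```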


@gjoseph92 should I submit a pull request for this and the loudevents modification from a few months ago?