mroberge opened 8 years ago
I should also modify get_usgs to return data as a Series instead of a DataFrame. The 'name' attribute should be able to act as a key (dv01585200 or iv01585200, for example) to match with filenames if the data gets saved, to match up with metadata that might get collected, or to match up with a baseflow time series for the same site.
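For example, something like the following sketch shows how a named Series could act as that key (`usgs_to_series` is a hypothetical helper for illustration, not hydropy's actual get_usgs):

```python
import pandas as pd

# Minimal sketch: build a Series whose ``name`` combines the service code
# ("dv"/"iv") with the site number, so the name can double as a key for
# filenames, metadata lookups, or a matching baseflow series.
def usgs_to_series(dates, values, site="01585200", service="dv"):
    return pd.Series(values, index=pd.DatetimeIndex(dates), name=service + site)

flow = usgs_to_series(["2017-01-01", "2017-01-02"], [3.2, 2.9])
print(flow.name)                  # 'dv01585200'
filename = flow.name + ".json"    # the name acts as a storage/lookup key
```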
Would it be an option to work with DataFrames as the base for the hydropy datatype (currently the .data attribute of the HydroAnalysis class is a DataFrame)?
Hence, a DataFrame as the .data attribute provides the option (but also the difficulty) of deriving information from multiple gauge time series at the same time (e.g. get_peaks should provide the peaks for all of them).
Or should we opt for a Series as the base, making all the methods work for a single gauge time series, and make a separate class for the 'multiple-gauges' option?
This is a good question! It never occurred to me to use the Series, so when I saw your usage of it, I thought it was such a good idea that I immediately wanted to use it too.
I've also been thinking that it would be nice to have a data format that allowed you to save the baseflow time series along with the discharge time series. But that leads to the question: should the discharge and baseflow each make a column in a DataFrame, with each additional site adding another DataFrame to create a Panel? Or should the baseflow go into the Panel, and each site be a new column in a DataFrame?
I think to answer this we would have to try each and see which is easiest to implement and which is easiest for the user. I have started playing around with this idea; if I can find some time, I will post a branch as soon as I can.... ⌚️ ☹️
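To make the two layouts concrete, here is a rough sketch of what each could look like, using a plain dict of DataFrames instead of a Panel (the site numbers and values are made up):

```python
import pandas as pd

idx = pd.date_range("2017-01-01", periods=3, freq="D")

# Layout A: one DataFrame per site; discharge and baseflow are columns.
per_site = {
    "01585200": pd.DataFrame({"discharge": [3.2, 2.9, 2.7],
                              "baseflow":  [2.0, 2.0, 1.9]}, index=idx),
    "01581500": pd.DataFrame({"discharge": [5.1, 4.8, 4.6],
                              "baseflow":  [3.3, 3.2, 3.2]}, index=idx),
}

# Layout B: one DataFrame per variable; each site is a column.
per_variable = {
    "discharge": pd.DataFrame({"01585200": [3.2, 2.9, 2.7],
                               "01581500": [5.1, 4.8, 4.6]}, index=idx),
    "baseflow":  pd.DataFrame({"01585200": [2.0, 2.0, 1.9],
                               "01581500": [3.3, 3.2, 3.2]}, index=idx),
}
```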
I found some time to work on this some more. Here is the solution I am trying: Analysis is a new class that holds multiple sites. I do this in two ways (we can choose which is best, or keep them combined).
Some first thoughts on this, but I think it needs to distill a bit more:
A get_baseflow function operating on HydroAnalysis when you can actually store the baseflow as well seems strange (this should be expert knowledge?). Hence, the question is how far this 'agnostic' idea would work; as the aim is to enable domain knowledge in this package, the datatypes we develop should probably be specifically designed for discharge (or another datatype specific for rain). As a first braindump, I would think of the following setup:
HydroAnalysis -> the general 'base' class (on which the others are built), containing a pd.DataFrame inside the data attribute of the class and different metadata/bookkeeping/other attributes, cfr. the current HydroAnalysis. The most important rule would be to keep a single type of variable in one HydroAnalysis instance (either flow, rain, ...). The idea is to keep this base class agnostic of the domain (rain, flow), and the functions it provides would be useful to ALL types of data. I'm thinking of the current subset selections like peaks, seasons, other time fragments, ... The different columns would, in my opinion, still be different station identifiers, but I'm very open to discussion on this point. The reason I'm thinking of this is just experience from previous projects: I collect time series of a variable for a certain period from different stations in a DataFrame, and I want to perform the same types of analysis (peaks, baseflow, ...) on each of them (each column). By structuring it like this, we gain the full Pandas power. On the other hand, when there is only one station, everything keeps working; it is just a single column. I could be wrong, but I think the support for Panels is decreasing in Pandas. Additionally, we could keep the station names in the class as an attribute to make this clearer, and it would be possible for the user to drag along other data (e.g. metadata on the measurement conditions) that is not affected by the analysis but can be used by the user for ad-hoc/custom operations he/she wants to perform (keeping user freedom beyond package functions).

In short:

- HydroAnalysis: base class, contains one variable type, columns represent stations.
- FlowAnalysis: a class that inherits from HydroAnalysis but adds methods that are only relevant for flow (discharge). So, apart from the general stuff from HydroAnalysis, flow functions are added here and also additional attributes, e.g. this class contains an additional DataFrame (with the same station name headers) containing the baseflow.
- RainAnalysis: cfr. FlowAnalysis, but specifically for rain, ...
- *Anotherthing*Analysis: ...
- Analysis: an analysis typically contains multiple types of variables (flow, rain, chemical variables, ...). So inside the Analysis, the different variables are stored and managed, plus the interrelation of these stations. In other words, it aggregates FlowAnalysis, RainAnalysis, ... Some methods will need both flow and rain, and these could be specified at the Analysis level.
As I said, this is more of an opinion for the moment; I'm looking forward to feedback on it. When doing it like this, we could for now focus on HydroAnalysis and FlowAnalysis and create sufficient methods for these.
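A rough sketch of how that hierarchy could look in code (the method bodies are placeholders, and any name not mentioned above is an assumption, not an existing hydropy API):

```python
import pandas as pd

class HydroAnalysis(object):
    """Domain-agnostic base class: one variable type, columns are station ids."""
    def __init__(self, data, metadata=None):
        self.data = data                        # pd.DataFrame, columns = stations
        self.stations = list(data.columns)
        self.metadata = metadata or {}          # user data, untouched by methods

    def peaks(self, window="7D"):
        # generic subset selection, useful for any variable type
        return self.data.resample(window).max()

class FlowAnalysis(HydroAnalysis):
    """Flow-specific methods plus an extra DataFrame holding the baseflow."""
    def __init__(self, data, metadata=None):
        super(FlowAnalysis, self).__init__(data, metadata)
        self.baseflow = None                    # same station headers as self.data

    def get_baseflow(self):
        # placeholder: a real baseflow separation filter would go here
        self.baseflow = self.data.rolling("5D").min()
        return self.baseflow

class Analysis(object):
    """Aggregates variable-specific analyses and their interrelations."""
    def __init__(self, flow=None, rain=None):
        self.flow = flow                        # a FlowAnalysis
        self.rain = rain                        # a RainAnalysis
```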
An interesting point you make is the daily mean and realtime data as different examples for storing. Actually, I think we should not make these different custom datatypes, as this kind of logic is largely captured by Pandas itself. The frequency is stored inside a DatetimeIndex. Moreover, Pandas makes the distinction between instantaneous Timestamps and Periods (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#overview). So, using DataFrames basically supports both as such. However, some methods won't work on all frequencies and all spans. Hence, I think we should provide this kind of 'logic' in the methods (e.g. some decorators that can be reused to specify and check whether the current time series frequency/characteristics comply with the method requirements).
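A minimal sketch of what such a decorator could look like (assuming the class keeps its DataFrame in a data attribute with a DatetimeIndex; the names are illustrative only):

```python
import functools
import pandas as pd

def requires_frequency(*allowed):
    """Only run the decorated method if the data frequency is one of `allowed`."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            freq = pd.infer_freq(self.data.index)
            if freq not in allowed:
                raise ValueError("%s requires frequency in %s, got %r"
                                 % (method.__name__, allowed, freq))
            return method(self, *args, **kwargs)
        return wrapper
    return decorator

class FlowAnalysis(object):
    def __init__(self, data):
        self.data = data                 # pd.DataFrame with a DatetimeIndex

    @requires_frequency("D")             # e.g. only meaningful for daily means
    def annual_maxima(self):
        return self.data.resample("A").max()
```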
What do you think?
@mroberge, with respect to the usage of Panels: https://github.com/pandas-dev/pandas/issues/13563
Sorry that I've been away... I like your idea. After stepping away for a while and coming back to the problem, I have a much stronger feeling that keeping everything simple is the best approach. My system tried to automate too many things. It is better to trust the user, and let them decide what they want to do with the software.
As you say above, how should we organize the DataFrame? Each column could either be a different variable or a different station. By having each column be a different station, you could perform a baseflow separation on the DataFrame and produce a new baseflow DataFrame where each column corresponds to a station from the original. I like this approach because:
As far as the different Analysis classes (FlowAnalysis and RainAnalysis), perhaps we just keep them all together for simplicity's sake for now. Maybe later we can make different types of analysis available for different types of data, but for now I regret ever trying to save the user from making bad choices.
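As a concrete illustration of the stations-as-columns point above, a small sketch (with a made-up rolling-minimum filter standing in for a real baseflow separation):

```python
import pandas as pd

def simple_baseflow(series):
    # stand-in for a real baseflow filter (e.g. a recursive digital filter)
    return series.rolling(window=5, min_periods=1).min()

def baseflow_separation(flows):
    # apply the filter column-wise; the result keeps the same station headers
    return flows.apply(simple_baseflow)

idx = pd.date_range("2017-01-01", periods=10, freq="D")
flows = pd.DataFrame({"01585200": range(10, 0, -1),
                      "01581500": range(20, 10, -1)}, index=idx)
baseflow = baseflow_separation(flows)    # same columns as `flows`
```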
I just fixed the TimeSeries issue, but before I close it, there is a lot of discussion going on here that should be preserved somehow...
Apparently Pandas has deprecated TimeSeries, but all of the functionality is contained in Series.