stijnvanhoey / hydropy

Analysis of hydrological oriented time series.
https://stijnvanhoey.github.io/hydropy/
BSD 2-Clause "Simplified" License

convert TimeSeries into Series #10

Open mroberge opened 8 years ago

mroberge commented 8 years ago

Apparently Pandas has deprecated TimeSeries, but all of the functionality is contained in Series.

mroberge commented 8 years ago

I should also modify get_usgs to return data as a Series instead of a DataFrame. The 'name' attribute should be able to act as a key (dv01585200 or iv01585200, for example) to match with filenames if the data gets saved, to match up with metadata that might get collected, or to match up with a baseflow time series for the same site.
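
A minimal sketch of how the 'name' attribute could act as such a key. The helper `make_site_series` is a hypothetical stand-in for get_usgs (whose real signature is not shown in this thread), and the discharge values are made up:

```python
import pandas as pd


def make_site_series(site_id, service, values, start="2016-01-01", freq="D"):
    """Hypothetical stand-in for get_usgs: wrap values in a named Series."""
    index = pd.date_range(start=start, periods=len(values), freq=freq)
    # The name combines the service prefix ('dv' or 'iv') and the site number.
    return pd.Series(values, index=index, name=f"{service}{site_id}")


# Made-up daily values for the site mentioned above.
discharge = make_site_series("01585200", "dv", [1.2, 1.5, 1.1])

# The name can key a filename, a metadata lookup, or a matching baseflow series.
filename = f"{discharge.name}.csv"
metadata = {discharge.name: {"description": "placeholder metadata"}}
baseflow = (discharge * 0.5).rename(discharge.name)
```

Because the key travels with the Series itself, it also becomes the column label if several sites are later concatenated into a DataFrame.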

stijnvanhoey commented 8 years ago

Would it be an option to work with DataFrames as the base for the hydropy datatype of the HydroAnalysis class (currently its .data attribute is a DataFrame)?

Hence, a DataFrame as the .data attribute provides the option (but also the difficulty) to derive information from multiple gauge time series at the same time (e.g. get_peaks should provide the peaks for all of them).

Or should we opt for a Series as base, making all the methods work for a single gauge time series and make a separate class for the 'multiple-gauges' option?
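
The trade-off between the two options can be sketched as follows. The function `simple_peaks` is a hypothetical local-maximum finder used only for illustration; it is not hydropy's get_peaks, and the gauge names and values are made up:

```python
import pandas as pd


def simple_peaks(series):
    """Return values strictly larger than both neighbours (illustrative only)."""
    is_peak = (series > series.shift(1)) & (series > series.shift(-1))
    return series[is_peak]


index = pd.date_range("2016-01-01", periods=5, freq="D")
flows = pd.DataFrame(
    {"gauge_a": [1.0, 3.0, 2.0, 4.0, 1.0], "gauge_b": [2.0, 1.0, 5.0, 1.0, 1.0]},
    index=index,
)

# Series as base: the method handles one gauge at a time.
peaks_a = simple_peaks(flows["gauge_a"])

# DataFrame as base: the same logic is applied to every gauge column.
peaks_all = {name: simple_peaks(column) for name, column in flows.items()}
```

Note that the per-gauge results have different lengths, which is part of the difficulty of returning them from a single DataFrame-based method.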

mroberge commented 8 years ago

This is a good question! It never occurred to me to use the Series, so when I saw your usage of it, I thought it was such a good idea that I immediately wanted to use it too.

I've also been thinking that it would be nice to have a data format that allowed you to save the baseflow time series along with the discharge timeseries. But that leads to the question: Should the discharge and baseflow each make a column in a dataframe, and then each additional site could add an additional dataframe to create a Panel? Or should the baseflow go into the Panel, and each site is a new column in a dataframe?

I think to answer this we would have to try each, and see which is easiest to implement, and which is easiest for the user. I have started playing around with this idea, and will continue if I can find some time. I will post a branch as soon as I can.... ⌚️ ☹️

mroberge commented 8 years ago

I found some time to work on this some more. Here is the solution I am trying:

Analysis is a new class that holds multiple sites. I do this in two ways (we can choose which is best, or keep them combined):

  1. a list of Stations (see below)
  2. a pandas Panel. Each 'item' is a dataframe that corresponds to a single Station.

Station is a new class that holds data for a single site. To keep different types of data separate, I started with two types of data: dailymean, and realtime (instantaneous values collected every hour or less).

HydroAnalysis is the class that you created to hold a DataFrame. I use this to hold dataframes inside of the Station class.

stijnvanhoey commented 8 years ago

Some first thoughts on this, but I think it needs to distill a bit more:

As a first brain dump, I would think of the following setup:

Some methods will need both flow and rain, and these could be specified on the Analysis level.

As I said, this is more of an opinion for the moment, and I am looking forward to feedback on it. When doing it like this, we could for the moment focus on HydroAnalysis and FlowAnalysis and create sufficient methods for these.

An interesting point you make is dailymean and realtime as different examples for storing. Actually, I do think we should not make these different custom datatypes, as this kind of logic is largely captured by Pandas itself. The frequency is stored inside a DatetimeIndex. Moreover, Pandas makes the distinction between instantaneous Timestamps and Periods (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#overview). So, using DataFrames basically supports both as such. However, some methods won't work on all frequencies and all spans. Hence, I think we should provide this kind of 'logic' to the methods (e.g. some decorators that can be reused to specify and check whether the current time series frequency/characteristics comply with the method requirements).
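
A minimal sketch of what such a reusable decorator could look like. The names `requires_frequency` and `daily_mean_flow` are hypothetical and not part of hydropy; the check assumes the data is indexed by a DatetimeIndex:

```python
import functools

import pandas as pd


def requires_frequency(*allowed):
    """Hypothetical decorator: only run the method for the listed frequencies."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(data, *args, **kwargs):
            # Use the declared frequency if present, otherwise try to infer it.
            freq = data.index.freqstr or pd.infer_freq(data.index)
            if freq not in allowed:
                raise ValueError(
                    f"{func.__name__} requires frequency in {allowed}, got {freq!r}"
                )
            return func(data, *args, **kwargs)
        return wrapper
    return decorator


@requires_frequency("D")
def daily_mean_flow(data):
    """Example method that only makes sense for daily data."""
    return data.mean()
```

With this in place, calling `daily_mean_flow` on a DataFrame with a daily index works as usual, while a sub-daily index raises a `ValueError` before the method body runs.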

What do you think?

stijnvanhoey commented 8 years ago

@mroberge, with respect to the usage of Panels: https://github.com/pandas-dev/pandas/issues/13563
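
Since Panel was later deprecated and removed from pandas, one possible alternative is a single DataFrame with MultiIndex columns that carry both the site and the variable. This is only a sketch; the site labels and values below are made up:

```python
import pandas as pd

index = pd.date_range("2016-01-01", periods=3, freq="D")
columns = pd.MultiIndex.from_product(
    [["site_a", "site_b"], ["discharge", "baseflow"]], names=["site", "variable"]
)
data = pd.DataFrame(
    [[2.0, 1.0, 5.0, 3.0], [3.0, 1.5, 4.0, 2.5], [2.5, 1.2, 6.0, 3.5]],
    index=index,
    columns=columns,
)

# All variables for one site (roughly what a Panel 'item' would have held).
site_a = data["site_a"]

# One variable across all sites.
baseflow = data.xs("baseflow", axis=1, level="variable")
```

Both views discussed earlier (site-first or variable-first) are then available from the same object by selecting on a different column level.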

mroberge commented 7 years ago

Sorry that I've been away... I like your idea. After stepping away for a while and coming back to the problem, I have a much stronger feeling that keeping everything simple is the best approach. My system tried to automate too many things. It is better to trust the user, and let them decide what they want to do with the software.

As you say above, how should we organize the dataframe? Each column could either be a different variable or a different station. By having each column be a different station, you could perform a baseflow separation on the dataframe and produce a new dataframe with baseflow, and each column corresponds to a different station from the original dataframe. I like this approach because:

As far as the different Analysis classes (FlowAnalysis and RainAnalysis), perhaps we just keep them all together for simplicity's sake for now. Maybe later we can make different types of analysis available for different types of data, but for now I regret ever trying to save the user from making bad choices.
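
The station-per-column layout described above can be sketched as follows. The function `rolling_min_baseflow` is a hypothetical placeholder (a centred rolling minimum), not a real baseflow separation method and not part of hydropy; the station names and values are made up:

```python
import pandas as pd


def rolling_min_baseflow(flows, window=3):
    """Placeholder 'baseflow': centred rolling minimum of each station column."""
    return flows.rolling(window=window, center=True, min_periods=1).min()


index = pd.date_range("2016-01-01", periods=5, freq="D")
flows = pd.DataFrame(
    {"station_a": [2.0, 5.0, 3.0, 4.0, 2.5], "station_b": [1.0, 1.5, 6.0, 2.0, 1.2]},
    index=index,
)

# The result has the same index and the same station columns as the input.
baseflow = rolling_min_baseflow(flows)
```

Because the output mirrors the input's columns, discharge and baseflow for a given station can be matched by column name alone.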

mroberge commented 7 years ago

I just fixed the TimeSeries issue, but before I close it, there is a lot of discussion going on here that should be preserved somehow...