mroberge opened 8 years ago
I should also modify get_usgs to return data as a Series instead of a DataFrame. The 'name' attribute should be able to act as a key (dv01585200 or iv01585200, for example) to match with filenames if the data gets saved, to match up with metadata that might get collected, or to match up with a baseflow time series for the same site.
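For example, something like the following sketch shows how a named Series could act as that key (`usgs_to_series` is a hypothetical helper for illustration, not hydropy's actual get_usgs):

```python
import pandas as pd

# Minimal sketch: build a Series whose ``name`` combines the service code
# ("dv"/"iv") with the site number, so the name can double as a key for
# filenames, metadata lookups, or a matching baseflow series.
def usgs_to_series(dates, values, site="01585200", service="dv"):
    return pd.Series(values, index=pd.DatetimeIndex(dates), name=service + site)

flow = usgs_to_series(["2017-01-01", "2017-01-02"], [3.2, 2.9])
print(flow.name)                  # 'dv01585200'
filename = flow.name + ".json"    # the name acts as a storage/lookup key
```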
Would it be an option to work with DataFrames as the base for the hydropy datatype (currently the .data attribute of the HydroAnalysis class is a DataFrame)?
Hence, a DataFrame as the .data attribute provides the option (but also the difficulty) of deriving information from multiple gauge time series at the same time (e.g. get_peaks should provide the peaks for all of them).
Or should we opt for a Series as the base, making all the methods work for a single gauge time series, and make a separate class for the 'multiple-gauges' option?
This is a good question! It never occurred to me to use the Series, so when I saw your usage of it, I thought it was such a good idea that I immediately wanted to use it too.
I've also been thinking that it would be nice to have a data format that allowed you to save the baseflow time series along with the discharge time series. But that leads to the question: should the discharge and baseflow each make a column in a DataFrame, with each additional site adding another DataFrame to create a Panel? Or should the baseflow go into the Panel, and each site be a new column in a DataFrame?
I think to answer this we would have to try each and see which is easiest to implement and which is easiest for the user. I have started playing around with this idea; if I can find some time, I will post a branch as soon as I can.... ⌚️ ☹️
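To make the two layouts concrete, here is a rough sketch of what each could look like, using a plain dict of DataFrames instead of a Panel (the site numbers and values are made up):

```python
import pandas as pd

idx = pd.date_range("2017-01-01", periods=3, freq="D")

# Layout A: one DataFrame per site; discharge and baseflow are columns.
per_site = {
    "01585200": pd.DataFrame({"discharge": [3.2, 2.9, 2.7],
                              "baseflow":  [2.0, 2.0, 1.9]}, index=idx),
    "01581500": pd.DataFrame({"discharge": [5.1, 4.8, 4.6],
                              "baseflow":  [3.3, 3.2, 3.2]}, index=idx),
}

# Layout B: one DataFrame per variable; each site is a column.
per_variable = {
    "discharge": pd.DataFrame({"01585200": [3.2, 2.9, 2.7],
                               "01581500": [5.1, 4.8, 4.6]}, index=idx),
    "baseflow":  pd.DataFrame({"01585200": [2.0, 2.0, 1.9],
                               "01581500": [3.3, 3.2, 3.2]}, index=idx),
}
```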
I found some time to work on this some more. Here is the solution I am trying: Analysis is a new class that holds multiple sites. I do this in two ways (we can choose which is best, or keep them combined).
Some first thoughts on this, but I think it needs to distill a bit more:
A get_baseflow function operating on HydroAnalysis when you can actually store the baseflow as well seems strange (this should be expert knowledge?). Hence, the question is how far this 'agnostic' idea would work; as the aim is to enable domain knowledge in this package, the datatypes we develop should probably be specifically designed for discharge (or another datatype specific for rain). As a first braindump, I would think of the following setup:
HydroAnalysis -> the general 'base' class (on which the others are built), containing a pd.DataFrame inside the data attribute of the class and different metadata/bookkeeping/other attributes, cfr. the current HydroAnalysis. The most important rule would be to keep a single type of variable in one HydroAnalysis instance (either flow, rain, ...). The idea is to keep this base class agnostic of the domain (rain, flow), and the functions it provides would be useful to ALL types of data. I'm thinking of the current subset selections like peaks, seasons, other time fragments, ... The different columns would, in my opinion, still be different station identifiers, but I'm very open to discussion on this point. The reason I'm thinking of this is just experience from previous projects: I collect time series of a variable for a certain period from different stations in a DataFrame, and I want to perform the same types of analysis (peaks, baseflow, ...) on each of them (each column). By structuring it like this, we gain the full Pandas power. On the other hand, when there is only one station, everything keeps working; it is just a single column. I could be wrong, but I think the support for Panels is decreasing in Pandas. Additionally, we could keep the station names in the class as an attribute to make this clearer, and it would be possible for the user to drag along other data (e.g. metadata on the measurement conditions) that is not affected by the analysis but can be used by the user for ad-hoc/custom operations he/she wants to perform (keeping user freedom beyond package functions).

In short:

- HydroAnalysis: base class, contains one variable type, columns represent stations.
- FlowAnalysis: a class that inherits from HydroAnalysis but adds methods that are only relevant for flow (discharge). So, apart from the general stuff from HydroAnalysis, flow functions are added here and also additional attributes, e.g. this class contains an additional DataFrame (with the same station name headers) containing the baseflow.
- RainAnalysis: cfr. FlowAnalysis, but specifically for rain, ...
- *Anotherthing*Analysis: ...
- Analysis: an analysis typically contains multiple types of variables (flow, rain, chemical variables, ...). So inside the Analysis, the different variables are stored and managed, plus the interrelation of these stations. In other words, it aggregates FlowAnalysis, RainAnalysis, ... Some methods will need both flow and rain, and these could be specified at the Analysis level.
As I said, this is more of an opinion for the moment; I'm looking forward to feedback on it. When doing it like this, we could for now focus on HydroAnalysis and FlowAnalysis and create sufficient methods for these.
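A rough sketch of how that hierarchy could look in code (the method bodies are placeholders, and any name not mentioned above is an assumption, not an existing hydropy API):

```python
import pandas as pd

class HydroAnalysis(object):
    """Domain-agnostic base class: one variable type, columns are station ids."""
    def __init__(self, data, metadata=None):
        self.data = data                        # pd.DataFrame, columns = stations
        self.stations = list(data.columns)
        self.metadata = metadata or {}          # user data, untouched by methods

    def peaks(self, window="7D"):
        # generic subset selection, useful for any variable type
        return self.data.resample(window).max()

class FlowAnalysis(HydroAnalysis):
    """Flow-specific methods plus an extra DataFrame holding the baseflow."""
    def __init__(self, data, metadata=None):
        super(FlowAnalysis, self).__init__(data, metadata)
        self.baseflow = None                    # same station headers as self.data

    def get_baseflow(self):
        # placeholder: a real baseflow separation filter would go here
        self.baseflow = self.data.rolling("5D").min()
        return self.baseflow

class Analysis(object):
    """Aggregates variable-specific analyses and their interrelations."""
    def __init__(self, flow=None, rain=None):
        self.flow = flow                        # a FlowAnalysis
        self.rain = rain                        # a RainAnalysis
```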
An interesting point you make is the daily mean and realtime data as different examples for storing. Actually, I think we should not make these different custom datatypes, as this kind of logic is largely captured by Pandas itself. The frequency is stored inside a DatetimeIndex. Moreover, Pandas makes the distinction between instantaneous Timestamps and Periods (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#overview). So, using DataFrames basically supports both as such. However, some methods won't work on all frequencies and all spans. Hence, I think we should provide this kind of 'logic' in the methods (e.g. some decorators that can be reused to specify and check whether the current time series frequency/characteristics comply with the method requirements).
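A minimal sketch of what such a decorator could look like (assuming the class keeps its DataFrame in a data attribute with a DatetimeIndex; the names are illustrative only):

```python
import functools
import pandas as pd

def requires_frequency(*allowed):
    """Only run the decorated method if the data frequency is one of `allowed`."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            freq = pd.infer_freq(self.data.index)
            if freq not in allowed:
                raise ValueError("%s requires frequency in %s, got %r"
                                 % (method.__name__, allowed, freq))
            return method(self, *args, **kwargs)
        return wrapper
    return decorator

class FlowAnalysis(object):
    def __init__(self, data):
        self.data = data                 # pd.DataFrame with a DatetimeIndex

    @requires_frequency("D")             # e.g. only meaningful for daily means
    def annual_maxima(self):
        return self.data.resample("A").max()
```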
What do you think?
@mroberge, with respect to the usage of Panels: https://github.com/pandas-dev/pandas/issues/13563
Sorry that I've been away... I like your idea. After stepping away for a while and coming back to the problem, I have a much stronger feeling that keeping everything simple is the best approach. My system tried to automate too many things. It is better to trust the user, and let them decide what they want to do with the software.
As you say above, how should we organize the DataFrame? Each column could either be a different variable or a different station. By having each column be a different station, you could perform a baseflow separation on the DataFrame and produce a new baseflow DataFrame where each column corresponds to a station from the original. I like this approach because:
As far as the different Analysis classes (FlowAnalysis and RainAnalysis), perhaps we just keep them all together for simplicity's sake for now. Maybe later we can make different types of analysis available for different types of data, but for now I regret ever trying to save the user from making bad choices.
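As a concrete illustration of the stations-as-columns point above, a small sketch (with a made-up rolling-minimum filter standing in for a real baseflow separation):

```python
import pandas as pd

def simple_baseflow(series):
    # stand-in for a real baseflow filter (e.g. a recursive digital filter)
    return series.rolling(window=5, min_periods=1).min()

def baseflow_separation(flows):
    # apply the filter column-wise; the result keeps the same station headers
    return flows.apply(simple_baseflow)

idx = pd.date_range("2017-01-01", periods=10, freq="D")
flows = pd.DataFrame({"01585200": range(10, 0, -1),
                      "01581500": range(20, 10, -1)}, index=idx)
baseflow = baseflow_separation(flows)    # same columns as `flows`
```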
I just fixed the TimeSeries issue, but before I close it, there is a lot of discussion going on here that should be preserved somehow...
Apparently Pandas has deprecated TimeSeries, but all of the functionality is contained in Series.