Create a new data structure

mroberge commented 6 years ago

The current data format is to put everything into a single large dataframe, with each station getting two columns, one for discharge, one for data flags. Odd columns contain data from different stations, Even columns contain the corresponding data flags for that site.

The PROBLEM with adding all of these dataframes to a single large dataframe using pd.concat is that sites that collect less frequently will get padded with NANs for all of the time indecies that they don't have data for.

A second problem is that every other column will have data, and the the other columns will have flags. There is no simple way to select only the data columns except to take the odd numbered columns.

A POSSIBLE SOLUTION: create a data structure that is composed of stacked dataframes. Each data frame will correspond to a single site. The first column will correspond to discharge, the second to flags, and any others can be derived values like baseflow or other measured parameters. The dataframes will be stacked, and be part of an object that allows you to select by a range of dates, by sites, and by the type of column. In this respect, it might be similar to XArray, except that package requires their n-dimensional structures to all be the same datatype.

hydrofunctions.py extract_nwis_df(response_obj): is where the USGS json gets processed. Correcting this would be a relatively simple fix: you would simply duplicate this function and have it collect all of the dataframes into an array instead of doing a pd.concat() at the end with each new dataframe.

mroberge commented 6 years ago

This is related to Issue #18.

jdhughes-usgs commented 6 years ago

@mroberge How about returning a dictionary of dataframes that just store the actual data (no NaNs)? This may make implementation of the baseflow separation methods easier than it would be with the current dataframes.

Maybe the dataframe should just store the data (no flags) and have a method to return the flags for a column(s).

This is something I can work on over the next few weeks.

mroberge commented 5 years ago

My latest idea: Keep the current structure, but improve the methods for accessing subset of the data.

The current structure:

everything in a single giant dataframe
rows are a time index
columns have long, complicated names that include data for different sites, different parameters, and contain data as well as quality flags.

The advantages of this structure is that it can be saved as a parquet file, which allows for high compression, large datasets, and fast computation. Also, this structure does a great job storing everything from the NWIS.

The disadvantages are that you can't use some of the convenience functions that Pandas provides. For example, you can't take the whole dataframe and go: my_dataframe.plot() and produce anything useful.

New idea for a solution: Think of the data as a four-dimensional data structure: time (stored as rows) and three dimensions that are stored as columns: site, parameter, and content. Each table can have multiple sites; each site can have multiple parameters, and every parameter has two types of content: the actual measurement and the data quality flag.

Most analyses only use two dimensions, so a flat, two-dimensional dataframe is convenient. Time always goes into the rows, but we put different things into the columns depending on what we want to do. Three examples:

To do an analysis of a flood moving down a river, you might want a table of stream discharge data, where each column contained the data for a different site.
To analyze ground water recharge, you might want a table of data for a single site, where each column represents the data for a different parameter.
To analyze the quality of a dataset, you might want a table of a single parameter at a single site, where each column represents the data and the quality flags.

My idea for simplifying access

Use tuples and a multiindex to access the columns that you want.
To use a subset of your full table, you would specify: time, site, parameter, and content.
an example 'slice' of the full dataset:
- my_full_sized_dataframe.loc('2016-01-01':'2017-05-01', (['01582500', '01581000'], 'discharge', 'data'))
- This pulls out a slice of data that includes 17 months of observations from the full set
- the slice includes two sites (01582500 and 01581000)
- it only grabs the discharge parameter
- it requests the data, not the flags.
- this returns a simple dataframe with two columns, each representing a different site.

Most people wouldn't want to work with a full, complicated dataframe for their analysis, but it is still important to save all of the data quality flags, and to have the flexibility to work with more than one site and more than one parameter. So, the NWIS object will store all of this data internally as a multiindex dataframe, and users request a slice of what they want using either the complicated multiindex directly, or they can use some convenience methods built into NWIS.

For example, NWIS.df('discharge') could output a dataframe of discharge data that defaults to include all of the rows, all of the sites, but with no flags. NWIS.df(site='01581000', parameter= ['discharge', 'stage']) would slice the full table down to data for two parameters at a single site.

jdhughes-usgs commented 5 years ago

@mroberge have you started on this yet?

mroberge commented 5 years ago

No, unfortunately.

jdhughes-usgs commented 5 years ago

@mroberge I will take a stab at starting this tomorrow. Will let you know how it goes.

mroberge commented 5 years ago

One task that is related to this is that the parser that you wrote needs to be tested. It does such a complex job of going through the JSON and processing all of those different nested keys... I guess I've been procrastinating!

jdhughes-usgs commented 5 years ago

@mroberge have you tested the slicing syntax that you suggested above (my_full_sized_dataframe.loc('2016-01-01':'2017-05-01', (['01582500', '01581000'], 'discharge', 'data'))?

I was able to get something to work if I include the station information as an index but not as a column in the MultiIndex dataframe. Is this what you were thinking of?

One concern I have with the MultiIndex dataframe is the complex syntax required to work with the data. For example, if the current name for a dataset (for example, 'USGS:01646502:00060:00003') is used as the station name then to access a single station you need something like:

df.loc[('USGS:01646502:00060:00003', slice('2018-01-06', '2018-01-10')), 'value']

to get the data.

mroberge commented 5 years ago

@jdhughes-usgs, Sorry it took me so long to respond.

I created a notebook that plays around with using a multi-index. I could email it to you, or maybe set up a branch and post it there.

One thing that occurs to me is that this data structure would just be internal to the NWIS object, and people could interact with the NWIS object to get the dataframe they want. So, for example, you load up with a big request from the USGS, then you want to run a quick function to count how many 'Provisional' flags there are, so you request a subset dataframe that just includes the meta data. Then you want to plot the discharge over time for two of the sites you requested, so you request that dataframe.

Because users are interacting with the NWIS object and not the giant, complex dataframe, we can write little convenience functions that make slicing easier. These would make a few assumptions about the request so you don't have to spell it out every time. For example, only provide the metadata if the user specifically asks for it. Always provide all of the sites if no sites are specified. If there is only one parameter, then you don't have to specify which parameter you want.... that sort of thing.

I originally thought that a multiindex would make slicing easier, because you wouldn't have to specify every parameter every time, but it is not so simple. You can leave out some parameters some of the time...

An alternative would be to just make three rows of column names and use matching. One row would have the site number: '01646502'; one site would have the measured parameter: '00060', and one would state the content type: 'data' or 'meta'.

About the station names: I got this idea from the really long column names that you currently have set up. If you think about it, these names can be broken down at the hyphen into different useful parts. The actual station IDs are just eight characters long usually, then you have the parameter that is being measured at the site, and then you have the statistic that is being reported, like the raw data, or the daily mean.

mroberge commented 5 years ago

One solution that I am starting to favor is to add some arguments to the NWIS.df() method. If you simply call .df() with nothing, then you'll get exactly the same dataframe that you currently get. However, if you provide certain arguments, then .df() will slice the data and provide it in a nice format.

Right now, the NWIS has about four dimensions for all of the data it serves:

the time slice, defined by the start and stop time.
the sites
the parameterCd like discharge, stage, groundwater height, etc.
the type (I don't know what to call this) the column either contains flags or data.
I guess the NWIS also provides different statistics, so maybe this is the fifth dimension.

If you ignore the statistics dimension for now, and assume that the user wants all of the time data that they requested, then you have three dimensions to choose from: sites, parameter, & type.

Users could specify two out of the three dimensions, and the third dimension will create the columns of the new dataframe. And we could further simplify this by assuming that the user only wants the flags if they ask for it.

my_data = hf.NWIS( ['01541000', '01541200'], period='P30D').get_data()

This will return a dataframe with 8 columns: two sites x two parameters (stage & discharge) x data & flags my_data.df()

This will return a dataframe with two columns: the discharge at the two sites. my_data.df('00065')

This will return a dataframe with two columns: all of the parameters for site '01541200' my_data.df('01541200')

specify 'flags' if you want to see the metadata. my_data.df('flags') # I guess this would provide everything: four columns of qualifiers (two sites x two parameters)

my_data.df('01541000', '00060', 'all') # This would provide the data column and the accompanying flag column

mroberge commented 5 years ago

I think this issue is close to being put to rest.

The new solution is to store the complete dataframe internally, as NWIS._dataframe. You can then request all of the discharge data by asking for Q_df = myNWISobj.df('discharge'). Or, if you have the data for site #01581200, you can put all of it into a dataframe like this: myNWISobj.df('01581200').

You can use more than one parameter too. To see the qualifier flags for the stage data at site 01592222, do this: myNWISobj.df('00065', 'flags', '01592222'). You can enter the arguments in any order. The qualifier flags only get returned if you ask for them.

See the feature-df-select branch for progress on this issue.

mroberge commented 5 years ago

@jdhughes-usgs I've made some progress!

The original problem was that if you request data that gets collected every hour along with requesting data that gets collected every 15 minutes, you'll force the hourly dataset to have the 15 minute index.

My solution was to add a parameter that allows you to interpolate for the new values by default, or add NaN if you don't want interpolation. Either way, a new qualifier flag is added so that you know that a value is interpolated. Also, a warning occurs when a dataset is 'upsampled'.

Closed with pull request #44

mroberge / hydrofunctions

Create a new data structure #26