lter / Clim-HydroDB-2.0

Material related to converting the original climHydroDB into CUAHSI ODM

support multiple datavalues tables #24

Open twhiteaker opened 2 years ago

twhiteaker commented 2 years ago

You could adopt a naming convention, like DataValues_X.csv, where X is however you want to split your tables, e.g., DataValues_pH.csv and DataValues_Temperature.csv, or DataValues_2000s.csv and DataValues_2010s.csv. You'll also need a DataTable column added to SeriesCatalog to store the name of the DataValues table that a given time series can be found in. You are not allowed to split a series across DataValues tables.
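
For illustration, here is a minimal sketch in base R of what that split might look like, assuming ODM-style column names (e.g., VariableCode) and CSV inputs; the actual hymetDP table structure may differ:

```r
# Split DataValues.csv by variable and record each series' home table
# in a new DataTable column of SeriesCatalog (column names assumed).
data_values    <- read.csv("DataValues.csv")
series_catalog <- read.csv("SeriesCatalog.csv")

for (var_code in unique(data_values$VariableCode)) {
  out_name <- paste0("DataValues_", var_code, ".csv")
  write.csv(data_values[data_values$VariableCode == var_code, ],
            out_name, row.names = FALSE)
}

# A series has exactly one VariableCode, so it can never span two files.
series_catalog$DataTable <- paste0("DataValues_",
                                   series_catalog$VariableCode, ".csv")
write.csv(series_catalog, "SeriesCatalog.csv", row.names = FALSE)
```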

twhiteaker commented 2 years ago

We probably need to restrict how things can be split into separate DataValues tables to prevent a mess. In ODM terms, a series is the set of values with the same site, variable, quality control level, source, and method. So, we could recommend using any of those five axes to determine how to split data. The only other way I think we should allow is by time range. Users can define the time range, but the idea is to store temporally contiguous, non-overlapping chunks of data in the data tables.

As an example of what else COULD be done not what I think SHOULD be done, one could split by data qualifier, so that all "good" data are one table and all data flagged as "possible sensor malfunction" go in another table. You could make similar cases for offsets, censor codes, etc. I think there isn't a lot of benefit to the user with this, and it would make code for generating series catalogs, or taking what we do and creating truly ODM CSV compliant tables, more complicated.

kzollove commented 2 years ago

I think we should define what users CAN do (fits in our framework and should be supported in https://github.com/EDIorg/hymetDP) and what they SHOULD do (best practices).

I vote we should allow splits along the five core axes (site, variable, quality control level, source, and method). Further, we should suggest a preferred order: first by Site, then by Variable, and finally by Method. I would suggest not splitting across QC Level or Source unless variation along those axes is very prevalent.

While supporting some of the optional tables may make sense, I think we should draw the line at core axes for now and reconsider later if we need to shrink tables down further.

I like the naming convention proposed above, though I would propose an intentionally vague/general specification: DataValues_{firstAxisAndID}_{nextAxisAndID}. Examples: DataValues_Site1, DataValues_Site2, etc. if split across one axis; DataValues_Site1_Variable1, DataValues_Site1_Variable2, etc. for more than one axis.

Having the generic "Site1" could prove to be more extensible down the line than having descriptive names like DataValues_pH. But there could be some better compromise between descriptive and generic.
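
As a sketch of that generic convention (purely illustrative; the axis labels and IDs here are made up):

```r
# Build a table name from ordered axis/ID pairs, e.g. Site then Variable.
data_table_name <- function(...) {
  parts <- c(...)
  paste0("DataValues_", paste0(names(parts), parts, collapse = "_"))
}

data_table_name(Site = 1)                 # "DataValues_Site1"
data_table_name(Site = 1, Variable = 2)   # "DataValues_Site1_Variable2"
```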

gremau commented 2 years ago

I'm in agreement with all this, and I think Kyle's more generic DataValue table naming convention would work well.

One thing to clarify here is that this isn't going to make the finished datasets any smaller (I don't think); it just chunks the series up into more, smaller pieces. This will make it possible for a user to upload/download more manageable subsets of the data, which is good, but there is still the possibility that these datasets will be quite large.

kzollove commented 2 years ago

Maybe it's a good idea to suggest a maximum table size and a maximum data package size?

For the data package, I would suggest a maximum of around 500 MB, which is the EDI Data Portal upload limit (without direct download links). I'm not sure this is a reasonable suggestion right now, just from seeing how big the packages turn out even when split. It would require making multiple packages (sometimes many) for a single source.

One positive of that approach is that you then have very atomized hymetDP datasets. If I just want Site1 and Variable3 from an Andrews Forest dataset, I can download an entire single package and not have to worry about filtering.

As Greg mentioned before, this would result in many data packages. From an EDI perspective, this is not necessarily bad. From a hymetDP perspective, I also don't think this is too bad. Perhaps from an LTER IM perspective this poses a management problem which I could understand.

gremau commented 2 years ago

The issue with adding more packages is just that your search tools need to make it easy to filter the larger pool of packages. If I search for "meteorology" data at my site, and the search returns ten data packages, that is easier to understand and filter than if the search returns 150 data packages. JRN already has this problem to some extent - we've got over 200 data packages in EDI from our weather stations, and I think it can be difficult for a casual user to find what they are looking for. It's not hard for me because I know what is available and what search terms to use, but I need to help people find things if they don't have this background knowledge.

With HyMet this is probably a problem we could tackle with a registry of HyMet datasets or something. And it is certainly possible to devise improved search tools for casual users, or to keyword packages such that they are findable. We should probably discuss what that would look like though.

twhiteaker commented 2 years ago

CUAHSI keeps a registry of datasets. It's called HIS Central. Datasets registered there have their metadata harvested to facilitate the searches that happen at data.cuahsi.org. All this is to say, the registry idea has proven useful for other projects and I think it'd be useful here.

kzollove commented 2 years ago

From the start, hymetDP should be pretty searchable across the ODM CV terms within the tables. You can also use a text, geographic, or temporal filter. That's all built-in from day one. I imagine that could narrow down results pretty well.

In the future, semantic annotation search could help as well.

kzollove commented 2 years ago

@twhiteaker @gremau I have listed and explained three possible solutions to this problem in this slideshow.

I'd appreciate input from both of you (whether you think there are other solutions, or whether any of these solutions would or would not be suitable). Feel free to create new slides in the slideshow, or we can discuss in this thread (referring to solutions as S1, S2, or S3).

twhiteaker commented 2 years ago

In a nutshell, I think you need some flavor of all three solutions. You have to allow a single L0 to become multiple L1s because of the size issue.

I think you also should allow multiple data values tables in a single package, as that would make it easier to retrieve and read from just the data you need. The SeriesCatalog table will need a column to store the filename of the DataValues table which contains the series described in a given row.

When it comes to how you split things up, I agree that splitting by site and/or variable is fine and that we shouldn't recommend splitting by the other axes. I also wonder if you need to allow splitting by time. I wonder if any sites have an example of a time series of a single variable at a single site having so many values that it exceeds our size recommendations. Hopefully not, because splitting by time could make things messy. It might be harder to identify longer term datasets if datasets are split by time. A desired query I heard back when CUAHSI was just getting started was, "Show me where I have long term time series of both stream discharge and nutrient concentrations at the same location." I'm hoping whatever we come up with can support that kind of query.

My memory is failing me...why are we including ValueID? It's used in ODM for grouping values or showing provenance, e.g., this daily value is an average derived from these hourly values. Is such functionality needed for our HyMet data? I suspect not, which means we can drop ValueID. I also wonder if we'll ever need OffsetValue, OffsetTypeCode, and CensorCode.

Regarding S3, one question I have is, do we use an ancillary table or the SeriesCatalog table? I suspect we'll wind up using a modified version of the SeriesCatalog table. Since it sounds like we won't be producing vanilla ODM CSV, we might as well roll with it and do what we need for efficiency, and rely on the R package to convert to vanilla ODM if desired. So, utilize that SeriesCatalog table, and call its identifier SeriesID. As to which fields to drop from DataValues, it might be time to poll our stakeholders to see what fields they would need. I think you'll need QualifierCode in there, but hopefully not offsets.

How does search work for HyMet? Is there going to be a HyMet server that keeps its own summary of what's available in EDI with an API for searches? Or, does searching happen through the R package, something like this:

  1. User supplies area of interest, date range, and variables of interest.
  2. R package finds HyMet datasets in EDI whose EML metadata matches the search parameters.
  3. User selects datasets of interest.
  4. R package downloads the SeriesCatalog table for each package and presents series to the user. The R package would group rows in the SeriesCatalog together. For example, if we store UTCOffset in that table, no series will be longer than about half a year due to daylight savings, but the user probably wouldn't care about that, so the R package would compute its own "grouped" series catalog on the fly just on the core axes and present that to the user (see the sketch after this list).
  5. User selects time series of interest. R package downloads the appropriate DataValues table and loads the data into memory.
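
A rough sketch of the grouping in step 4, using dplyr and ODM 1.1 SeriesCatalog column names (assumed; the real catalog may differ):

```r
library(dplyr)

series_catalog <- read.csv("SeriesCatalog.csv")

# Collapse to one row per logical series on the five core axes, spanning
# any chunks created by UTCOffset changes or table splits.
grouped_catalog <- series_catalog |>
  group_by(SiteCode, VariableCode, MethodID, SourceID,
           QualityControlLevelCode) |>
  summarise(BeginDateTime = min(BeginDateTime),
            EndDateTime   = max(EndDateTime),
            ValueCount    = sum(ValueCount),
            .groups = "drop")
```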

Since the SeriesCatalog is a necessary window into a dataset, I think we want to keep it from becoming unwieldy. It might be good at this point to poll our stakeholders to figure out which columns change frequently. If for example offset values are all over the place in a given time series, they should be in DataValues and not SeriesCatalog. Streamflow should be pretty clean to work with, but I don't know about other data types that HyMet would support.

gremau commented 2 years ago

Good summary of the options, Kyle.

I think I'm in agreement that we might need to consider using more than one option here, but perhaps there should be a sort of decision tree that the HyMet R package could trigger as one is preparing an L1 dataset. For example, if the DataValues table is estimated to be larger than 100 MB (or less?), then maybe we could ask the user to split the DataValues table into separate tables (S2), suggesting a split by site or variable first. Then, if the size of the entire L1 dataset is estimated to be larger than 500 MB, suggest splitting into multiple L1 datasets by site or variable (S1).
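
A rough sketch of that decision tree; the 100 MB / 500 MB thresholds are the ones suggested above, and the estimated sizes are passed in because how they'd actually be estimated is still open:

```r
table_limit   <- 100 * 1024^2  # per DataValues table
package_limit <- 500 * 1024^2  # per L1 dataset (EDI upload limit noted above)

suggest_split <- function(table_bytes, dataset_bytes) {
  suggestions <- character(0)
  if (table_bytes > table_limit) {
    suggestions <- c(suggestions,
      "S2: split the DataValues table (by site or variable first) into separate tables.")
  }
  if (dataset_bytes > package_limit) {
    suggestions <- c(suggestions,
      "S1: split into multiple L1 datasets by site or variable.")
  }
  if (length(suggestions) == 0) "No split needed." else suggestions
}

suggest_split(table_bytes = 2.5e8, dataset_bytes = 6e8)
```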

I think option S3 is also really smart, but I am a little concerned that it is the least ODM 1.1 compatible. It would be a slight extra headache to convert back to an ODM-compatible format, and if a researcher, site, or other HyMet user wanted to upload data to CUAHSI, that could potentially be a problem. This is also a possible issue with S2, but simpler to manage since the individual DataValues tables are, I think, still ODM compatible except for the filenames.

The search issue is going to be something we need to reckon with in any case, but it becomes more challenging as L0 data tables get split into additional L1 tables and datasets to keep sizes manageable. I think we need to decide if all HyMet datasets need to be hosted at EDI, or if other repositories (which we have discussed, I think) are possible. Even with ecocomDP, I think there is a listing of datasets archived at EDI, which is maintained separately from the datasets themselves and which the ecocomDP R package reads somehow (could be wrong about this...). If we allow other repositories to store HyMet datasets, then this registry idea becomes even more important.

So, it seems like we have two sets of questions:

  1. Where will HyMet datasets go and what kind of HyMet search do we need? If this data search requires a registry or another system separate from the HyMetDP R package itself (such as I believe ecocomDP uses) then what does that look like?
  2. How do HyMet's stakeholders (HyMet dataset creators or users) envision publishing their data? I.e., are offsets common, do they want multi-site data packages, how big are the datasets, etc.?

To get some answers, I think we should probably set up a meeting with the EDI HyMet group first, for question 1, then with HyMet stakeholders for question 2.

Does that sound ok?

kzollove commented 2 years ago

Hi @gremau @twhiteaker,

We should probably lock down the answers to Greg's questions ASAP; a meeting with the EDI HyMet group sounds like the right place to start.

For now, maybe we agree to sideline the S3 option. As we've all noted, it brings in a new level of complexity (and a lot of potential confusion).

Search Data Function Explanation

I will try to shed some light on this topic. Essentially, hymetDP copies the search/index patterns from ecocomDP. An EDI data package holds the list of hymetDP data packages (see example in the staging portal). The search function then imports that index package (once per session in which it is used).

hymetDP can currently search for:

  - Free text (in abstract and title)
  - VariableName
  - GeneralCategory
  - SiteType
  - SampleMedium
  - TimeSupport (frequency)
  - Temporal coverage
  - Geographic coverage
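
Just to illustrate, something like the call below; the argument names are assumed to mirror the fields above and may not match the released search_data() signature:

```r
library(hymetDP)

# Hypothetical call: find packages mentioning "stream" whose series
# include a Discharge variable measured in surface water.
results <- search_data(text = "stream",
                       VariableName = "Discharge",
                       SampleMedium = "Surface water")
```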

Since we have no finalized hymet datasets in EDI, and only a few in staging, there's really no way to test the search feature. You can get a feel for it by using the ecocomDP search.

FYI, we can extend searchability easily to other CV variables. If you think one of the important ones is missing just let me know.

cgries commented 2 years ago

I may not have read every word here, so I apologize if I repeat anything. I agree that S1 is the best solution. However, I would include the option of splitting a dataset by year (time). This would make the most sense for our lake temperature profile data. Except for maybe the surface temperature, each temperature time series at a certain depth would not be used separately from the others. There may be other cases where it makes more sense to keep parameters together for one site.