openaq / project-metadata-format

This repo houses project discussions for building a metadata format for the metadata editor.

metadata schema sketch #7

Open sethvincent opened 5 years ago

sethvincent commented 5 years ago

Based on comments in this repo so far I've put together a quick sketch of what a metadata object might look like.

It's annotated with the links for relevant issues in this repo.

I'll update this issue with revisions as we discuss and decide on the specifics of the various attributes.

{
  id: '', // unique station id https://github.com/openaq/project-universal-stationID/issues
  maintenance: '', // maybe a set of options like daily/weekly/monthly/as needed https://github.com/openaq/project-metadata-format/issues/6
  siteDescription: '', // where is it placed, is unusual weather common, or other description https://github.com/openaq/project-metadata-format/issues/6
  installationDate: '', // ISO timestamp https://github.com/openaq/project-metadata-format/issues/6
  stationHeight: Number, // in meters? https://github.com/openaq/project-metadata-format/issues/4
  instruments: [ // https://github.com/openaq/project-metadata-format/issues/3
    {
      brand: '',
      model: '',
      installationDate: '', // ISO timestamp
      calibrationProcedures: ''
    }
  ]
}
RocketD0g commented 5 years ago

Here are some suggestions:

Edit: Anyone checking out this issue thread should also feel free to check Issue #8, which just came in with some feedback that has not yet been integrated into this thread. Note that some of those suggestions are already captured in our existing data format. That said, we may want to consider having several of those existing data parameters in our metadata format too. For instance, someone may report the existence of a station for which we have no pollution measurements. It'd still be valuable to have the coordinates, etc.

jflasher commented 5 years ago

I don't have much input on which specific metadata should be included, but I do have a more general comment. I am generally all for 'simple is better', which I think has served us well for the measurement data format, but I would like to point out that here we are, down the line, creating a new standard to store other data. The point being: if we had stored more information in the original measurement format, would we need to do this work now? It may be the case that the measurement and metadata formats would always be separate, but I just wanted to point out that excluding things now may lead to more work down the line.

Also, if we're going to duplicate data between the measurement and station data, we should think about how syncing will work. Which source is the source of truth? If data is updated in one place, how does it get updated in the other?
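
A rough way to make the source-of-truth question concrete: the sketch below assumes, purely for illustration, that the metadata record wins whenever a field exists in both places, and it reports the fields where the two sources disagree. The function name resolveStation, the list of duplicated fields, and the sample values are all hypothetical and not part of any agreed design.

```javascript
// Hypothetical list of fields stored in both the measurement data and the
// station metadata. The real overlap is still under discussion.
const DUPLICATED_FIELDS = ['city', 'country', 'coordinates', 'sourceType'];

// Assumption for this sketch: the metadata record is the source of truth,
// so its values overwrite those derived from measurements.
function resolveStation(fromMeasurements, fromMetadata) {
  const resolved = { ...fromMeasurements, ...fromMetadata };
  const conflicts = [];
  for (const field of DUPLICATED_FIELDS) {
    const a = fromMeasurements[field];
    const b = fromMetadata[field];
    // JSON.stringify is a crude deep comparison; it is sensitive to key order.
    if (a !== undefined && b !== undefined && JSON.stringify(a) !== JSON.stringify(b)) {
      conflicts.push(field);
    }
  }
  return { resolved, conflicts };
}

// Placeholder values only.
const result = resolveStation(
  { city: 'City A', country: 'US' },
  { city: 'City A (metro area)' }
);
console.log(result.resolved.city, result.conflicts);
```

Reporting the conflicts, rather than silently overwriting, would at least give whoever runs the sync a list of fields to review.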

RocketD0g commented 5 years ago

@jflasher, this is a good point on perhaps amping up the metadata format.

Input mentioned by Robert Rohde in Issue #8 also made me think we probably should duplicate (and pull in for existing station locations) several of the metadata already in the measurement data format.

The reason being: It'd be nice to have all metadata information about a given location - or the suite of all locations - not split between two systems: the metadata format and the data format.

For instance, imagine this scenario: Someone wants to add station data for monitors in Ghana via the metadata editor. Perhaps all they can add are the pollutant types and the coordinates of the station. It'd be a shame if we couldn't capture that. It'd also seem a shame if someone else who wants to find out what stations exist in Ghana has to know to ping both the regular API and the metadata API to get the full set of information, no? Are there other thoughts on this?
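
To illustrate the scenario above, here is a minimal sketch of a partial record that carries only pollutant types and coordinates. It assumes that no field is strictly required and that coordinates are stored as a latitude/longitude object; the helper describeCompleteness, the KNOWN_FIELDS list, and the coordinate values are illustrative assumptions only.

```javascript
// Hypothetical partial record: only pollutants and coordinates are known.
// The coordinate values are illustrative, not a real station.
const partialStation = {
  stationPollutants: ['pm25', 'pm10'],
  coordinates: { latitude: 5.6, longitude: -0.2 }
};

// Assumed subset of station-level fields used for this sketch.
const KNOWN_FIELDS = ['stationName', 'stationPollutants', 'coordinates', 'city', 'country'];

// Assumed rule: a record is accepted as long as it carries at least one
// known field, and the missing ones are simply reported.
function describeCompleteness(station) {
  const present = KNOWN_FIELDS.filter((f) => station[f] !== undefined);
  const missing = KNOWN_FIELDS.filter((f) => station[f] === undefined);
  return { accepted: present.length > 0, present, missing };
}

const completeness = describeCompleteness(partialStation);
console.log(completeness);
```

A single API could then return such a record as-is, with the missing list telling consumers what has not been contributed yet.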

In the following comment, I'll post an updated format.

RocketD0g commented 5 years ago

Here's a proposed metadata format that incorporates metadata parameters from our existing data format and the changes others and I suggest above. It also takes to heart @jflasher's comment that perhaps, for once, we should not do something in the simplest format possible. :)

openaq-metadata-format (WIP)

A description of the working metadata format provided by the OpenAQ Platform.

Station-Level Information

| Field | Type | Required | Description | Comment/Q |
| --- | --- | --- | --- | --- |
| stationID | Number | | Assigned by OpenAQ | https://github.com/openaq/project-universal-stationID/issues |
| stationName | String | | Unique location name of the station | This is pulled from location in the existing data format and is the originating source-designated name. If we are to have a station ID, it seems like a good idea to have a stationName too. |
| stationPollutants | String | | The measured parameter; acceptable values are pm25, pm10, co, bc, so2, no2, o3 | Stations will often measure more than one type of pollutant. This info is already included for stations in the system in our existing measurement data format. Also: Shall we include more pollutant types than what we currently ingest in OpenAQ? CO2, CH4, SOx, benzene? |
| city | String | | City (or regional approximation) containing location | This info is already included for stations in the system in our existing measurement data format. Do we want to keep this in the metadata format? I know there was some controversy. @jflasher |
| stationAltitude | | | Geospatial altitude of station coordinates in meters | |
| country | String | | Country containing location in two letter ISO format | This info is already included for stations in the system in our existing measurement data format. |
| sourceType | String | | The type of source; acceptable values are: government, research, other | This info is already included for stations in the system in our existing measurement data format. |
| coordinates | Object | | Location of measurement | This info is already included for stations in the system in our existing measurement data format. |
| attribution | Array | | Data attribution in descending order of prominence | Example: [{"name": "TCEQ", "url":"http://www.tceq.state.tx.us"}, {"name": "City of Houston Health Department"}] |
| mobile | Boolean | | Indicates whether the measuring station is stationary or mobile | Should we keep this? I think so. |
| instrumentNumber | Number | | Number of instruments registered in the OpenAQ system to this station | Comments on this? Basically, we need a way to label multiple instruments with different metadata of their own and measuring multiple pollutants at a given station. |
| stationStart | Object | | When did the station first begin operating, if known? ISO timestamp | |
| stationActive | String | | True, False, Unknown options. Is the station still active? | |
| deactivatedStationDate | Object | | If the station is no longer active, when did the station stop operating? ISO timestamp | |
| otherStationNotes | String | | Any other relevant notes about this station? | |
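
As a hedged illustration of how the enumerated station-level values could be checked, the sketch below validates a station object against the acceptable values listed for stationPollutants, sourceType, country, and stationActive. The function validateStation is hypothetical, every field is treated as optional, and stationPollutants is accepted as either a single string or an array of strings, which is an assumption prompted by the note that stations often measure more than one pollutant.

```javascript
const POLLUTANTS = ['pm25', 'pm10', 'co', 'bc', 'so2', 'no2', 'o3'];
const SOURCE_TYPES = ['government', 'research', 'other'];
const ACTIVE_STATES = ['True', 'False', 'Unknown'];

// Returns a list of human-readable problems; an empty list means no
// problems were found. Missing fields are not treated as errors here.
function validateStation(station) {
  const errors = [];
  const pollutants = station.stationPollutants;
  if (pollutants !== undefined) {
    // Assumption: a single string or an array of strings are both allowed.
    const list = Array.isArray(pollutants) ? pollutants : [pollutants];
    for (const p of list) {
      if (!POLLUTANTS.includes(p)) errors.push(`unknown pollutant: ${p}`);
    }
  }
  if (station.sourceType !== undefined && !SOURCE_TYPES.includes(station.sourceType)) {
    errors.push(`unknown sourceType: ${station.sourceType}`);
  }
  // Only the shape of the country code is checked, not whether it is assigned.
  if (station.country !== undefined && !/^[A-Z]{2}$/.test(station.country)) {
    errors.push(`country is not a two-letter code: ${station.country}`);
  }
  if (station.stationActive !== undefined && !ACTIVE_STATES.includes(station.stationActive)) {
    errors.push(`unknown stationActive value: ${station.stationActive}`);
  }
  return errors;
}

const stationErrors = validateStation({
  stationPollutants: ['pm25', 'co2'],
  sourceType: 'government',
  country: 'GH'
});
console.log(stationErrors);
```

If the list of pollutants is extended (CO2, CH4, and so on), only the POLLUTANTS array in this sketch would need to change.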

Instrument-Level Information:

For a value of n retrieved from instrumentNumber = n, a corresponding number of instrument[n] fields needs to be created. Each instrument field requests the following instrument metadata:

| Field | Type | Required | Description | Comment/Q |
| --- | --- | --- | --- | --- |
| instrumentPollutants | String | | The pollutant parameters measured by the instrument; acceptable values are pm25, pm10, co, bc, so2, no2, o3 | Similar to the question in the Station-Level table above: Do we want to make it possible to include other pollutant types? |
| instrumentType | String | | | We could come up with a list of possibilities, but I'm tempted to see what would come in and develop an 'options' list from that. Does that seem like a bad idea to anyone? |
| instrumentSerialNumber | String | | Provides unique ID at the station level. | This can act as a unique ID for the instrument. |
| instrumentManufacturer | String | | | |
| modelName | String | | | |
| rawFrequency | Number | | The raw sampling frequency of the instrument (e.g. min, sec, hr, day) | I think this will be the same for different pollutants measured by the same instrument. Feedback from others who may disagree? |
| reportingFrequency | Number | | The reporting sampling frequency of the instrument (e.g. min, sec, hr, day) | I think this will be the same for different pollutants measured by the same instrument. Feedback from others who may disagree? |
| measurementStyle | String | | Automated, Manual, Unknown | |
| calibrationProcedures | String | | | This would be specific to the instrument. Left open-ended (which I think it needs to be), this will likely vary widely in text length depending on the particular editor and the particular station. |
| inletHeight | Number | | Height of intake inlet, if known, in meters. | |
| installationDate | Object | | Installation date for instrument. ISO timestamp | |
| instrumentActive | String | | True, False, Unknown options. Is the instrument still active? | |
| deactivatedInstrumentDate | Object | | If the instrument has been deactivated, what date did this occur? | |
| otherInstrumentNotes | String | | Any other relevant notes about this instrument? | |
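
The table types rawFrequency and reportingFrequency as Number while the examples given are units (min, sec, hr, day). One possible reading, sketched below as an assumption rather than a decision, is to store each as a value plus a unit and convert to seconds when comparing them. The shape { value, unit } and the helper toSeconds are hypothetical.

```javascript
// Seconds in each of the units mentioned in the table.
const SECONDS_PER_UNIT = new Map([
  ['sec', 1],
  ['min', 60],
  ['hr', 3600],
  ['day', 86400]
]);

// Converts a hypothetical { value, unit } pair to a number of seconds.
function toSeconds(frequency) {
  const factor = SECONDS_PER_UNIT.get(frequency.unit);
  if (factor === undefined) throw new Error(`unknown unit: ${frequency.unit}`);
  return frequency.value * factor;
}

// Placeholder instrument: samples every minute, reports every hour.
const instrument = {
  rawFrequency: { value: 1, unit: 'min' },
  reportingFrequency: { value: 1, unit: 'hr' }
};

// Number of raw samples aggregated into each reported value.
const samplesPerReport =
  toSeconds(instrument.reportingFrequency) / toSeconds(instrument.rawFrequency);
console.log(samplesPerReport);
```

Strictly speaking these values are intervals rather than frequencies, which may be worth settling in the field names.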

Other Information:

| Field | Type | Required | Description | Comment/Q |
| --- | --- | --- | --- | --- |
| input[x]Date | Object | | A record of when entries were added and the history of edits, where x is the x-th edit | I am unclear if we can do this or if this should be listed as a parameter in this format. Basically, we would want some sort of ability to version. |
| inputAuthor[x] | String | | Who added this entry? | Could this contain contact info, like an email address? |
| notesByAuthor[x] | String | | Any other relevant notes to add about this station or instruments therein? | |

Edit: Forgot to add 'stationAltitude', just added.

sethvincent commented 5 years ago

Great, this expanded list of fields makes a lot of sense.

A few small suggestions:

mobile

The notion of a mobile station is super interesting and has implications for how we identify stations and assign the unique station ids. No suggestions on this really but something we'll want to keep in mind for figuring out station uniqueness. @olafveerman

instrumentNumber

If we store instruments as an array we can get the length of the array instead of tracking the instrument count in a field.
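
A minimal sketch of that idea, with placeholder field names and values: the count is derived from the array whenever it is needed, so it cannot drift out of sync with the instruments actually stored. The helper instrumentCount is hypothetical.

```javascript
// Placeholder station with two instruments stored as an array.
const station = {
  id: 'stationid',
  instruments: [
    { type: 'instrument type', serialNumber: 'serial number 1' },
    { type: 'instrument type', serialNumber: 'serial number 2' }
  ]
};

// Derives the count instead of reading a stored instrumentNumber field.
// A station with no instruments array is treated as having zero.
function instrumentCount(s) {
  return Array.isArray(s.instruments) ? s.instruments.length : 0;
}

console.log(instrumentCount(station));
```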

field names

I'd be tempted to drop instrument and station from the field names as I think they will be in separate objects and that will help keep them short.

For example instead of stationID, stationName, instrumentType, and instrumentSerialNumber, it could look like:

{
  id: 'stationid',
  name: 'station name',
  instruments: [
    {
      type: 'instrument type',
      serialNumber: 'serial number'
    }
  ]
}
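
If records already exist with the longer names, the renaming could be done mechanically. The sketch below is one hypothetical way to do it: it strips the station or instrument prefix from each key and nests the instruments in an array. The helpers stripPrefix, renameKeys, and toNested are made up for illustration, and the special case for ID is an assumption about how stationID should be shortened.

```javascript
// Removes a leading prefix such as 'station' from a camelCase field name.
// Names where the prefix is not followed by a capital letter are left alone.
function stripPrefix(name, prefix) {
  if (!name.startsWith(prefix)) return name;
  const rest = name.slice(prefix.length);
  if (!/^[A-Z]/.test(rest)) return name;
  // Assumption: stationID should become id rather than iD.
  if (rest === 'ID') return 'id';
  return rest[0].toLowerCase() + rest.slice(1);
}

// Returns a copy of the record with the prefix stripped from every key.
function renameKeys(record, prefix) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    out[stripPrefix(key, prefix)] = value;
  }
  return out;
}

// Builds the nested form from a flat station and a list of flat instruments.
function toNested(flatStation, flatInstruments) {
  return {
    ...renameKeys(flatStation, 'station'),
    instruments: flatInstruments.map((i) => renameKeys(i, 'instrument'))
  };
}

const nested = toNested(
  { stationID: 'stationid', stationName: 'station name' },
  [{ instrumentType: 'instrument type', instrumentSerialNumber: 'serial number' }]
);
console.log(JSON.stringify(nested, null, 2));
```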

History/changes fields

For fields like input[x]Date, inputAuthor[x], and notesByAuthor[x], if we want to track changes, we would likely make this a separate table that records what those fields would contain, plus the diff between the old and updated values of the fields that changed.

Alternately, we could start with a simpler couple of fields like updateDate, the date of the last update, and updateAuthor, the author of the last update.
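
Both options can be sketched together. The example below is a rough illustration, not a proposed schema: diffFields computes which fields changed between two versions of a record, and recordUpdate appends an entry with updateDate, updateAuthor, and the list of changes to a history array standing in for the separate table. The function names, the entry shape, and the sample values are assumptions.

```javascript
// Lists the fields whose values differ between two versions of a record.
// JSON.stringify is a crude deep comparison; it is sensitive to key order.
function diffFields(oldRecord, newRecord) {
  const keys = new Set([...Object.keys(oldRecord), ...Object.keys(newRecord)]);
  const changes = [];
  for (const key of keys) {
    const before = oldRecord[key];
    const after = newRecord[key];
    if (JSON.stringify(before) !== JSON.stringify(after)) {
      changes.push({ field: key, before, after });
    }
  }
  return changes;
}

// Returns a new history array with one entry appended, or the same
// history if nothing changed. The history stands in for a separate table.
function recordUpdate(history, oldRecord, newRecord, author, date) {
  const changes = diffFields(oldRecord, newRecord);
  if (changes.length === 0) return history;
  return [...history, { updateDate: date, updateAuthor: author, changes }];
}

// Placeholder records, author, and date.
const history = recordUpdate(
  [],
  { name: 'station name', mobile: false },
  { name: 'new station name', mobile: false },
  'editor@example.org',
  '2019-01-01T00:00:00Z'
);
console.log(JSON.stringify(history, null, 2));
```

The simpler option of keeping only updateDate and updateAuthor corresponds to reading those two values from the last entry in the history.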

q005 commented 5 years ago

Great idea. I would include the calculated "uncertainty of measurement" for each pollutant. This could replace the calibration procedure. It's also useful to know whether meteorology is measured on site and, if so, what parameters. In some applications the instruments are moved around so this might need a history associated with the instrument array. It's useful to know but the data should be instrument-agnostic.