ternaustralia / ausplotsR

R package to interact with TERN AusPlots data
GNU General Public License v3.0

Variable dictionary #12

Closed GregGuerin closed 3 years ago

GregGuerin commented 5 years ago

The output data tables (i.e. from a get_ausplots call and relating to AusPlots modules like vouchers or soil properties) are described in the help files, but the individual variables/columns are not defined anywhere (e.g. what they mean, their units etc.). While some of this information is in the field manual and some of it is obvious or intuitive, ideally there would be a document (or a link to one) that explains each data column/variable returned in the raw data from the package. The metadata that comes with an aekos download of TERN AusPlots can't be used, as the data presentation is quite different.

This may need a wider discussion of how to handle it. Improving the metadata is pretty fundamental, and we have had a user request for this information.

@smguru @tomsaleeba @Sammunroe

Sammunroe commented 5 years ago

I guess at a very basic level we could provide links that direct people to descriptions of the metadata? But I wonder if that would send people down a rabbit hole of readme files, websites, and PDFs. If we want to keep some of this "in R" so that people don't have to go looking on a website, could we add a new function that lets you call for descriptions of the metadata? I am not sure how practical this is, but something like get_ausplots_metadata(data_table="veg.voucher") could return a list that we have created of the column names and their descriptions for that table. Not sure if that is a good idea, but it saves people having to follow a trail of links.

tomsaleeba commented 5 years ago

Can we rely on the R help system? The benefit of that is:

  1. we know the user will be able to read it because they're already running R
  2. it's there in the environment when they're working with our tool, they don't have to go elsewhere to find it
  3. the help docs are in the git repo so they're versioned at the same time as the code. This doesn't mean the doco and code can't get out of sync but if we're disciplined, we can easily keep the two in sync.
  4. they can use it offline (who is ever offline these days anyway?!?!)

The drawback is if they don't know how to use the help/docs feature of R, then they'll be really stuffed. I guess we can assume a certain level of proficiency though.

GregGuerin commented 4 years ago

Internal delivery as help pages, for example, would be ideal, but at present we have nothing to populate them with as far as I am aware, so someone needs to write a short description of each field in each table.

Sammunroe commented 4 years ago

Just picking this up now that we are working on V1.2, could we ask Emrys and Christina to write the descriptions, and just convert it to a readme? Then link that to one of the more relevant R help pages?

Sammunroe commented 4 years ago

Hi guys, I just chatted with Emrys, and there are already "look-up" tables as spreadsheets that include all the various codes for different properties. These could be converted to tables or readme files with little effort.

GregGuerin commented 4 years ago

Sounds promising. Can you follow up and circulate the material or some examples?

Sammunroe commented 4 years ago

Yes, will do when he is back from the field

GregGuerin commented 4 years ago

@Sammunroe Would be nice to have this in v1.2 - do we know yet if documentation exists that we could present without a lot of drafting?

Sammunroe commented 4 years ago

@tomsaleeba, I think you were going to make these lookup tables accessible? We could just create a new function, that just calls the lookup tables?

tomsaleeba commented 4 years ago

There are two things to talk about here:

Swapping codes for labels

First is the fact that the data frames contain codes as opposed to nice, pretty labels. For example, we would have FLO when we could instead put Floodplain. We have look up tables so I can replace all these acronym/codes with labels that are much friendlier to humans. The labels will still act like the codes in that they'll be consistent and unique, but they'll also be human readable. Here's an example of a lookup table:

 id  |     landform_pattern     
-----+--------------------------
 FLO | Floodplain
 HIL | Hills
 KAR | Karst
 LAC | Laclustrine plain
 LAV | Lava plain
 LON | Longitudinal dunefield
 LOW | Low hills
 MAD | Made land
 MAR | Marine plain
 MEA | Meander plain
...
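
As a rough illustration of the code-to-label swap described above, here is a minimal sketch in base R. The lookup rows are copied from the table above; the `sites` data frame, its column name and the helper `swap_codes_for_labels` are assumptions made up for this example, not part of ausplotsR.

```r
# Lookup table rebuilt by hand from the first three rows shown above.
landform_lut <- data.frame(
  id = c("FLO", "HIL", "KAR"),
  landform_pattern = c("Floodplain", "Hills", "Karst"),
  stringsAsFactors = FALSE
)

# Hypothetical site data holding codes ("XXX" is a deliberately unknown code).
sites <- data.frame(
  landform_pattern = c("HIL", "FLO", "HIL", "XXX"),
  stringsAsFactors = FALSE
)

# match() gives the lookup row for each code, or NA when the code is unknown,
# so unmatched codes become NA instead of being silently dropped.
swap_codes_for_labels <- function(codes, lut) {
  lut$landform_pattern[match(codes, lut$id)]
}

sites$landform_label <- swap_codes_for_labels(sites$landform_pattern, landform_lut)
print(sites)
```

Keeping the original code column next to the new label column is one way to make the change non-destructive while people get used to the labels.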

I've started this work already (a long time ago) and that's not a problem to finish off, unless we don't want this change? I'm tracking those other changes in this issue: https://github.com/ternandsparrow/swarm-rest/issues/4.

Make more description available

The second thing is making more description for the values available. That same table above has more columns that I didn't include, such as a description:

 id  | landform_pattern | description
-----+------------------+-------------
 FLO | Floodplain       | Alluvial plain characterised by frequently active erosion and aggradation by channelled or overbank stream flow. Unless otherwise specified; 'frequently active' is to mean that flow has an Average Recurrence Interval of 50 years or less. Included types of landform pattern are: bar plain; meander plain; covered plain; anastomotic plain. Related relict landform patterns are: stagnant alluvial plain; terrace; terraced land (partly relict).
 HIL | Hills            | Landform pattern of high relief (90-300 m) with gently inclined to precipitous slopes. Fixed; shallow; erosional stream channels; closely to very widely spaced; form a non-directional or convergent; integrated tributary network. There is continuously active erosion by wash and creep and; in some cases; rarely active erosion by landslides.
 KAR | Karst            | Landform pattern of unspecified relief and slope typically with fixed; deep; erosional stream channels forming a non-directional; disintegrated tributary pattern and many closed depressions without stream channels. It is eroded by continuously active solution and rarely active collapse; the products being removed through underground channels.

I think this is more what this issue is about. In order to make this available, I could make a function that when called would retrieve all the description information that we have available. The user can then dump the dataframe to the screen and read it. I guess we'd actually need a bunch of dataframes, one for each lookup table. Does this sound ok?

Sammunroe commented 4 years ago

I think having the description table is key to a smooth experience. People could be made to look elsewhere for this info, like our manual, but there will be others who want to stay in the R environment. So my vote would be to change the codes to labels and make a function that retrieves the descriptions. That covers all our bases. Does it need to be an entirely new function? Could we make it an additional argument in the get_ausplots call, like description=T?

GregGuerin commented 4 years ago

@Sammunroe @tomsaleeba

That looks like just what is needed and has been sorely missing (several people have asked me where this info is).

Codes versus labels

We need to be a little careful if this is across many columns in the data tables.

For that reason I'd lean towards codes with a dictionary - but it would be easier to judge if we saw all the tables.

Presentation method

  1. Should we look at some of the existing data frame metadata functionality in R to see whether anything fits? That way you'd attach the information to the data tables themselves

  2. Agree, you could add to get_ausplots (I still like one gateway, unless it just gets too complex and unwieldy), and there could be metadata counterpart tables to each data module table, as long as descriptions of codes for each variable can be pooled into one data frame per data module (i.e., add to your table above a column to identify the variable). So for site.info metadata:

     VARIABLE       | CODE | NAME                      | DESCRIPTION
     ---------------+------+---------------------------+------------
     bioregion_name | MDD  | Murray Darling Depression | NA
     ...
     state          | SA   | South Australia           | NA
     ...

  3. Agree internal access best (allowing user to pull out a specific item) but it would be nice to compile them all into a master pdf somewhere for reference too (even if just available on GitHub)
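
To make the "one dictionary data frame per data module" idea in point 2 concrete, here is a small sketch. The two rows come from the site.info example above; the lowercase column names and the helper `describe_variable` are assumptions for illustration only.

```r
# Rows taken from the site.info example above; description is NA there.
site_info_dictionary <- data.frame(
  variable = c("bioregion_name", "state"),
  code = c("MDD", "SA"),
  name = c("Murray Darling Depression", "South Australia"),
  description = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the dictionary rows for one column of a module table.
describe_variable <- function(dictionary, variable) {
  dictionary[dictionary$variable == variable, c("code", "name", "description")]
}

state_codes <- describe_variable(site_info_dictionary, "state")
print(state_codes)
```

Pooling everything for a module into one long table like this means a user only needs one object per module and can filter it by variable.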

GregGuerin commented 4 years ago

@tomsaleeba @Sammunroe

Example: https://cran.r-project.org/web/packages/dataMeta/vignettes/dataMeta_Vignette.html

Schema is a possible fit here - and I like that it automatically gives ranges for numeric variables as well as defining categorical ones.

Manual: https://cran.r-project.org/web/packages/dataMeta/dataMeta.pdf

smguru commented 4 years ago

Hello all, we have vocabularies for all these things. Look at linkeddata.tern.org.au and the AusPlots vocabularies.

regards Guru


GregGuerin commented 4 years ago

It seems we could bypass dataMeta if the metadata already exists as tables and just use the function 'attr' to assign metadata (data frames of variable codes, descriptions etc.) to a data object such as mydata$site.info. A user would then retrieve the metadata with 'attributes(mydata$site.info)', e.g. attributes(mydata$site.info)$dictionary

The benefit of this approach is that there are fewer items in the returned data list, and the metadata sits right with the data but 'out of the way'.
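
A minimal sketch of the attr approach, assuming a stand-in `mydata` list rather than a real get_ausplots result. The MDD and SA codes are the ones mentioned earlier in this thread; the shape of the dictionary is an assumption.

```r
# Stand-in for the list returned by get_ausplots(); a real call would return
# many more columns and rows.
mydata <- list(
  site.info = data.frame(
    bioregion_name = "MDD",
    state = "SA",
    stringsAsFactors = FALSE
  )
)

# Assumed dictionary layout: one row per variable/code pair.
dictionary <- data.frame(
  variable = c("bioregion_name", "state"),
  code = c("MDD", "SA"),
  name = c("Murray Darling Depression", "South Australia"),
  stringsAsFactors = FALSE
)

# Attach the dictionary to the data frame itself.
attr(mydata$site.info, "dictionary") <- dictionary

# The data frame still prints as usual; the dictionary stays out of the way.
print(mydata$site.info)

# Retrieve it as a user would.
retrieved <- attributes(mydata$site.info)$dictionary
print(retrieved)
```

One thing to check before relying on this: operations that build a new data frame (merge(), for example) may not carry custom attributes across, so the dictionary could be lost after heavy processing.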

tomsaleeba commented 4 years ago

@smguru thanks for letting us know about the vocabulary repository. I don't think it fits what we're doing here as it's easier for us to pull the data direct from the source DB but it's great to know it's out there. It might come in handy.

I've added a new endpoint on the server to provide the metadata dictionary (in this commit). I've just pushed a commit to the v1.2_species_names branch in this repository that adds a function to use that new endpoint and get the metadata dictionary. You can test it right now by doing:

devtools::install_github('ternaustralia/ausplotsR@v1.2_species_names')
options("ausplotsR_api_url" = "http://dev2.inat.techotom.com") # the data is only available on this test server right now
md = ausplotsR:::.get_metadata_dictionary() # this function is *not* exported so you need the triple colons to call it
head(md)

You should see something like:

     variable code   label                                     description
1 basal_point   E1  East 1 Distance from SW corner: 10 m north; 100 m east
2 basal_point   E2  East 2 Distance from SW corner: 30 m north; 100 m east
3 basal_point   E3  East 3 Distance from SW corner: 50 m north; 100 m east
4 basal_point   E4  East 4 Distance from SW corner: 70 m north; 100 m east
5 basal_point   E5  East 5 Distance from SW corner: 90 m north; 100 m east
6 basal_point   N1 North 1 Distance from SW corner: 100 m north; 10 m west

Now, some points about this data:

  1. I've added all the look up tables that we have in the database. You can browse the schema of the database here, just look for tables that start with lut_. I'm fairly certain we have some variables in the metadata dictionary that don't appear in the ausplotsR dataframes (see below for list) and we may not have all the variables that do appear in the dataframes, I haven't yet checked.
  2. if you want me to remove variables that we don't use, I can do that. But I'd advocate for the server to send everything it can and the logic in ausplotsR can throw things it doesn't need away. This makes the server useful for other places, not just ausplotsR.
  3. Due to how databases work, I was forced to cast some numbers into strings. This only affects the code column as it's the only one that uses numbers. I don't think this will cause issues because R's == operator coerces a number to a string before comparing, e.g. '1' == 1 is TRUE. If you find yourself trying to join codes in the metadata dictionary to the data frames to look up values, and things aren't matching, it's probably this. If it happens, I think our only option is for the ausplotsR logic to cast strings that are only numbers back into a number.
  4. I took some poetic licence in how this dataframe is constructed. Not all the look up tables have exactly the 3 columns we need: code, label and description. For ones that are missing description, that's easy, I just left it off. There are tables that have 4 or more columns and for those I had to make a decision about how to pull the data together. For example, pedality_grade has four columns and I chose to make grade the label value but I didn't want to lose the pedality value so I joined it into the description. We have the choice to change this to be however it needs to be but I've made an executive decision for everything so we have a starting point. The variables I've done this for are: pedality_grade, texture_grade and pit_marker_mga_zones.
  5. some lookups are used multiple times in dataframes. For example, the lithology lookup is used for both outcrop_lithology and other_outcrop_lithology. This also affects observer_veg/observer_soil/described_by and smallest_size_1/smallest_size_2 too. Right now, I've only included the former in the metadata dictionary but we have options here too:
    1. leave it like it is
    2. the server sends a duplicate of outcrop_lithology under the name other_outcrop_lithology. It's a little redundant but it saves any confusion.
    3. rename the variable from the server to something generic like lithology and the logic in ausplotsR will match it up to where it needs to go.
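
The string-versus-number caveat in point 3 can be checked with a few lines of base R. The dictionary rows and labels below are invented purely for illustration, and casting the data side to character is a swapped-in alternative to the "cast back to number" option mentioned in the list.

```r
# `==` coerces the number to a string before comparing, so this is TRUE...
print("1" == 1)
# ...but the two values are not the same type, so identical() is FALSE.
print(identical("1", 1))
# Coercion compares the string forms, so formatting differences matter.
print("1.0" == 1)

# Hypothetical dictionary whose codes arrived from the server as strings.
md <- data.frame(
  variable = "example_variable",
  code = c("1", "2", "3"),
  label = c("one", "two", "three"),
  stringsAsFactors = FALSE
)

# Hypothetical data column holding the same codes as numbers.
observed_codes <- c(2, 3, 1, 2)

# Casting the data side to character makes the comparison explicit instead of
# relying on implicit coercion.
labels <- md$label[match(as.character(observed_codes), md$code)]
print(labels)
```

Which side gets cast is a design choice; the point is that the lookup should do it explicitly so that a mismatch does not depend on how a particular join function handles mixed types.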

Variables that (possibly) aren't used in ausplotsR dataframes:

  1. basal_point
  2. coarse_frag_abund
  3. coarse_frag_shape
  4. coarse_frag_size
  5. ibra
  6. surface_soil_condition
  7. surface_strew_size

Try it out and let me know any changes that are required.

tomsaleeba commented 4 years ago

I've just pushed a new commit to the v1.2_species_names branch that uses the data from the TERN LinkedData repo. @smguru I have to eat my words from my previous comment, we're using your service :heart_eyes: . This is how we're using it (https://github.com/ternandsparrow/swarm-rest/blob/master/ausplots-metadata-dictionary-server/index.js#L55): a light transformation to make it easier to consume in R.

Follow the same instructions as my previous comment (after pulling the new commit) and you'll see what we've got, example:

> head(md)
                variableCode variableLabel                    variableDefinition variableValueCode   variableValueLabel
1 FIXME, maybe erosion_state Erosion State The particular condition(s) observed.                 S           Stabilised
2 FIXME, maybe erosion_state Erosion State The particular condition(s) observed.                 P Partially stabilised
3 FIXME, maybe erosion_state Erosion State The particular condition(s) observed.                 Z               Absent
4 FIXME, maybe erosion_state Erosion State The particular condition(s) observed.                 A               Active
5 FIXME, maybe erosion_state Erosion State The particular condition(s) observed.               n/a                  N/A
6 FIXME, maybe erosion_state Erosion State The particular condition(s) observed.                NC        Not Collected
                                                                                                                               variableValueDefinition
1                One or both of the following conditions apply: no evidence of sediment movement; sides and/or floors of erosion form are revegetated.
2                                                                                  Evidence of some active erosion and some evidence of stabilisation.
3                                                                                                                                                 <NA>
4 One or both of the following conditions apply: evidence of sediment movement; sides and/or floors of erosion form are relatively bare of vegetation.
5                                                                                                                                      Not applicable.
6                                                                                                                                       Not collected.

There are two points I need to make:

  1. as far as I can tell, the LinkedData repo doesn't have the names we use for the variables like basal_point. They only have the pretty names like Basal Point. Somehow we're going to have to figure out that mapping. Right now I take the pretty name, make it lowercase and replace spaces with an underscore in the hope that it'll match everything. I haven't checked though.
  2. it's Friday arvo and I'm going home just as this has deployed, so I haven't double checked anything. There may be issues with the data, so if you see anything, let me know.
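
The name mapping in point 1 (lowercase, spaces to underscores) and the check that has not been done yet could look roughly like this. The helper name and the `data_columns` vector are assumptions; real column names would come from the ausplotsR data frames.

```r
# Lowercase the pretty label and replace each space with an underscore.
label_to_variable <- function(label) {
  gsub(" ", "_", tolower(label), fixed = TRUE)
}

candidate_names <- label_to_variable(c("Basal Point", "Erosion State"))
print(candidate_names)

# Hypothetical column names standing in for a real ausplotsR data frame.
data_columns <- c("site_location_name", "erosion_state")

# Derived names with no matching data column need a manual mapping.
unmatched <- setdiff(candidate_names, data_columns)
print(unmatched)
```

Running a check like this across all module tables would show how far the simple rule gets before a hand-written mapping is needed.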

smguru commented 4 years ago

Hello @tomsaleeba, I'm not sure what you are trying to do, but it would be good to catch up to understand your needs. All the AusPlots terminologies have been created with the intent that they be reused.

regards Guru

tomsaleeba commented 4 years ago

@smguru We're using the data from the LinkedData repo to provide more context in ausplotsR. Currently a user will look at the bioregion_name column in their R dataframe and wonder what DAC means. By including the data from LinkedData, the user will be able to see the column contains IBRA codes and what the full name for DAC is. Basically providing a better user experience.

smguru commented 4 years ago

Hello Tom, DAC is just a code, each bioregion has a code and it is represented in a vocabulary.

regards Guru


edmondchuc commented 4 years ago

Hi @tomsaleeba, if you can, please use the IBRA codes list from http://linked.data.gov.au/dataset/bioregion/IBRA7 instead of http://linked.data.gov.au/def/ausplots-cv/a9754a72-c2f7-4a9d-9686-9df78fb65e62. The latter was generated from the AusPlots Rangelands database while the former was generated from the authoritative source. 👍

tomsaleeba commented 4 years ago

Thanks @edmondchuc, I'll make that change. :heart_eyes:

GregGuerin commented 3 years ago

I'll close - @tomsaleeba has created the dictionary and it can be made more complete over time