microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

Map individual ORNL/DAAC data layer vocabularies #66

Open cmungall opened 4 years ago

cmungall commented 4 years ago

On the aim1 call yesterday, while homing in on the 1.2 deliverables, we identified the individual data layer vocabs for Identify as a higher priority than GCMD.

For each vocab there are 3 phases:

  1. [ ] Extract individual vocabularies (ORNL)
  2. [ ] First pass automated mapping to ENVO and other vocabs (LBNL)
  3. [ ] Validate mappings (ORNL)

For the extraction, any format is fine. I suggest either SKOS/RDF or simply a TSV, e.g. with these columns:

  1. stable id (these seem to be numeric IDs)
  2. name
  3. description (if available)

Any metadata about the layer/vocab is also welcome.
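As a rough illustration of the TSV option, here is a minimal sketch. The column names (`id`, `name`, `description`), the helper names, and the sample rows are all assumptions for illustration, not an agreed format or real layer data.

```python
import csv
import io

# Hypothetical rows: (stable id, name, description). Not taken from any real layer.
rows = [
    ("1", "example class A", "placeholder description"),
    ("2", "example class B", ""),  # description is optional, so it may be empty
]


def write_vocab_tsv(rows, handle):
    """Write a header line plus one tab-separated line per vocabulary entry."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "name", "description"])
    writer.writerows(rows)


def read_vocab_tsv(handle):
    """Read the TSV back as a list of dicts keyed by the header names."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_vocab_tsv(rows, buffer)
buffer.seek(0)
records = read_vocab_tsv(buffer)
```

Keeping the id as a string avoids losing leading zeros or mixing numeric and non-numeric ids across layers.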

All files should be checked into GitHub, in this repo, ideally with a Makefile for orchestrating any wget/curl steps.

For 2, we will use our mapping framework to produce mappings in SSSOM format and deposit them in this repo.

For 3, the procedure will be to manually spot-check the SSSOM files in this repo and make any requested changes via PR/ticket.
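To give a feel for what a first-pass mapping file might look like, here is a minimal sketch of exact label matching written out as SSSOM-style TSV. This is not the mapping framework mentioned above. The `LAYER:` and `EX:` identifiers, the labels, and the helper names are placeholders (no real ENVO ids are used), and only a small subset of SSSOM columns is shown; a real SSSOM file carries additional metadata.

```python
import csv
import io

# Hypothetical target ontology terms; the EX: ids are placeholders, not real ENVO ids.
target_terms = {"EX:0000001": "ice", "EX:0000002": "forest"}
# Hypothetical layer classes keyed by a made-up "LAYER:<code>" id.
source_terms = {"LAYER:1": "Ice", "LAYER:2": "shrubland"}

FIELDS = ["subject_id", "subject_label", "predicate_id", "object_id", "object_label"]


def lexical_mappings(source, target):
    """Map source terms to target terms whose label matches exactly, ignoring case."""
    by_label = {label.strip().lower(): tid for tid, label in target.items()}
    rows = []
    for sid, slabel in source.items():
        tid = by_label.get(slabel.strip().lower())
        if tid is not None:
            rows.append({
                "subject_id": sid,
                "subject_label": slabel,
                "predicate_id": "skos:exactMatch",
                "object_id": tid,
                "object_label": target[tid],
            })
    return rows


def to_tsv(rows):
    """Serialize the mapping rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


mappings = lexical_mappings(source_terms, target_terms)
tsv = to_tsv(mappings)
```

Unmatched classes (here the placeholder "shrubland") simply produce no row, which is what the manual validation in phase 3 would need to catch.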

Not sure which of @usethedata, @stantonmartin, and @Blancohl will do tasks 1 and 3. Note that I can't assign Bruce or Stan; they need to accept my GH invites first.

StantonMartin commented 4 years ago

Where do we see the GH invites? I am not seeing any under my profile.

cmungall commented 4 years ago

@StantonMartin the invite was for Bruce and Hannah; it looks like they have accepted it.

Can we make a start on this task? Is there anything unclear here? You can ping me on the aim1 slack if there are any questions.

StantonMartin commented 4 years ago

IGBP.json.TXT

Uploading IGBP JSON

StantonMartin commented 4 years ago

UMD.json.TXT UMD Json

StantonMartin commented 4 years ago

Note IGBP is a legacy standard: http://www.igbp.net/

cmungall commented 4 years ago

Thanks! So I assume despite being a legacy standard it is still used by some of the data layers we will access via the API, so we'll want to map them.

Note the header of the UMD file says it is "IGBP_V6", not UMD. Is this expected?

I assume that when the API is functional there will be an unambiguous way to map to one of these tables?

The two are largely identical, except that code 15 has a different meaning in each and IGBP has an extra code, 16.

I noted a discrepancy with https://lpdaac.usgs.gov/documents/101/MCD12_User_Guide_V6.pdf

In this version, there is no 0 code for IGBP.

I'm inclined not to trust the codes and just map the labels.
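To see exactly where two legends disagree before deciding, a comparison along these lines could help. This is only a sketch: the legend contents below are placeholders, not the actual IGBP or UMD classes, and `compare_legends` is a hypothetical helper.

```python
def normalize(label):
    """Lower-case a label and collapse runs of whitespace so trivial differences are ignored."""
    return " ".join(label.lower().split())


def compare_legends(a, b):
    """Return (codes whose labels differ, codes only in a, codes only in b)."""
    differing = {
        code: (a[code], b[code])
        for code in sorted(a.keys() & b.keys())
        if normalize(a[code]) != normalize(b[code])
    }
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return differing, only_a, only_b


# Placeholder legends (code -> label); the labels are invented for illustration.
legend_a = {14: "shared class", 15: "class X", 16: "class Y"}
legend_b = {14: "Shared  Class", 15: "class Z"}

differing, only_a, only_b = compare_legends(legend_a, legend_b)
```

With the placeholder data, code 15 is reported as differing and code 16 as present only in the first legend, which mirrors the kind of discrepancy described above.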

StantonMartin commented 4 years ago

The discrepancy is due to different versions of the standard. The current identify tool was using version 005, which I cannot find documentation for. I found some old documentation for 5.1, but even it has a slight discrepancy with the legend that is surfaced through the current identify tool. My opinion is that we should adopt the latest version (v6), as in the user guide above, as the standard and surface the classifications from it. So the JSON files should map identically to the MCD12_User_Guide_V6.pdf tables regardless of what the current identify tool does. From the doc:

The product contains 13 Science Data Sets (SDS; Table 1), including 5 legacy classification schemes (IGBP, UMD, LAI, BGC, and PFT; Tables 3-7) and a new three layer legend based on the Land Cover

So if we want to be complete, we would do all 5 legacy schemes using the versions in the tables, as well as the new three layer legend. I would expect the three layer legend to be the "default" option that is surfaced if the classification argument is not passed as a parameter to the function call.

StantonMartin commented 4 years ago

The list of all potential data products from the MODIS satellite that the new identify tool could query is here:

https://modis.ornl.gov/rst/api/v1/products

Blancohl commented 4 years ago

Identify_layers.zip (https://github.com/microbiomedata/nmdc-metadata/files/4806177/Identify_layers.zip)

The table is organized by layer, code, and definition, all of which were pulled directly from the identify tool. I assumed that the only information pulled by the tool would be from the datasets that are already highlighted if you go through the Identify Tool to SDAT.

I feel it's important to point out that I had to pull the legends directly from the tool. For the majority of the layers I couldn't find any documentation about the classification systems at all.

cmungall commented 4 years ago

Perfect, thanks! We'll do a first-pass automated alignment, then let's talk about curating the mappings (should not be a large task).

StantonMartin commented 4 years ago

Sounds good.

cmungall commented 4 years ago

Can you shed any light on this? Is there something further that differentiates these:

Bailey Ecoregion Province,1,ice,

Bailey Ecoregion Province,2,ice,
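Cases like this can be listed for the whole file with a short script. The sketch below assumes the CSV columns are layer, code, definition in that order with no header row; `find_shared_labels` is a hypothetical helper, and the sample text is just the two rows above.

```python
import csv
import io
from collections import defaultdict

# The two rows quoted above, used as sample input.
sample = "Bailey Ecoregion Province,1,ice,\nBailey Ecoregion Province,2,ice,\n"


def find_shared_labels(handle):
    """Return {(layer, label): [codes]} for labels used by more than one code in a layer."""
    codes_by_label = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        layer, code, label = row[0], row[1], row[2]
        codes_by_label[(layer, label.strip().lower())].append(code)
    return {key: codes for key, codes in codes_by_label.items() if len(codes) > 1}


duplicates = find_shared_labels(io.StringIO(sample))
```

For the sample rows this reports the label "ice" under codes 1 and 2 of the Bailey Ecoregion Province layer.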

Blancohl commented 4 years ago

@cmungall I'm sorry, but no. There was no documentation I could easily pull about how the legends were put together or how the classifications were designated. My only thought is that most of the legends were numerical values associated with a color gradient; if the legend is supposed to be read primarily by color, then I think it could be that the two colors were both ice, but because they weren't identical they couldn't use the same number.

That's the best logical guess I can make with my rudimentary understanding of remote sensing and imaging. But otherwise, no. I don't have a solid explanation.

wdduncan commented 4 years ago

The long-term goal is to use SSSOM to represent mappings.

cmungall commented 4 years ago

We will start with the Zobler soil layers. This is "Global Soil Type" in the CSV.

cmungall commented 4 years ago

@StantonMartin added metadata about the Zobler layers in #133 -- this is similar to what is in the Identify_layers.csv provided by @Blancohl but includes additional metadata about the layer itself:

https://github.com/microbiomedata/nmdc-metadata/blob/54a482cb5e06d2c5183b9b57255ea230dbb2062d/identify/Zobler_Definitions.txt#L1-L19

Additionally, each class has a mnemonic associated with it, e.g. AF for ferric acrisol:

https://github.com/microbiomedata/nmdc-metadata/blob/54a482cb5e06d2c5183b9b57255ea230dbb2062d/identify/Zobler_Definitions.txt#L20-L24

StantonMartin commented 3 years ago

Any updates on this thread? I see the Zobler types were mapped to the ontology. What about the land use characterization from MODIS? Have these terms been mapped, or is that still outstanding?
