microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

Add more integrity checks on fields in JSON schema (ENVO, KEGG terms, etc.) #308

Closed jeffbaumes closed 3 years ago

jeffbaumes commented 3 years ago

There was an instance where two forms of KEGG id prefixes got into the pilot metadata so we want to guard against things like that in the future.

jeffbaumes commented 3 years ago

@wdduncan I assigned you and @jbeezley to this one. When ingesting annotations, we need to make sure the schema accepts only valid IDs (e.g. ENVO, KEGG), and I believe some regexes in the schema would help here. Could you help us figure out where to add these?

wdduncan commented 3 years ago

Thanks @jeffbaumes
Can you post some examples of the bad data?

jeffbaumes commented 3 years ago

I believe there was an issue where function IDs sometimes came through with prefix KO: and sometimes came through with prefix KEGG_ORTHOLOGY:. A regex in schema validation could catch things like this. Basically if any ID should match a regex we should put a regex in the schema. @jbeezley may have further thoughts here.
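As a rough illustration of the problem described above, a quick audit of the prefixes present in a batch of function IDs would surface the mixed forms before any schema change. This is only a sketch: `prefix_counts` and the sample IDs are hypothetical and not taken from the pilot metadata.

```python
from collections import Counter


def prefix_counts(ids):
    """Count how often each prefix (the text before the first ':') occurs."""
    return Counter(identifier.split(":", 1)[0] for identifier in ids)


# Hypothetical function IDs mixing the two prefix forms mentioned above.
sample_ids = ["KO:K00001", "KEGG_ORTHOLOGY:K00002", "KO:K00003"]

counts = prefix_counts(sample_ids)

# More than one distinct prefix for the same kind of ID signals an inconsistency.
mixed = len(counts) > 1
```

If `mixed` is true for a field that should use a single prefix, that field is a candidate for a regex in the schema.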

dehays commented 3 years ago

Didn't see the KEGG case, but this is the same as the ENVO case in #294 - @wdduncan The request here is to identify fields for which the raw value must conform to a regex and then do that validation. Looks like JSON Schema can do that ( https://json-schema.org/understanding-json-schema/reference/regular_expressions.html ) but that doesn't mean that the JSON Schema that gets generated in NMDC can include that. Ideas Bill?
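For reference, JSON Schema expresses this kind of constraint with the `pattern` keyword on a string property. The sketch below shows what such a fragment might look like and emulates the keyword with Python's `re` module. The property name `has_function` and the accepted prefix `KEGG.ORTHOLOGY` are assumptions made for illustration, not the generated NMDC schema. Note that JSON Schema patterns are not implicitly anchored, so `^` and `$` have to be written explicitly.

```python
import re

# Hypothetical schema fragment; the property name and pattern are illustrative only.
schema_fragment = {
    "type": "object",
    "properties": {
        "has_function": {
            "type": "string",
            # Explicit anchors: JSON Schema's `pattern` matches anywhere in the string.
            "pattern": r"^KEGG\.ORTHOLOGY:K\d+$",
        }
    },
}


def matches_pattern(schema, field, value):
    """Emulate the `pattern` keyword for one string property using re.search."""
    pattern = schema["properties"][field]["pattern"]
    return re.search(pattern, value) is not None
```

With this fragment, `KEGG.ORTHOLOGY:K00001` would pass, while `KO:K00001` and `KEGG_ORTHOLOGY:K00001` would both be rejected. Whether the generated NMDC JSON Schema can carry such patterns is the open question raised above.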

ssarrafan commented 3 years ago

Moving to May sprint per Bill's request and adding Large size label.
Comment from Bill: I need to investigate how to use regular expressions in jsonschema to validate data. So, I will give an estimate of one week, but the issue is quite open ended. The only use case so far is KEGG_ORTHOLOGY, but I imagine there are others.

cmungall commented 3 years ago

Sorry just saw this now.

There are different levels of checks

  1. Is the ID prefix valid? (e.g. KEGG.KO vs KEGG.ORTHOLOG)
  2. Is the local part of the ID syntactically conformant? (e.g. KEGG:K\d+)
  3. Is the ID valid and appropriate?

The first is very easy to do using the existing id_prefixes annotated in the schema

The second we can do by adding additional regexes (slot_usage)

The 3rd is more difficult, a few approaches:

  1. Additional code outside the schema that does lookups with public APIs to check the ID is valid and not obsolete and yields a semantically appropriate entity
  2. We enumerate all valid IDs and include as an enum in the schema. Note these could be quite large, but we would generate these programmatically and include as a separate import
  3. We take advantage of the ability in linkml to identify a codeset, and create this separately

But we can tackle this incrementally, start with 1 then 2
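The three levels of checks above can be sketched as one tiered function outside the schema. Everything here is assumed for illustration: the allowed prefix, the local-part pattern, and the tiny `KNOWN_IDS` set, which stands in for a programmatically generated enumeration (approach 2 for the third level). It is not the linkml mechanism itself.

```python
import re

# Level 1: allowed prefixes (assumed; in practice derived from id_prefixes).
ALLOWED_PREFIXES = {"KEGG.ORTHOLOGY"}

# Level 2: syntax of the local part, one pattern per prefix (assumed).
LOCAL_PATTERNS = {"KEGG.ORTHOLOGY": re.compile(r"K\d+")}

# Level 3: placeholder for an enumeration of valid IDs generated elsewhere.
KNOWN_IDS = {"KEGG.ORTHOLOGY:K00001"}


def check_id(curie):
    """Return the first level at which the ID fails, or 'ok'."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in ALLOWED_PREFIXES:
        return "bad_prefix"
    if not LOCAL_PATTERNS[prefix].fullmatch(local):
        return "bad_local_part"
    if curie not in KNOWN_IDS:
        return "unknown_id"
    return "ok"
```

Ordering the checks from cheapest to most expensive mirrors the incremental plan: the first two levels need only the schema, while the third needs an external source of valid IDs.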

ssarrafan commented 3 years ago

@cmungall and @wdduncan what would you like to do with this issue? Can we close it now that the discussion is done, and open a new ticket for June to start implementing per Chris's comment?

cmungall commented 3 years ago

Let's make one ticket for 1+2, another for 3

ssarrafan commented 3 years ago

Created two GH issues for June, assigned to @cmungall @wdduncan and @turbomam. Please remove wrong assignments if any or let me know.

https://github.com/microbiomedata/nmdc-metadata/issues/360
https://github.com/microbiomedata/nmdc-metadata/issues/362

Closing this issue.

turbomam commented 3 years ago

Hello, all. You may have seen that this was moved to https://github.com/microbiomedata/nmdc-schema/issues/69

@jeffbaumes, I don't think you and I have had much one-on-one conversation. I joined @cmungall's group in mid-March.

Can someone share some JSON objects that contain some of the error patterns mentioned in this original issue post?

I have reviewed the declared orthology prefixes, and I have added acceptable patterns for the local part of their IDs in

which corresponds to

I can continue to add local portion patterns for other classes in annotation.yaml (or even other files) based on those that have id_prefixes slots. See