phenopackets / phenopacket-schema

Repository for the GA4GH phenopacket schema
https://phenopacket-schema.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
79 stars 30 forks source link

The cardinality of `Resource`s may be too strict #371

Closed ielis closed 1 year ago

ielis commented 1 year ago

Let's consider a minimal phenopacket:

{
  "id": "phenopacket-id",
  "metaData": {
    "created": "2023-06-14T13:39:42.714Z",
    "createdBy": "PhenopacketLab",
    "phenopacketSchemaVersion": "2.0"
  }
}

The phenopacket currently fails the validator because it has 0 Resources and the schema prescribes presence of 1..* Resources. It is a "kosher" phenopacket though, it has all required elements. However, due to no PhenotypicFeatures, Measurements, etc., there are no ontologies used and, consequently, no Resources required.

I would suggest updating the Resource cardinality to 0..* in the schema.

cmungall commented 1 year ago

Or we could go the other way, and say an empty phenopacket with no patients is invalid

On Wed, Jun 14, 2023 at 5:59 PM Daniel Danis @.***> wrote:

Let's consider a minimal phenopacket:

{ "id": "phenopacket-id", "metaData": { "created": "2023-06-14T13:39:42.714Z", "createdBy": "PhenopacketLab", "phenopacketSchemaVersion": "2.0" } }

The phenopacket currently fails the validator because it has 0 Resources and the schema prescribes presence of 1..* Resources. It is a "kosher" phenopacket though, it has all required elements. However, due to no PhenotypicFeatures, Measurements, etc., there are no ontologies used and, consequently, no Resources required.

I would suggest updating the Resource cardinality to 0..* in the schema.

— Reply to this email directly, view it on GitHub https://github.com/phenopackets/phenopacket-schema/issues/371, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOI62JVGQQGQAZOVHSLXLJM5TANCNFSM6AAAAAAZHDPPGU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

pnrobinson commented 1 year ago

I would agree with Chris, there is no point in a phenopacket without content and it is OK for the validator to point it out as an error -- in real life, if somebody submits a phenopacket like this, it almost surely means that something went wrong and that our validator should point this out!

ielis commented 1 year ago

I agree as well, and I think we need to point this up in the schema & the docs, so that I can implement the appropriate checks in the phenopacket-tools.

julesjacobsen commented 1 year ago

Consider this example:

{
  "id": "phenopacket-id",
  "subject": {
    "id": "patient-id",
    "sex": "MALE"
  },
  "metaData": {
    "created": "2023-06-14T13:39:42.714Z",
    "createdBy": "PhenopacketLab",
    "phenopacketSchemaVersion": "2.0"
  }
}

There is no valid Resource for this but it's a valid, if not very informative, phenopacket. We were aiming for flexible (ok, this comes at a semantic cost) so this should be valid, but wouldn't be considered valid under the current "requires at least one Resource" category. So I think we should relax the Resource constraint from 1.. to 0.. as this will allow the simplest possible phenopacket and will also be a non-breaking change to existing phenopackets.

Where relaxing the Resource constraint to 0..* falls down is that standard schema validation tools won't be of any use because the requirements for Resource depends on what is being expressed although certainly anything with a CURIE form requires a Resource which will include anything present in phenotypicFeatures, measurements, biosamples, interpretations, diseases or medicalActions fields and possibly individual depending on what is included there.

cmungall commented 1 year ago

I think your suggestion makes sense Jules, I retract my original stricter suggestion.

And your internal LLMs are all predicting what is coming next: these kinds of validation issues are fixed by LinkML. Just include conditional rules that constrain Resources based on the rest of the phenopacket...

On Fri, Jun 16, 2023 at 1:30 AM Jules Jacobsen @.***> wrote:

Consider this example:

{ "id": "phenopacket-id", "subject": { "id": "patient-id", "sex": "MALE" }, "metaData": { "created": "2023-06-14T13:39:42.714Z", "createdBy": "PhenopacketLab", "phenopacketSchemaVersion": "2.0" } }

There is no valid Resource for this but it's a valid, if not very informative, phenopacket. We were aiming for flexible (ok, this comes at a semantic cost) so this should be valid, but wouldn't be considered valid under the current "requires at least one Resource" category. So I think we should relax the Resource constraint from 1.. to 0.. as this will allow the simplest possible phenopacket and will also be a non-breaking change to existing phenopackets.

Where relaxing the Resource constraint to 0..* falls down is that standard schema validation tools won't be of any use because the requirements for Resource depends on what is being expressed although certainly anything with a CURIE form requires a Resource which will include anything present in phenotypicFeatures, measurements, biosamples, interpretations, diseases or medicalActions fields and possibly individual depending on what is included there.

— Reply to this email directly, view it on GitHub https://github.com/phenopackets/phenopacket-schema/issues/371#issuecomment-1594315092, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOINTOKX4GCZTEGVFPTXLQKTVANCNFSM6AAAAAAZHDPPGU . You are receiving this because you commented.Message ID: @.***>

ielis commented 1 year ago

@julesjacobsen, @cmungall if you agree on relaxing the constraint, please do so in the docs. I will update the validator in phenopacket-tools afterwards. At the moment, the validation of the phenopacket example above fails due to missing Resources.

pnrobinson commented 1 year ago

Honestly it does not seem a good idea to change the schema for this -- there is a lot to be said for stability and there is not a big reason for a change. In any real-life situation, people will be using the schema in combination with some ontologies and so for instance in a rare disease setting we should add a reference to HPO even if we do not add a term because say maybe we were searching for HPO data but did not find anything. There is no reason that we should accept an empty phenopacket as valid.

ielis commented 1 year ago

Your approach can open another can of worms. Do you think the following is OK?

{
  "id": "phenopacket-id",
  "subject": {
    "id": "patient-id",
    "sex": "MALE"
  },
  "metaData": {
    "created": "2023-06-14T13:39:42.714Z",
    "createdBy": "PhenopacketLab",
    "resources": [{
      "id": "hp",
      "name": "human phenotype ontology",
      "url": "http://purl.obolibrary.org/obo/hp.owl",
      "version": "2018-03-08",
      "namespacePrefix": "HP",
      "iriPrefix": "http://purl.obolibrary.org/obo/HP_"
    }, {
      "id": "pato",
      "name": "PhenotypicFeature And Trait Ontology",
      "url": "http://purl.obolibrary.org/obo/pato.owl",
      "version": "2018-03-28",
      "namespacePrefix": "PATO",
      "iriPrefix": "http://purl.obolibrary.org/obo/PATO_"
    }, {
      "id": "geno",
      "name": "Genotype Ontology",
      "url": "http://purl.obolibrary.org/obo/geno.owl",
      "version": "19-03-2018",
      "namespacePrefix": "GENO",
      "iriPrefix": "http://purl.obolibrary.org/obo/GENO_"
    }, {
      "id": "ncbitaxon",
      "name": "NCBI organismal classification",
      "url": "http://purl.obolibrary.org/obo/ncbitaxon.owl",
      "version": "2018-03-02",
      "namespacePrefix": "NCBITaxon",
      "iriPrefix": "http://purl.obolibrary.org/obo/NCBITaxon_"
    }, {
      "id": "eco",
      "name": "Evidence \u0026 Conclusion Ontology (ECO)",
      "url": "http://purl.obolibrary.org/obo/eco.owl",
      "version": "2018-11-10",
      "namespacePrefix": "ECO",
      "iriPrefix": "http://purl.obolibrary.org/obo/ECO_"
    }, {
      "id": "omim",
      "name": "An Online Catalog of Human Genes and Genetic Disorders",
      "url": "https://www.omim.org",
      "version": "2023-01-27",
      "namespacePrefix": "OMIM",
      "iriPrefix": "https://www.omim.org/entry/"
    }],
    "phenopacketSchemaVersion": "2.0"
  }
}

The phenopacket includes loads of unused Resources, which we better not allow.

I think it would be much cleaner to informally agree upon doing the following:

{
  "id": "phenopacket-id",
  "subject": {
    "id": "patient-id",
    "sex": "MALE",
    "taxonomy": {
      "id": "NCBITaxon:9606",
      "label": "Homo sapiens"
    }
  },
  "metaData": {
    "created": "2023-06-14T13:39:42.714Z",
    "createdBy": "PhenopacketLab",
    "resources": [{
      "id": "ncbitaxon",
      "name": "NCBI organismal classification",
      "url": "http://purl.obolibrary.org/obo/ncbitaxon.owl",
      "version": "2018-03-02",
      "namespacePrefix": "NCBITaxon",
      "iriPrefix": "http://purl.obolibrary.org/obo/NCBITaxon_"
    }],
    "phenopacketSchemaVersion": "2.0"
  }
}

However, these edge cases need to be considered in the next schema iteration.

pnrobinson commented 1 year ago

Well, I think it would be fine just to reject the phenopacket at the very top. There is no reason to reject the phenopacket with the many resources -- that just says that the resources were used in some way in making the phenopacket (perhaps there were no matching terms to the version). These are edge cases that are useless though, and it does not make sense for us to work hard to figure out what to do. It is easiest just to require at least one ontology resource for a valid phenopacket.

ielis commented 1 year ago

I cannot agree with you regarding the non-importance of the edge cases. Well defined model is attractive (e.g. Rust syntax), while specification with many clunky corner cases is annoying (e.g. VCF).

This issue arose with SQA in PhenopacketLab, with empty phenopackets failing the validation from this obscure reason. We'll go with adding taxonomy into the subject and, consequently, including NCBITaxon in the Resources.

pnrobinson commented 1 year ago

I agree edge cases are important and was saying the solution is arbitrary and it is best not to change the schema unless there is a true advantage! NCBI Taxon sounds fine