Closed ielis closed 1 year ago
Or we could go the other way, and say an empty phenopacket with no patients is invalid
On Wed, Jun 14, 2023 at 5:59 PM Daniel Danis @.***> wrote:
Let's consider a minimal phenopacket:
{ "id": "phenopacket-id", "metaData": { "created": "2023-06-14T13:39:42.714Z", "createdBy": "PhenopacketLab", "phenopacketSchemaVersion": "2.0" } }
The phenopacket currently fails the validator because it has 0 Resources and the schema prescribes presence of 1..* Resources. It is a "kosher" phenopacket though, it has all required elements. However, due to no PhenotypicFeatures, Measurements, etc., there are no ontologies used and, consequently, no Resources required.
I would suggest updating the Resource cardinality to 0..* in the schema.
— Reply to this email directly, view it on GitHub https://github.com/phenopackets/phenopacket-schema/issues/371, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOI62JVGQQGQAZOVHSLXLJM5TANCNFSM6AAAAAAZHDPPGU . You are receiving this because you are subscribed to this thread.Message ID: @.***>
I would agree with Chris, there is no point in a phenopacket without content and it is OK for the validator to point it out as an error -- in real life, if somebody submits a phenopacket like this, it almost surely means that something went wrong and that our validator should point this out!
I agree as well, and I think we need to point this up in the schema & the docs, so that I can implement the appropriate checks in the phenopacket-tools.
Consider this example:
{
"id": "phenopacket-id",
"subject": {
"id": "patient-id",
"sex": "MALE"
},
"metaData": {
"created": "2023-06-14T13:39:42.714Z",
"createdBy": "PhenopacketLab",
"phenopacketSchemaVersion": "2.0"
}
}
There is no valid Resource
for this but it's a valid, if not very informative, phenopacket. We were aiming for flexible (ok, this comes at a semantic cost) so this should be valid, but wouldn't be considered valid under the current "requires at least one Resource
" category. So I think we should relax the Resource
constraint from 1.. to 0.. as this will allow the simplest possible phenopacket and will also be a non-breaking change to existing phenopackets.
Where relaxing the Resource
constraint to 0..* falls down is that standard schema validation tools won't be of any use because the requirements for Resource
depends on what is being expressed although certainly anything with a CURIE form requires a Resource
which will include anything present in phenotypicFeatures
, measurements
, biosamples
, interpretations
, diseases
or medicalActions
fields and possibly individual
depending on what is included there.
I think your suggestion makes sense Jules, I retract my original stricter suggestion.
And your internal LLMs are all predicting what is coming next: these kinds of validation issues are fixed by LinkML. Just include conditional rules that constrain Resources based on the rest of the phenopacket...
On Fri, Jun 16, 2023 at 1:30 AM Jules Jacobsen @.***> wrote:
Consider this example:
{ "id": "phenopacket-id", "subject": { "id": "patient-id", "sex": "MALE" }, "metaData": { "created": "2023-06-14T13:39:42.714Z", "createdBy": "PhenopacketLab", "phenopacketSchemaVersion": "2.0" } }
There is no valid Resource for this but it's a valid, if not very informative, phenopacket. We were aiming for flexible (ok, this comes at a semantic cost) so this should be valid, but wouldn't be considered valid under the current "requires at least one Resource" category. So I think we should relax the Resource constraint from 1.. to 0.. as this will allow the simplest possible phenopacket and will also be a non-breaking change to existing phenopackets.
Where relaxing the Resource constraint to 0..* falls down is that standard schema validation tools won't be of any use because the requirements for Resource depends on what is being expressed although certainly anything with a CURIE form requires a Resource which will include anything present in phenotypicFeatures, measurements, biosamples, interpretations, diseases or medicalActions fields and possibly individual depending on what is included there.
— Reply to this email directly, view it on GitHub https://github.com/phenopackets/phenopacket-schema/issues/371#issuecomment-1594315092, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOINTOKX4GCZTEGVFPTXLQKTVANCNFSM6AAAAAAZHDPPGU . You are receiving this because you commented.Message ID: @.***>
@julesjacobsen, @cmungall if you agree on relaxing the constraint, please do so in the docs. I will update the validator in phenopacket-tools
afterwards. At the moment, the validation of the phenopacket example above fails due to missing Resource
s.
Honestly it does not seem a good idea to change the schema for this -- there is a lot to be said for stability and there is not a big reason for a change. In any real-life situation, people will be using the schema in combination with some ontologies and so for instance in a rare disease setting we should add a reference to HPO even if we do not add a term because say maybe we were searching for HPO data but did not find anything. There is no reason that we should accept an empty phenopacket as valid.
Your approach can open another can of worms. Do you think the following is OK?
{
"id": "phenopacket-id",
"subject": {
"id": "patient-id",
"sex": "MALE"
},
"metaData": {
"created": "2023-06-14T13:39:42.714Z",
"createdBy": "PhenopacketLab",
"resources": [{
"id": "hp",
"name": "human phenotype ontology",
"url": "http://purl.obolibrary.org/obo/hp.owl",
"version": "2018-03-08",
"namespacePrefix": "HP",
"iriPrefix": "http://purl.obolibrary.org/obo/HP_"
}, {
"id": "pato",
"name": "PhenotypicFeature And Trait Ontology",
"url": "http://purl.obolibrary.org/obo/pato.owl",
"version": "2018-03-28",
"namespacePrefix": "PATO",
"iriPrefix": "http://purl.obolibrary.org/obo/PATO_"
}, {
"id": "geno",
"name": "Genotype Ontology",
"url": "http://purl.obolibrary.org/obo/geno.owl",
"version": "19-03-2018",
"namespacePrefix": "GENO",
"iriPrefix": "http://purl.obolibrary.org/obo/GENO_"
}, {
"id": "ncbitaxon",
"name": "NCBI organismal classification",
"url": "http://purl.obolibrary.org/obo/ncbitaxon.owl",
"version": "2018-03-02",
"namespacePrefix": "NCBITaxon",
"iriPrefix": "http://purl.obolibrary.org/obo/NCBITaxon_"
}, {
"id": "eco",
"name": "Evidence \u0026 Conclusion Ontology (ECO)",
"url": "http://purl.obolibrary.org/obo/eco.owl",
"version": "2018-11-10",
"namespacePrefix": "ECO",
"iriPrefix": "http://purl.obolibrary.org/obo/ECO_"
}, {
"id": "omim",
"name": "An Online Catalog of Human Genes and Genetic Disorders",
"url": "https://www.omim.org",
"version": "2023-01-27",
"namespacePrefix": "OMIM",
"iriPrefix": "https://www.omim.org/entry/"
}],
"phenopacketSchemaVersion": "2.0"
}
}
The phenopacket includes loads of unused Resource
s, which we better not allow.
I think it would be much cleaner to informally agree upon doing the following:
{
"id": "phenopacket-id",
"subject": {
"id": "patient-id",
"sex": "MALE",
"taxonomy": {
"id": "NCBITaxon:9606",
"label": "Homo sapiens"
}
},
"metaData": {
"created": "2023-06-14T13:39:42.714Z",
"createdBy": "PhenopacketLab",
"resources": [{
"id": "ncbitaxon",
"name": "NCBI organismal classification",
"url": "http://purl.obolibrary.org/obo/ncbitaxon.owl",
"version": "2018-03-02",
"namespacePrefix": "NCBITaxon",
"iriPrefix": "http://purl.obolibrary.org/obo/NCBITaxon_"
}],
"phenopacketSchemaVersion": "2.0"
}
}
However, these edge cases need to be considered in the next schema iteration.
Well, I think it would be fine just to reject the phenopacket at the very top. There is no reason to reject the phenopacket with the many resources -- that just says that the resources were used in some way in making the phenopacket (perhaps there were no matching terms to the version). These are edge cases that are useless though, and it does not make sense for us to work hard to figure out what to do. It is easiest just to require at least one ontology resource for a valid phenopacket.
I cannot agree with you regarding the non-importance of the edge cases. Well defined model is attractive (e.g. Rust syntax), while specification with many clunky corner cases is annoying (e.g. VCF).
This issue arose with SQA in PhenopacketLab, with empty phenopackets failing the validation from this obscure reason. We'll go with adding taxonomy into the subject and, consequently, including NCBITaxon
in the Resource
s.
I agree edge cases are important and was saying the solution is arbitrary and it is best not to change the schema unless there is a true advantage! NCBI Taxon sounds fine
Let's consider a minimal phenopacket:
The phenopacket currently fails the validator because it has 0
Resource
s and the schema prescribes presence of1..*
Resource
s. It is a "kosher" phenopacket though, it has all required elements. However, due to noPhenotypicFeature
s,Measurement
s, etc., there are no ontologies used and, consequently, noResource
s required.I would suggest updating the
Resource
cardinality to0..*
in the schema.