Closed cmungall closed 1 month ago
Hi @cmungall
I think there is no bug here. The Phenopacket Schema defines VcfRecord as following:
message VcfRecord {
string genome_assembly = 1;
string chrom = 2;
uint64 pos = 3;
string id = 4;
string ref = 5;
string alt = 6;
string qual = 7;
string filter = 8;
string info = 9;
}
Note uint64
as the pos
field. Due to some constraints related to Javascript, protobuf encodes uint64
into str
when serializing to JSON.
This is also why the JSON-schema of phenopacket-tools
requires pos
to be a str
when in JSON/YAML format.
So, it is a bit odd, but I think str
is correct, if protobuf is the source of truth.
I thought we agreed the documentation was the SoT? It clearly states positions are integers.
IMO this is really odd. I'd strongly recommend merging this PR and before the current behavior becomes enshrined. For the protobuf users (surely a minority? Everyone uses json/jsonschema/pydantic) we can just have a minimal shim layer.
Another option is to use uint32
. This means we can't ever use phenopackets for plant genomes, but I think that's fine. But I think freeing ourselves from limitations of protobuf is best
@cmungall thanks for finding this. I am not sure what to do since the software is now pretty entrenched and it is working in many places. In any case, the place to fix it, if we want to, is not here, it would be upstream in the library code! I think that given the software that we are now using for about 5 other projects, this is probably something we might need to live with, or is there some easy solution that will not cause other issues? @ielis
It’s a tough one… my belief is that it’s better to take the short term churn now for the better long term result but I may be in the minority here…
On Tue, May 28, 2024 at 4:59 AM Peter Robinson @.***> wrote:
@cmungall https://github.com/cmungall thanks for finding this. I am not sure what to do since the software is now pretty entrenched and it is working in many places. In any case, the place to fix it, if we want to, is not here, it would be upstream in the library code! I think that given the software that we are now using for about 5 other projects, this is probably something we might need to live with, or is there some easy solution that will not cause other issues? @ielis https://github.com/ielis
— Reply to this email directly, view it on GitHub https://github.com/monarch-initiative/phenopacket-store/issues/97#issuecomment-2134695130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOKX72WU6FMRY4P44PLZERBPVAVCNFSM6AAAAABILT6462VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZUGY4TKMJTGA . You are receiving this because you were mentioned.Message ID: @.***>
I am not worried too much about our use cases - we can update our projects. I worry more about the public use and I think we should do what is allowed by the semantic versioning in order to fix errors and bugs.
It is important to decide what is the SoT, I get it wrong all the time. Then, regardless of choosing RTD or protobuf, we have plenty issues that should be fixed, some of them waiting for resolution for almost 2 years. We should find a way how to "touch" the schema, this is what I perceive as a very important issue. Nothing in the software world is static. Nothing.
In this particular case, I think that fixing the issue will be a "breaking" change. Either update the RTD to str
or drop Protobuf. I think it would be better to change the RTD and to free ourselves from Protobuf from Phenopacket Schema v3.0.0, after discussing with GA4GH people.
Another option is to use uint32. This means we can't ever use phenopackets for plant genomes, but I think that's fine.
To the best of my knowledge, uint32
is serialized to JSON in the same way as uint32
- as a str
. I ran a unit test in Java, checked the output, and observed a str
despite the Protobuf documentation advertising that int32
, fixed32
, uint32
are mapped to a number
.
In other words, running this:
package org.phenopackets.schema.v2;
import com.google.protobuf.util.JsonFormat;
import org.ga4gh.vrsatile.v1.VcfRecord;
import org.junit.jupiter.api.Test;
public class VcfRecordPosTest {
@Test
public void printPos() throws Exception {
VcfRecord record = VcfRecord.newBuilder()
.setChrom("chr1")
.setPos(123_456)
.setRef("C")
.setAlt("G")
.build();
String json = JsonFormat.printer().print(record);
System.err.println(json);
}
}
produces this:
{
"chrom": "chr1",
"pos": "123456",
"ref": "C",
"alt": "G"
}
Another option is to use uint32. This means we can't ever use phenopackets for plant genomes, but I think that's fine.
To the best of my knowledge, uint32 is serialized to JSON in the same way as uint32 - as a str. I ran a unit test in Java, checked the output, and observed a str despite the Protobuf documentation https://protobuf.dev/programming-guides/proto3/#json advertising that int32, fixed32, uint32 are mapped to a number.
Thanks for checking, that is unfortunate
Message ID: <monarch-initiative/phenopacket-store/issues/97/2135083968@ github.com>
We discussed this at length two years ago -- I think I saw the same thing when I was working on the JSON Schema. Our decision was basically that while it is not elegant, there is zero chance that anybody will actually treat the position as a string and so it is not a biggie. I think that it would be more productive to figure out what we need to do for version 3 of the Schema -- for instance, it would be easier to use linkML with a "fresh" project. I think that what we need now is really implementations and algorithms and analyses, and I think it would create a lot of unnecessary churn if we change the file structure right now! I will close this, but let's meet to plan for the future
pos should be an int: https://phenopacket-schema.readthedocs.io/en/latest/variant.html#vcfrecord
But all the values here are strings
(found by checking using linkml-validate)
E.g