monarch-initiative / phenopacket-store

Collection of phenopackets
https://monarch-initiative.github.io/phenopacket-store/
BSD 3-Clause "New" or "Revised" License
12 stars 4 forks source link

all values for pos in vcfRecord are strings #97

Closed cmungall closed 1 month ago

cmungall commented 1 month ago

pos should be an int: https://phenopacket-schema.readthedocs.io/en/latest/variant.html#vcfrecord

But all the values here are strings

(found by checking using linkml-validate)

E.g

find . -name "*.json" -exec grep -h '"pos"' {} \;  | head
                  "pos": "124756209",
                  "pos": "124758482",
                  "pos": "124756540",
                  "pos": "124758482",
                  "pos": "124756540",
                  "pos": "124762120",
                  "pos": "124756704",
                  "pos": "124762120",
                  "pos": "124756540",
                  "pos": "124758482",
ielis commented 1 month ago

Hi @cmungall

I think there is no bug here. The Phenopacket Schema defines VcfRecord as following:

message VcfRecord {
  string genome_assembly = 1;
  string chrom = 2;
  uint64 pos = 3;
  string id = 4;
  string ref = 5;
  string alt = 6;
  string qual = 7;
  string filter = 8;
  string info = 9;
}

Note uint64 as the pos field. Due to some constraints related to Javascript, protobuf encodes uint64 into str when serializing to JSON.

This is also why the JSON-schema of phenopacket-tools requires pos to be a str when in JSON/YAML format.

So, it is a bit odd, but I think str is correct, if protobuf is the source of truth.

The only issue is... ... that RTD requires an int :scream:
cmungall commented 1 month ago

I thought we agreed the documentation was the SoT? It clearly states positions are integers.

IMO this is really odd. I'd strongly recommend merging this PR and before the current behavior becomes enshrined. For the protobuf users (surely a minority? Everyone uses json/jsonschema/pydantic) we can just have a minimal shim layer.

cmungall commented 1 month ago

Another option is to use uint32. This means we can't ever use phenopackets for plant genomes, but I think that's fine. But I think freeing ourselves from limitations of protobuf is best

pnrobinson commented 1 month ago

@cmungall thanks for finding this. I am not sure what to do since the software is now pretty entrenched and it is working in many places. In any case, the place to fix it, if we want to, is not here, it would be upstream in the library code! I think that given the software that we are now using for about 5 other projects, this is probably something we might need to live with, or is there some easy solution that will not cause other issues? @ielis

cmungall commented 1 month ago

It’s a tough one… my belief is that it’s better to take the short term churn now for the better long term result but I may be in the minority here…

On Tue, May 28, 2024 at 4:59 AM Peter Robinson @.***> wrote:

@cmungall https://github.com/cmungall thanks for finding this. I am not sure what to do since the software is now pretty entrenched and it is working in many places. In any case, the place to fix it, if we want to, is not here, it would be upstream in the library code! I think that given the software that we are now using for about 5 other projects, this is probably something we might need to live with, or is there some easy solution that will not cause other issues? @ielis https://github.com/ielis

— Reply to this email directly, view it on GitHub https://github.com/monarch-initiative/phenopacket-store/issues/97#issuecomment-2134695130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAMMOKX72WU6FMRY4P44PLZERBPVAVCNFSM6AAAAABILT6462VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZUGY4TKMJTGA . You are receiving this because you were mentioned.Message ID: @.***>

ielis commented 1 month ago

I am not worried too much about our use cases - we can update our projects. I worry more about the public use and I think we should do what is allowed by the semantic versioning in order to fix errors and bugs.

It is important to decide what is the SoT, I get it wrong all the time. Then, regardless of choosing RTD or protobuf, we have plenty issues that should be fixed, some of them waiting for resolution for almost 2 years. We should find a way how to "touch" the schema, this is what I perceive as a very important issue. Nothing in the software world is static. Nothing.

In this particular case, I think that fixing the issue will be a "breaking" change. Either update the RTD to str or drop Protobuf. I think it would be better to change the RTD and to free ourselves from Protobuf from Phenopacket Schema v3.0.0, after discussing with GA4GH people.


Another option is to use uint32. This means we can't ever use phenopackets for plant genomes, but I think that's fine.

To the best of my knowledge, uint32 is serialized to JSON in the same way as uint32 - as a str. I ran a unit test in Java, checked the output, and observed a str despite the Protobuf documentation advertising that int32, fixed32, uint32 are mapped to a number.

In other words, running this:

package org.phenopackets.schema.v2;

import com.google.protobuf.util.JsonFormat;
import org.ga4gh.vrsatile.v1.VcfRecord;
import org.junit.jupiter.api.Test;

public class VcfRecordPosTest {

    @Test
    public void printPos() throws Exception {
        VcfRecord record = VcfRecord.newBuilder()
                .setChrom("chr1")
                .setPos(123_456)
                .setRef("C")
                .setAlt("G")
                .build();
        String json = JsonFormat.printer().print(record);
        System.err.println(json);
    }
}

produces this:

{
  "chrom": "chr1",
  "pos": "123456",
  "ref": "C",
  "alt": "G"
}
cmungall commented 1 month ago

Another option is to use uint32. This means we can't ever use phenopackets for plant genomes, but I think that's fine.

To the best of my knowledge, uint32 is serialized to JSON in the same way as uint32 - as a str. I ran a unit test in Java, checked the output, and observed a str despite the Protobuf documentation https://protobuf.dev/programming-guides/proto3/#json advertising that int32, fixed32, uint32 are mapped to a number.

Thanks for checking, that is unfortunate

Message ID: <monarch-initiative/phenopacket-store/issues/97/2135083968@ github.com>

pnrobinson commented 1 month ago

We discussed this at length two years ago -- I think I saw the same thing when I was working on the JSON Schema. Our decision was basically that while it is not elegant, there is zero chance that anybody will actually treat the position as a string and so it is not a biggie. I think that it would be more productive to figure out what we need to do for version 3 of the Schema -- for instance, it would be easier to use linkML with a "fresh" project. I think that what we need now is really implementations and algorithms and analyses, and I think it would create a lot of unnecessary churn if we change the file structure right now! I will close this, but let's meet to plan for the future