openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

Replace variants separated wth Pipe by a JSON structure #553

Closed Peter-J-Freeman closed 7 months ago

Peter-J-Freeman commented 7 months ago

Is your feature request related to a problem? Please describe. We currently use a | to string together variants

Describe the solution you'd like @ifokkema and Reece Hart suggested JSON could be used instead

Hopefully @ifokkema can make suggestions :). @John-F-Wagstaff . Will need your input on this.

ifokkema commented 7 months ago

Hopefully @ifokkema can make suggestions :). @John-F-Wagstaff . Will need your input on this.

Sure! Check out the LOVD API Swagger page. Try the checkHGVS endpoint. The interface allows for two example inputs; a single variant description, and a JSON formatted list of variant descriptions. That's the syntax I use to submit multiple variants to my API.

Respectively: NM_002225.3:c.157C>T versus ["NM_002225.3:c.157C>T","NC_000015.9:g.40699840C>T"]

My API's logic is relatively simple; if the input starts with a [, try to parse it as JSON. If that doesn't work, regard it as one single variant description. If it does parse, process each list item as one variant description.

John-F-Wagstaff commented 7 months ago

Yes if we want to switch to JSON we definitely want to use @ifokkema 's method or similar, most end users will be using the single variant case and we don't want to break that.

Just saying we split on space should also work, using python's default split behaviour(i.e. split on any of ' ' \t \n etc.), it avoids some of the complexity, for us and for users, but could be more typo prone.

I don't have a problem with either, so I would do what you think works best for the core VV code.

When we are passing variants around internally (as well as via/from different VV endpoints) it would be a lot nicer to be able to use a list instead of relying on repackaging variants into a special case overload for the same string input. This might be a bit much of a extra job to be worth doing though.

Peter-J-Freeman commented 7 months ago

Agreed.

It will require an update the underling VV code is several places

variantvalidator_variants

select_transcripts

gene2transcripts

Changes to the rest_vv code endpoints and also probably need to update the web interface so select_transcripts becomes a list not a textbox

so, if I unserstand correctly, the JSON is just a list?

["v1", "v2", "v3"] We don't need something like {"variants": ["v1", "v2", "v3"]}?

We always process lists so no need to check for JSON, we just skip the split statements.

Peter-J-Freeman commented 7 months ago

Ah wait, this answers it

https://www.w3schools.com/js/js_json_arrays.asp

We need the string to be '["v1", "v2", "v3"]' which we can then use json.loads to convert into a list.

@John-F-Wagstaff , am I OK to crack on with this?

leicray commented 7 months ago

Be prepared for some users to paste in lists of variants that have been copied from Microsoft Word documents where smart quotes are automatically generated to replace straight quotes:

[“v1”, “v2”, “v3”]

Not all users are savvy enough to avoid word processors.

Peter-J-Freeman commented 7 months ago

Good point @leicray. Adding this to tests etc to see how to handle it

John-F-Wagstaff commented 7 months ago

Hello Pete

This is a big breaking change to how we work, we don't want to just jump in, how do you intend to handle backwards compatibility, will we have an option to enable it for VV at least?

On the other front I did mention this in passing last time we chatted on this subject, but obviously did not make it clear in my last comment. I am OK with you doing what you feel is best internally, for the core VV code, but only in the context of us using it internally, not if you want to ask non technical end users to enter things in JSON.

API users have different expectations, but we do not want to present users of the website with the opportunity to run into syntax errors for things that used to work, or which used to just break the validation of the single variant they typoed. [“v1”, “v2”, “v3”] is bad enough but ["v1"", "v2", "v3"], ["v"1", "v2", "v3"] and ["v1, "v2", "v3"] are all going to happen, and cause frustration regularly enough. If we are reading in JSON using normal (well tested standard) tools then this kind of failure will break the whole input. At the very least I don't think we gain anything from asking for ["v1", "v2", "v3"] instead of "v1", "v2", "v3" from the actual website users.

This is particularly important for us as usability, and friendliness to non technical users, is a very high priority for our website. If we want to be The Place to go to get variants validated we have to take this into account, or get supplanted for more normal users by what may even be heavily inferior tools.

This does not necessarily constrain the internals though, as we have multiple stages each with their own natural data exchange format, several of which are different, and none of which have to be the same. For example user->website java script->API->Variant formatter(optional)->Variant Vallidator which should be a fairly normal process for us, has many different steps. Making excess changes is bad but, particularly for the website java script->API, and API->VV/VF steps, the input formats do not need to be the same as the output format at all. The API's input standard can float freely separate from either what we ask users for and what VV uses and the same for VV VS the API or the website users, at least if we use java script to manage user input.

If we had no constraints on time, or complexity, and wanted the ideal form for each exchange then - user->website is naturally best done in a plain text format java script->API may be best done in JSON (same with saving data to a file, which is not part of this example process) API->Variant formatter(optional) and VF->VV are best done in python lists

Given that this is a fairly major breaking change for backwards compatibility we want to think carefully before we jump and chose the right tool for our use cases. If we want the user input to no longer match the internal object then we also want to chose the right order to make the changes, probably working from the outside inwards, I think (i.e. API/website first).

If we skip java script transformation of user input, to avoid having to change or add complexity to the website then we have to impose whatever form the API uses on the non technical users of the website, which constrains what we want to do with the API.

If we want to use the same format for both the API and the VV object directly then we probably have to fix the VV object to use whatever works best for our APIs.

If we intend to pass the same format all the way up then I am much more in favour of using space as a separator, not JSON, if it does not break the website (and perhaps even if we have to fix the website to handle it). Usability is too important to us for me to feel comfortable doing otherwise, unless we gain something specific for making the users jump through hoops.

So when and where are we talking about using which method?

ifokkema commented 7 months ago

My 2 cents, although not VV specific;

John-F-Wagstaff commented 7 months ago

Thanks,

This is about what I was thinking for new code, and possibly our future, with the caveat that a JSON API won't work as an endpoint for a non javascript dependent form input.

Unfortunately we are somewhat constrained by our original design choices, each of which made sense at the time as an extension of the previous status quo. This passes the users input directly from the website(for non technical users), or API(for users with technical skills), down to the python object as a string, unchanged. We are making a compatibility breaking change at this point, so we can change things now, and we want to move towards a sensible direction, but each extra change represents an extra risk of breaking currently working stuff, also changing the API will impose on our end users. This adds to the complexity of the decision.

ifokkema commented 7 months ago

Some random ideas that may or may not be useful:

Peter-J-Freeman commented 7 months ago

This is a big breaking change to how we work, we don't want to just jump in, how do you intend to handle backwards compatibility, will we have an option to enable it for VV at least?

I did think about this. My plan was to ensure the old style pipe delimited format will still be handled. Basically, take my latest push as a V1. All tests pass. Now add new code that still accepts the | delimited. This was my next question, but I didn't have time today to present it. See next comment for how I want to do it

What do you think @John-F-Wagstaff

Peter-J-Freeman commented 7 months ago

For the website, etc, I would never allow/request JSON input. If I allow multiple values to be entered, I provide the user a textarea and will split the input based on newlines or, in the case of HGVS nomenclature input, any whitespace. Internally, I split this into a list and continue on my way.

Agreed. This is how we currently handle web inputs but string everything together with |. OK, it was a bad design decision, but it was what is was. My current plan is to handle conversion to JSON in the website code. The Web app and the API app will convert pipe delimited to JSON in the background. I want to keep the core VV package clean.

I could make the code pass a list directly, but the lesson from Aries is that we can expect JSON inputs, and since JSON is a string, it makes sense to pass lists in JSON format. For me anyway. Deserialisation is pretty damn rapid for Python.

I will also need to update VariantFormatter.

My reasoning here is to ensure we do not do this again, i.e. the core code change will serve as a reminder!

Both OK with the use of JSON, but keep it hidden from most users and to keep the old Pipe delimited method operating?

ifokkema commented 7 months ago

My current plan is to handle conversion to JSON in the website code.

Sounds good!

Both OK with the use of JSON, but keep it hidden from most users and to keep the old Pipe delimited method operating?

In case I'm part of "both", sounds good to me! This means I would be able to use JSON input through the APIs, right? And that this would take preference over pipe-delimited input, so I can, in the future, send variants containing pipes?

John-F-Wagstaff commented 7 months ago

If you want to keep on overloading the string to also be a list then JSON avoids the problems of '|' and since encoding/decoding escaped JSON strings is standard should be fine even if they add unexpected extras to the hvgs standard, so I am OK with this.

Despite that I am not particularly keen on stuffing extra variable types into a string, internally. If it wants to be a list I prefer it to be a list to start with, whenever possible, same with dictionaries. Doing otherwise (i.e. stuffing other variable types into a string) has caused weird uninformative bugs for me before. I like my input errors to happen on input, eg. within the API code to handle input, not buried in the later code i.e. in this case within variant validator itself, but this is your code and as much a matter of personal preference as anything else, as they say once bitten twice shy, so if you think it works best don't let me hold you back.

The rest of your description of your plan is perfectly fine with me.

Peter-J-Freeman commented 7 months ago

In case I'm part of "both", sounds good to me! This means I would be able to use JSON input through the APIs, right? And that this would take preference over pipe-delimited input, so I can, in the future, send variants containing pipes?

Correct, the satndard input would be JSON, but we would still accept the pipe delimited string for backwards compatability.

Despite that I am not particularly keen on stuffing extra variable types into a string, internally. If it wants to be a list I prefer it to be a list to start with, whenever possible, same with dictionaries.

I agree, however, the vast majority of the data we handle is strings. We could adapt the VV code to only accept lists, but we would need to re-write a lot of tests etc, and this would be a major major change to the way data is submitted. It also breaks interoperability with other languages. To me, making the core code accept JSON or string is the way to go. Ideally, strings will be converted to JSON before submission to the Core software. It also means we can make better plans to handle the |gom and |lom syntax.

I will put together a dev branch for VV (already complete) VF, the API and Web interface and we can discuss prior to deployment