openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

VV Ensembl phase 2 #451

Open Peter-J-Freeman opened 1 year ago

Peter-J-Freeman commented 1 year ago

Is your feature request related to a problem? Please describe.
We need to finish off the Ensembl variant build.

Outstanding tasks
- Build more tests
- Finish off a base set of tests for variants in transcripts with gapped alignments for Ensembl data and see what the code does
- Adjust the code based on discussion

Main team
- @John-F-Wagstaff will take over the lead on this
- @sbenny1230 will hand over and can still be involved if time allows

End User comments
@ifokkema @leicray

Collaborations
Will be added

John-F-Wagstaff commented 1 year ago

Hello @sbenny1230, I hope you are doing well. I have had a look at your branch and so far it looks good. Are there any differences in usage versus the unmodified version? Also, are there any pitfalls that you want to warn me about, or problems that you have not yet solved, other than the problems with gapped alignments and the number of tests? So far I am thinking of making a "main Ensembl dev" branch, using yours as a basis, then branching off this for my own commits. Would this be OK with you?

With the gapped sequences I can clear one thing up. Because Ensembl only gives the spans as input, and asserts that its transcripts are identical to the genome, we used to build the alignments assuming that the two are identical. I am afraid that the higher levels of the code are likely to follow garbage-in, garbage-out principles if these assertions do not hold. Fixing this will probably need some work on the database side of things, i.e. we are going to need to add some checks and some alignment handling to the Ensembl data processing, and then re-load the Ensembl data into the database again. Depending on how this goes, and if people think it is OK to work from this angle, getting the input right may let us avoid any extra Ensembl-specific gap code in VariantValidator, or limit any such code to just warnings.
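To make the idea of a database-side check concrete, here is a minimal sketch in Python of what verifying the "identical to the genome" assertion could look like. It is not taken from VariantValidator or its database loaders: the `Exon` class, the `check_spans` function, and the 0-based half-open coordinate convention are all assumptions made for illustration. It only detects span length differences and base mismatches; it does not attempt to build a gapped alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon span; all coordinates are 0-based, half-open."""
    g_start: int  # start on the genomic sequence
    g_end: int    # end on the genomic sequence
    t_start: int  # start on the transcript sequence
    t_end: int    # end on the transcript sequence


def revcomp(seq):
    """Reverse-complement a plain ACGT sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def check_spans(genome_seq, transcript_seq, exons, strand=1):
    """Return a list of problems; an empty list means every exon span
    matches the genome exactly, so the identity assumption holds."""
    problems = []
    for i, ex in enumerate(exons, start=1):
        g_len = ex.g_end - ex.g_start
        t_len = ex.t_end - ex.t_start
        if g_len != t_len:
            # A length difference implies an insertion or deletion (a gap).
            problems.append(
                f"exon {i}: length mismatch (genome {g_len}, transcript {t_len})"
            )
            continue
        g = genome_seq[ex.g_start:ex.g_end]
        if strand == -1:
            g = revcomp(g)
        t = transcript_seq[ex.t_start:ex.t_end]
        mismatches = sum(1 for a, b in zip(g.upper(), t.upper()) if a != b)
        if mismatches:
            problems.append(f"exon {i}: {mismatches} mismatched base(s)")
    return problems
```

A loader could run something like this per transcript and only fall back to a real alignment step, or emit a warning, for the transcripts where the returned list is non-empty.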

@Peter-J-Freeman do we want to use bug #416 to curate a list of known variants for testing and development, rather than clogging up here? We probably already have enough to start with, but did you have any source of these other than literature searches and personal reports? If you already have an existing list of known problem transcripts, and we can find in it an ENS transcript that does not match the genome but does match a specific RefSeq transcript, then that would be the absolute gold standard for our tests.

Peter-J-Freeman commented 1 year ago

Hi @John-F-Wagstaff. Agreed, use old issues rather than creating new ones where possible. I can send you the dissertation from @sbenny1230 as a starting point.

John-F-Wagstaff commented 1 year ago

The dissertation would be good, if that is OK with both of you.

sbenny1230 commented 1 year ago

Hi @John-F-Wagstaff @Peter-J-Freeman, apologies for not replying sooner. I didn't notice this until now.

As far as I am aware, it's just the issues mentioned about the gap alignments and those I have mentioned in my dissertation. I can send it to you now if you haven't received it yet. A few of the test cases I have created for the Ensembl inputs should fail. These will be the issues I found during the project.

In the meantime I can take a look at the project again and my notes to see if there are any other issues I spotted but haven't noted down.

Happy to do a call to talk through anything. I'm working full time right now so I can only really work on the project in the evenings but I'm happy to still contribute to it.

John-F-Wagstaff commented 1 year ago

Hello @sbenny1230 thank you for getting back to us on this, I hope you are doing well in your work. I would appreciate any extra issues you noted down outside your dissertation, when you have the time to have a check. Once I have been through the existing list of issues in your dissertation, and made git issues for them, it would also be helpful to have some pointers as to whether I interpreted them correctly, or missed something out.

@Peter-J-Freeman I assume it is OK to use this issue as a meta-issue to keep track of the individual tasks? If so, then once I have had a look at the list of known issues I will start posting them, and getting them coordinated through here. We can plan specifics and sort the task priorities once we have a full picture of the tasks needed.

Peter-J-Freeman commented 1 year ago

@John-F-Wagstaff. Use the issues in whatever way works for you.