monarch-initiative / genophenocorr

Genotype Phenotype Correlation
https://monarch-initiative.github.io/genophenocorr/stable
MIT License

Number Protein Regions #176

Open lnrekerle opened 1 week ago

lnrekerle commented 1 week ago

Update ProteinMetadata so that each Feature has a unique label. For example, if multiple Features are labeled "Disordered", they would become "Disordered_1", "Disordered_2", etc.

Definition of Done (DoD): A unit test is created with an example protein that has multiple Features of the same name. An analysis passes and produces the expected output when testing just one of those Features rather than all of them.

EDIT: After discussion, we will not be numbering features and will instead list them in a table for users to select.

New DoD: The viewer prints a protein table that lists all protein features and their locations on the amino acid sequence. If Uniprot provides similar features (e.g. "EGF-like 1", "EGF-like 2"), the protein visualizer will show them under the same label ("EGF-like") and in the same color. They will still be listed separately in the table.
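A minimal sketch of such a table, assuming a hypothetical `Feature` record (not the genophenocorr class) and made-up coordinates. It keeps the Uniprot label untouched and derives a separate group label only for display, on the assumption that a trailing integer is a per-copy counter:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a protein feature, for illustration only."""
    label: str  # label exactly as provided by Uniprot
    start: int  # first residue, 1-based
    end: int    # last residue, 1-based, inclusive


# Assumption: a trailing integer (e.g. "EGF-like 2") is a per-copy counter.
_TRAILING_NUMBER = re.compile(r"\s+\d+$")


def display_label(label: str) -> str:
    """Return the label with a trailing counter removed, for display only."""
    return _TRAILING_NUMBER.sub("", label)


def format_table(features) -> str:
    """List every feature separately, sorted by start position."""
    lines = [f"{'Label':<14}{'Group':<12}{'Start':>6}{'End':>6}"]
    for f in sorted(features, key=lambda f: f.start):
        lines.append(
            f"{f.label:<14}{display_label(f.label):<12}{f.start:>6}{f.end:>6}"
        )
    return "\n".join(lines)


# Illustrative coordinates, not taken from a real Uniprot entry.
features = [
    Feature("EGF-like 1", 81, 112),
    Feature("EGF-like 2", 115, 146),
    Feature("Disordered", 1, 40),
]
print(format_table(features))
```

The original labels stay in the first column, so the table still matches what a user sees at Uniprot. Labels that carry other suffixes would need a different grouping rule.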

ielis commented 9 hours ago

Hi @lnrekerle @pnrobinson

I am not sure if this is a good idea. I do not think we should edit labels assigned by the Uniprot data submitters. This can lead to perplexing situations where the domain labels in our visualizations will be different from what the user sees at Uniprot.

If we want to support testing of just one feature, then we must choose one of the following:

I strongly advise against modification of the data we receive from Uniprot.

pnrobinson commented 8 hours ago

@ielis Note that Uniprot is not consistent. For instance, see https://www.uniprot.org/uniprotkb/P35555/entry (search for "EGF": there are 43 numbered domains). In other cases, the domains are not numbered. For downstream use cases, it would be better to treat all EGF domains as a group and maybe present a table with the positions of each motif so that the user can also correlate according to position. I agree we should definitely not add numbering.

ielis commented 8 hours ago

OK, it is sad if Uniprot is not consistent, but we cannot update the labels on our side anyway. Therefore, :+1: for

I agree we should definitely not add numbering.


Second, to support the following:

For downstream use cases, it would be better to treat all EGF as a group and maybe present a table with the positions of each motif so that the user can also correlate according to positions.

we can also write a VariantPredicate that takes a regular expression, e.g. EGF-like.* or a re.Pattern (even better), and tests the variant for overlap with any protein feature that matches the pattern.
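A rough sketch of that idea, under stated assumptions: `Feature` and `FeaturePatternPredicate` are hypothetical names, not part of the genophenocorr API, and the variant is reduced to a 1-based inclusive range of affected protein residues. The coordinates are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a protein feature, for illustration only."""
    label: str  # label exactly as provided by Uniprot
    start: int  # first residue, 1-based
    end: int    # last residue, 1-based, inclusive


class FeaturePatternPredicate:
    """Test whether a variant overlaps any feature whose label matches a pattern."""

    def __init__(self, pattern: Union[str, re.Pattern], features: Sequence[Feature]):
        # Accept either a pattern string or a precompiled re.Pattern.
        compiled = pattern if isinstance(pattern, re.Pattern) else re.compile(pattern)
        # Select the matching features once; labels themselves are never modified.
        self._regions = tuple(f for f in features if compiled.fullmatch(f.label))

    def test(self, start: int, end: int) -> bool:
        """Return True if residues start..end overlap any matching feature."""
        return any(f.start <= end and start <= f.end for f in self._regions)


# Illustrative coordinates, not taken from a real Uniprot entry.
features = [
    Feature("Disordered", 1, 40),
    Feature("EGF-like 1", 81, 112),
    Feature("EGF-like 2", 115, 146),
]
predicate = FeaturePatternPredicate(re.compile(r"EGF-like.*"), features)
print(predicate.test(100, 100))  # residue 100 lies inside "EGF-like 1"
print(predicate.test(20, 20))    # residue 20 lies only inside "Disordered"
```

Matching with `fullmatch` means the pattern has to cover the whole label, so `EGF-like.*` selects both numbered and unnumbered EGF-like features while the labels stay exactly as received.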

pnrobinson commented 8 hours ago

For the graphic, all the motifs need to have the same color and label!
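One way this could look, as a sketch only: `group_label`, `assign_colors` and the palette are hypothetical and not taken from the genophenocorr visualizer. Features are grouped by stripping a trailing counter, which is an assumption about how the numbered labels are written, and each group gets one color.

```python
import re
from itertools import cycle

# Arbitrary example palette; any list of color codes would do.
PALETTE = ("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")


def group_label(label: str) -> str:
    """Drop a trailing counter, so "EGF-like 1" and "EGF-like 2" share a group."""
    return re.sub(r"\s+\d+$", "", label)


def assign_colors(labels):
    """Map each group label to one color, in order of first appearance."""
    palette = cycle(PALETTE)  # colors repeat if there are more groups than colors
    colors = {}
    for label in labels:
        group = group_label(label)
        if group not in colors:
            colors[group] = next(palette)
    return colors


colors = assign_colors(["EGF-like 1", "TB 1", "EGF-like 2", "Disordered"])
for group, color in colors.items():
    print(group, color)
```

The grouping happens only in the drawing code, so the labels stored with the protein metadata remain as received from Uniprot.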

ielis commented 8 hours ago

That's OK, and we must implement it as separate business logic. However, if for any reason we would like to connect the visualisation and the predicates, please note that this is the first piece of :spaghetti: that will make it into the codebase.