monarch-initiative / monarch-phenote

stub for monarch phenote

HPO-mode WebPhenote support - TSV Import/Export, additional columns #46

Open DoctorBud opened 7 years ago

DoctorBud commented 7 years ago

This is the summary of the desired behavior @pnrobinson needs to proceed with using WebPhenote effectively for curation.

The current plan is to enable Peter to import an existing TSV, edit and extend it within WebPhenote, and to then export the new model to a TSV. Some columns need to be preserved round-trip, even though they may not be visible or editable in WebPhenote.

1) These are the columns: 'Disease ID', 'Disease Name', 'Gene ID', 'Gene Name', 'Genotype', 'Gene Symbol(s)', 'Phenotype ID', 'Phenotype Name', 'Age of Onset ID', 'Age of Onset Name', 'Evidence ID', 'Evidence Name', 'Frequency', 'Sex ID', 'Sex Name', 'Negation ID', 'Negation Name', 'Description', 'Pub', 'Assigned by', 'Date Created'

2) None of the 'XXX Name' fields needs to be preserved round-trip (through import/export), as long as I can ensure that each 'XXX Name' field is populated with the name derived from the corresponding 'XXX ID' field. So the actual fields to be preserved round-trip are:

'Disease ID', 'Gene ID', 'Genotype', 'Gene Symbol(s)', 'Phenotype ID', 'Age of Onset ID', 'Evidence ID', 'Frequency', 'Sex ID', 'Negation ID', 'Description', 'Pub', 'Assigned by', 'Date Created'

3) I'm not sure what to do with 'Genotype' and 'Gene Symbol(s)'. Presumably, 'Genotype' will be something like 'MGI:3711884' (https://monarchinitiative.org/genotype/MGI:3711884) and the 'Gene Symbol(s)' value for that would be 'Gas1/Gas1; Shh/Shh<+> [involves: 129S1/Sv 129X1/SvJ C57BL/6J]', which is derivable from Monarch, so I don't need to store it.

4) The Gene and Genotype columns are NOT going to be visible or editable in WebPhenote, but they must be preserved round-trip.

5) The 'Assigned by' field will not be visible or editable, but must be preserved.

6) Negation, Frequency and Sex columns need to be added to WebPhenote and made visible and editable.

7) Summarizing (assuming anything VISIBLE/EDITABLE is preserved):

VISIBLE/EDITABLE 'Disease ID'
VISIBLE/DERIVED 'Disease Name'
HIDDEN/PRESERVED 'Gene ID'
DERIVED 'Gene Name'
HIDDEN/PRESERVED 'Genotype'
DERIVED 'Gene Symbol(s)'
VISIBLE/EDITABLE 'Phenotype ID'
VISIBLE/DERIVED 'Phenotype Name'
VISIBLE/EDITABLE 'Age of Onset ID'
VISIBLE/DERIVED 'Age of Onset Name'
VISIBLE/EDITABLE 'Evidence ID'
VISIBLE/DERIVED 'Evidence Name'
VISIBLE/EDITABLE 'Frequency'
VISIBLE/EDITABLE 'Sex ID'
VISIBLE/DERIVED 'Sex Name'
VISIBLE/EDITABLE 'Negation ID'
VISIBLE/DERIVED 'Negation Name'
VISIBLE/EDITABLE 'Description'
VISIBLE/EDITABLE 'Pub'
HIDDEN/PRESERVED 'Assigned by'
HIDDEN/PRESERVED 'Date Created'
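
To make the round-trip behavior in item 7 concrete, here is a minimal Python sketch of how the import/export could work. It is only an illustration under assumptions, not WebPhenote code: the `POLICY` table is transcribed from the summary above, while `import_tsv`, `export_tsv`, `source_column` and the `lookup_label` callback are hypothetical names. In particular, `lookup_label` stands in for whatever service would resolve an ID to its name (e.g. a Monarch lookup), and the rule that 'Gene Symbol(s)' is derived from 'Genotype' follows item 3.

```python
import csv
import io

# Column policy transcribed from the summary in item 7 (order = TSV column order).
POLICY = {
    "Disease ID": "VISIBLE/EDITABLE", "Disease Name": "VISIBLE/DERIVED",
    "Gene ID": "HIDDEN/PRESERVED", "Gene Name": "DERIVED",
    "Genotype": "HIDDEN/PRESERVED", "Gene Symbol(s)": "DERIVED",
    "Phenotype ID": "VISIBLE/EDITABLE", "Phenotype Name": "VISIBLE/DERIVED",
    "Age of Onset ID": "VISIBLE/EDITABLE", "Age of Onset Name": "VISIBLE/DERIVED",
    "Evidence ID": "VISIBLE/EDITABLE", "Evidence Name": "VISIBLE/DERIVED",
    "Frequency": "VISIBLE/EDITABLE",
    "Sex ID": "VISIBLE/EDITABLE", "Sex Name": "VISIBLE/DERIVED",
    "Negation ID": "VISIBLE/EDITABLE", "Negation Name": "VISIBLE/DERIVED",
    "Description": "VISIBLE/EDITABLE", "Pub": "VISIBLE/EDITABLE",
    "Assigned by": "HIDDEN/PRESERVED", "Date Created": "HIDDEN/PRESERVED",
}


def is_derived(column):
    """Derived columns are never stored; they are recomputed on export."""
    return POLICY[column].endswith("DERIVED")


def source_column(column):
    """Assumption: 'X Name' derives from 'X ID'; 'Gene Symbol(s)' from 'Genotype'."""
    if column == "Gene Symbol(s)":
        return "Genotype"
    return column[: -len("Name")] + "ID"


def import_tsv(text):
    """Read a TSV and keep only the stored (non-derived) columns of each row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    stored = [c for c in POLICY if not is_derived(c)]
    return [{c: row.get(c) or "" for c in stored} for row in reader]


def export_tsv(rows, lookup_label):
    """Write all 21 columns, filling derived ones via the hypothetical lookup_label(id)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(POLICY), delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        full = dict(row)
        for column in POLICY:
            if is_derived(column):
                key = row.get(source_column(column), "")
                full[column] = lookup_label(key) if key else ""
        writer.writerow(full)
    return out.getvalue()
```

With this shape, the editor UI would only ever touch the columns whose policy starts with VISIBLE, while the HIDDEN/PRESERVED values ride along in each row dictionary and come back out unchanged on export.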

cmungall commented 7 years ago

@pnrobinson or @drseb - can you confirm that we need gene, genotype?

pnrobinson commented 7 years ago

We have very few entries that have some genotype information in them, but I think we should probably discard this information and try to do it better with new formats -- probably phenopackets plus VMC genopackets. The gene ID is redundant because we have this from the OMIM ID or whatever disease+gene ID we are going to use. @drseb Sebastian, please confirm this: do we have any information in these fields that we need to save?

cmungall commented 7 years ago

OK, good, I thought so. Of course we will have the originals archived, so if we need to mine anything from them we can. But let's move forward.

cmungall commented 7 years ago

What is the TSV requirement?

As originally envisioned, the web interface would not need to deal with this. There would be an external procedure for importing en masse (e.g. from @drseb's NER pipeline) and an external procedure for exporting to various formats for publication (including the TSV format distributed on the HPO site, and optionally PXF).

pnrobinson commented 7 years ago

If everything worked perfectly, then we would not need the multiple TSV files. But is there any reason not to keep things this simple, especially until we have tested the system? I am a little concerned about putting all of this information into OWL because it seems like massive overkill for what is classic TSV data. Maybe we should have a teleconference about this?

jmcmurry commented 7 years ago

I would be eager to get the core requirements nailed down and prioritized as soon as possible, whether by teleconference or other means. Peter, I think we are all agreed that OWL as a representation is not required for your current work. However, I personally would like to understand whether you think that ontological constraints (for things like frequency modeled as a qualifier) are overkill too. Could you please clarify?

pnrobinson commented 7 years ago

Julie, can we Skype and define what kind of document you need? I will then draft it in Google Docs to get things started. I do not understand exactly what you mean by ontological constraints; in any case, I think that there are only syntactic constraints on fields like frequency, but in the future we can improve on this.