nmdp-bioinformatics / phycus

Service used for curation of Haplotype Frequency
GNU Lesser General Public License v3.0

Add constraints to labels #98

Open mmaiers-nmdp opened 5 years ago

mmaiers-nmdp commented 5 years ago
  1. We want to constrain label types. For example, a labelType of "ICCBBA ION" could refer to the data here. We don't want label types of "ICCCBBAAA ION", etc.

  2. We want an explicit way to create new label types (and to show what label types are in the database): have a REST endpoint, GET/POST LabelType. Other label types: DOI (reference to a manuscript) and PMID (PubMed ID).
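
To make the two points above concrete, here is a minimal in-memory sketch of what a constrained label type list could look like. The class name `LabelTypeRegistry` and its methods are hypothetical and are not part of the Phycus code base or the swagger spec; `register` stands in for POST LabelType and `list` for GET LabelType.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a LabelType resource; names are assumptions.
class LabelTypeRegistry {
    private final Set<String> types = new LinkedHashSet<>();

    // Analogous to POST LabelType: explicitly create a new label type.
    // Returns false if the normalized type is already registered.
    boolean register(String labelType) {
        if (labelType == null || labelType.trim().isEmpty()) {
            throw new IllegalArgumentException("label type must not be blank");
        }
        return types.add(normalize(labelType));
    }

    // Analogous to GET LabelType: show which label types exist.
    Set<String> list() {
        return Collections.unmodifiableSet(types);
    }

    // A label whose type was never registered (e.g. "ICCCBBAAA ION") is rejected.
    boolean isKnown(String labelType) {
        return labelType != null && types.contains(normalize(labelType));
    }

    // Trim and upper-case so that " iccbba ion " and "ICCBBA ION" are the same type.
    private static String normalize(String labelType) {
        return labelType.trim().toUpperCase(Locale.ROOT);
    }
}
```

With "ICCBBA ION", "DOI" and "PMID" registered, `isKnown("ICCCBBAAA ION")` returns false, so a client could refuse the label before it reaches the database. Whether normalization should be case-insensitive is a design choice left open here.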

sjmack commented 5 years ago

For 1. EBMT provides a PDF that lists a lot of IONs, but is there a more accessible (and potentially comprehensive) list (text-formatted maybe?) or repository that could be queried for IONs? Or is EBMT the place (or is the www.iccbba.org/ document above the place to go)?

We would want to validate the ION against something before sending it to the database.

For 2. above, I suggest we use the NCBI's PMCID - PMID - Manuscript ID - DOI converter to convert all provided PubMed IDs, PMC IDs, NIHMS IDs, UK IDs, etc. into DOIs. This can also be used to validate a provided ID, so that the client can reject the label if the ID is invalid.
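
Before calling the converter, the client could do a cheap local check on the shape of the provided ID. The sketch below is only an assumption about what such a pre-check might look like: the class `ReferenceIdCheck` is hypothetical, the patterns cover PMIDs, PMCIDs, NIHMS IDs and DOIs only (not the other ID schemes mentioned), and the actual converter call is left out because its endpoint and response format are not given in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical pre-check; it looks at the shape of an ID, not whether it exists.
class ReferenceIdCheck {
    enum Kind { PMID, PMCID, NIHMSID, DOI, UNKNOWN }

    // PMID: a positive integer.
    private static final Pattern PMID = Pattern.compile("[1-9]\\d*");
    // PMCID: "PMC" followed by digits, optionally with a version suffix.
    private static final Pattern PMCID = Pattern.compile("PMC\\d+(\\.\\d+)?");
    // NIH manuscript ID: "NIHMS" followed by digits.
    private static final Pattern NIHMSID = Pattern.compile("NIHMS\\d+");
    // DOI: "10." prefix, registrant code, slash, non-blank suffix.
    private static final Pattern DOI = Pattern.compile("10\\.\\d{4,9}/\\S+");

    static Kind classify(String raw) {
        if (raw == null) {
            return Kind.UNKNOWN;
        }
        String id = raw.trim();
        if (PMID.matcher(id).matches()) return Kind.PMID;
        if (PMCID.matcher(id).matches()) return Kind.PMCID;
        if (NIHMSID.matcher(id).matches()) return Kind.NIHMSID;
        if (DOI.matcher(id).matches()) return Kind.DOI;
        return Kind.UNKNOWN;
    }
}
```

An ID classified as `UNKNOWN` could be rejected immediately; anything else would still need to go through the converter, since a well-formed ID can refer to nothing.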

However, what is to be done in cases where the haplotyping generator group does not have an ION and does not have a published reference for the data? This is not an uncommon issue with AFND, where unpublished data are loaded into AFND with no external citation.

mmaiers-nmdp commented 5 years ago
  1. The list of IONs is the XML in the link above, based on this XSD (https://www.iccbba.org/docs/tech-library/database/grid-issuing-organizations-xml-schema.xsd). ION should be optional.

kaeaton commented 5 years ago

The ION database exists only as either an XML document or an Excel document (that could be exported to CSV). The difference between the two is primarily that the Excel version includes inactive facilities. There are a couple of issues here:

A. They don't have any sort of flagging notification that they've updated the db. They keep a pdf log of the changes (they've added one facility a year since 2017), but there's no real way that I know of to ping and see if the files have changed short of redownloading them. Problems:

  1. We have no way of knowing when they add new data.
  2. We have no easy way of getting the data to check because it's file based, not a true database.

This second one brings up:

B. I can either download the XML and have the program parse it, or, if we want the inactive facility IONs as well, download the Excel file and convert it into a CSV file that gets physically saved with the program. (There's a resource folder that allows you to add non-Java files and access them within a compiled jar. It's how I added the help documentation.) In either case there are issues:

  1. If a facility has deactivated an ION, but still has data they (or someone else) want to upload using that old ION, we need the contents of the Excel file.
  2. If the facility is a new one, short of manually updating the program with a new CSV file, the only option is using the XML file and forcing the program to re-download and reread it. With this option we could conceivably put a button in options to force a re-download and parsing of the XML file. But that only works if ICCBBA has added the new facility to the db files.
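
For the CSV option, a rough sketch of loading the known IONs into a set follows. The column layout is an assumption (a header row, with the ION in the first column); the real export from the Excel file may differ, and the class name `IonListLoader` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader; assumes a header row and the ION in the first CSV column.
class IonListLoader {
    static Set<String> load(Reader source) {
        Set<String> ions = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            reader.readLine(); // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split: fine only while the first column is never quoted.
                String ion = line.split(",", 2)[0].trim();
                if (!ion.isEmpty()) {
                    ions.add(ion);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ions;
    }
}
```

Because `load` takes a `Reader`, the same code could be fed from a file bundled in the resource folder (for example via `getResourceAsStream` wrapped in an `InputStreamReader`) or from a freshly downloaded copy, which keeps the "bundled versus re-download" decision separate from the parsing.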

The Phycus GUI currently does neither, but does screen to make sure the ION is valid per the ICCBBA naming conventions. (A four-digit number that cannot start with a 0. So 1000 - 9999.)
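
The format rule described above (four digits, no leading 0) can be expressed as a single regular expression. This is a sketch of the rule only, not the GUI's actual code; the class name `IonFormat` is made up for illustration.

```java
import java.util.regex.Pattern;

// Sketch of the format screen: four digits, first digit 1-9 (1000 to 9999).
class IonFormat {
    private static final Pattern ION = Pattern.compile("[1-9][0-9]{3}");

    // Checks the shape only; it does not confirm the ION was ever issued.
    static boolean isValid(String ion) {
        return ion != null && ION.matcher(ion).matches();
    }
}
```

Note that a value can pass this check and still not correspond to any facility, which is why the lookup against the ICCBBA list is a separate question.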

kaeaton commented 5 years ago

Regarding the other labels: right now we have haplotyping entity (this used to include the ION, now they're separate labels), genotyping entity, and ION. I was planning on adding DOIs next using the generator Steve found for converting assorted other IDs into DOIs.

Haplotyping and genotyping entities are inherited fields and can be changed or dropped. (The Java CLI included them by default, but didn't actually include a way to specify them, they were hard coded in the function.)

Martin, when you say you want an explicit way to create new label types, were you thinking of something like populations where they have to manually add them before using them? What about the values of these labels?

Also, we still need a place to put some sort of attribution data. Would putting that as a label be a valid option? And if yes, how do we want to do that? Name? Phone/email/address? If all of the above, separate labels for each? According to curation-swagger-spec.yaml there's a way to pull this information back out of the database, so having a tab with this all in it is an option. Maybe a dropdown with the available labels in it that, when one is selected, shows the values found in the database associated with that particular label?