DISCUSSION: Add column(s) to vocabularies table to identify vocabularies that came from a specific database

marc-outins commented 6 years ago

Since we are going to adding database specific vocabularies I think we should store what database they came from in the vocabularies table.

aguynamedryan commented 6 years ago

So, say we have two SEER databases, ALL and CLL, which share some of the same vocabularies, what do we put in this column?

marc-outins commented 6 years ago

SEER since ALL, CLL, etc. is just a cut of the whole SEER Medicare database. One question is do we distinguish between SEER and SEER Medicare. Currently vocabs that come from the SEER medicare data I'm giving ids of SEER_ .

aguynamedryan commented 6 years ago

So is this column storing the data vendor's name? Or the dataset name?

marc-outins commented 6 years ago

whatever we want to call it, definitely didn't mean we would have different vocabularies for each version of a database if said vocabulary is the same across databases. But it would be nice to know the source of the vocabulary.

markdanese commented 6 years ago

I think we need to simply name the vocabulary properly and have a description or source field. SEER is confusing because some things come from SEER, some are NAACCR, some are adapted from NAACCR, and some are adapted from other sources (e.g., AJCC). In other words, just because a vocabulary is in SEER doesn't mean it comes from SEER. I think this would be better called "source", which may, or may not, relate to the database precisely. I think source could be more of a description to say something like "NAACCR grade adapted by SEER version 2".

markdanese commented 6 years ago

A related example would be something like CMS place of service codes which may be in Medicare or other databases. But the source description is "CMS place of service". In other words, I think we should say what it is, and where it comes from in a way to identify it clearly.

marc-outins commented 6 years ago

In the CMS place of service situation the vocabularies table would contain a vocabulary for CMS place of service and the "Source" column would be CMS and we are doing that with any vocab use in SEER that is defined somewhere else. I'm talking about vocabularies that are defined by the organization who cut the data. So for SEER Medicare there is a variable marst1-10 which contain marital status as defined by SEER. So I create a vocab called SEER_MARST.

markdanese commented 6 years ago

In that case, I would call the source SEER (or NCI, but I prefer SEER since NCI might have more than 1 version). Having the vocab name include the source, when relevant, is a good reminder. Although I wonder if we should be more precise about this.

Consider a variable for Sex, which will be in every database and defined differently in many, but not all. Do we name them differently for each datasource? In other words do we give them more generic names like "Sex_M_F" and "Sex_0_1" and "Sex_Male_Female"? Or do we call them all "Sex" and then use a source field to distinguish them? Or do we need a "type" field to categorize them?

I don't think there is a perfect answer here. I think we need something that can be clearly implemented and searched. So, I guess I am leaning toward something like "Vocabulary Name" (SEER_MARST or CMS_SEX), Source (SEER or CMS), Type (Marital Status or Sex), and "Description" (SEER marital status variable or CMS sex variable).

Obviously in these examples, there is redundancy, but in other situations it will be helpful. I am thinking of Vocabulary = "SEER_grade", Source = "SEER", Type = "Cancer grade", and Description = "NAACCR grade adapted by SEER".

outcomesinsights / generalized_data_model

DISCUSSION: Add column(s) to vocabularies table to identify vocabularies that came from a specific database #112