jhu-bids / termhub-csets

value|concept|code set content for TermHub app

Directory and file structure #1

Open joeflack4 opened 2 years ago

joeflack4 commented 2 years ago

We decided on some basics for how to split up the TermHub-related git repositories, as well as the directory structure of the termhub-csets repo.

Repositories

  1. termhub-csets: For value set storage.
  2. termhub-vocab: For storage of vocabulary artefacts.
  3. TermHub: Various code (CLIs, libs, clients) and services (APIs, UIs) as needed for TermHub. We can split this into several repositories or keep it as a monorepo. The aim is to also have (1) and (2) as git submodules of this repository.

termhub-csets directory structure

Basic structure

- README.md
- indexes/
  - concept_set_names/
  - mappings/
  - bundles/
  - jobs/
- value_sets/
  - csv/
    - n3c-format/
      - code_set.csv
      - concept_set_container_edited.csv
      - concept_set_version_item_rv_edited.csv
    - dih-format/
  - json/
    - <dih_id>/
      - v<#>/
        - <dih_id>.json
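
To make the `value_sets/json/` layout above concrete, here is a minimal sketch of how a path for one value set version could be built. The function name `value_set_json_path` is hypothetical, and treating `<#>` as a non-negative integer is an assumption, since the numbering scheme has not been decided here.

```python
from pathlib import PurePosixPath


def value_set_json_path(dih_id: str, version: int) -> PurePosixPath:
    """Build value_sets/json/<dih_id>/v<#>/<dih_id>.json for one version.

    Assumption: versions are plain non-negative integers (v1, v2, ...).
    """
    if version < 0:
        raise ValueError("version must not be negative")
    return PurePosixPath(
        "value_sets", "json", str(dih_id), f"v{version}", f"{dih_id}.json"
    )


# Example with the placeholder id used elsewhere in this issue.
print(value_set_json_path("12345", 2))
```

Using `PurePosixPath` keeps the result in forward-slash form regardless of operating system, which matches how the paths would appear in the repo and in GitHub URLs.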

Full notes about structure

File/directory structure

- README.md
  - (This would be auto generated)
  - Contents
    - Concept set list
      - Each concept set would have a link to its github folder in this repo
- indexes/
  - concept_set_names/
    - concept_set_name_index.csv
      - id | url? | name
      - <id of the thing> | <url pointing at its github folder> | name
  - mappings/
    - sssom/
      - <tsv files>
    - fhir/
      - <json files>
  - bundles/
    - <bundle name>/
  - jobs/
    - (this would be storing information about 'jobs' run)
    - jobs.csv
      - job_id | date | <; delimited dih_ids> | ...(what other fields?)
- value_sets/
  - csv/
    - (other ideas)
      - <other format?>/
    - n3c-format/
      - code_set.csv
      - concept_set_container_edited.csv
      - concept_set_version_item_rv_edited.csv
    - dih-format/
      - (What to call this folder?)
      - concept_sets.csv
        - Is this a good name for the csv?
          - other name candidates
            - value_sets.csv?
            - cset.csv?
        - Tall/tidy format?
          - dih_id | field | value
          - 12345 | source | vsac
          - 12345 | oid | <oid>
          - 12345 | name | <my concept set name>
  - json/
    - (other ideas)
      - <source>_<id>/
        - concept name: Should not have certain special characters
        - (examples)
          - vsac_<oid>
          - hcup_<hcup_id>
            - hcup_id == cssr_code
          - dih_<dih_id>
        - <v#>/
          - <source>_<id>.json
    - <dih_id>/
      - v<#>/
        - <dih_id>.json
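
As a rough illustration of the tall/tidy idea for `concept_sets.csv` in the notes above, the sketch below reads `dih_id | field | value` rows and groups them into one record per concept set. It assumes a comma-delimited file with a header row (the `|` in the notes is only read here as a column separator in the outline), and `pivot_tall` is a hypothetical helper name. The row values are the placeholders from the notes, not real data.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows copied from the notes; assumed to be comma-delimited on disk.
TALL_CSV = """dih_id,field,value
12345,source,vsac
12345,oid,<oid>
12345,name,<my concept set name>
"""


def pivot_tall(rows):
    """Group tall rows into {dih_id: {field: value}}."""
    records = defaultdict(dict)
    for row in rows:
        records[row["dih_id"]][row["field"]] = row["value"]
    return dict(records)


records = pivot_tall(csv.DictReader(io.StringIO(TALL_CSV)))
print(records["12345"]["source"])
```

One trade-off worth noting: the tall format lets new fields be added without changing the file's columns, but every consumer needs a pivot step like this one before it can work with whole records.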

Related issues

https://github.com/HOT-Ecosystem/ValueSet-Tools/issues/45

joeflack4 commented 2 years ago

Today at the TIMS meeting, we discussed this topic. Consensus was that for ValueSets, FHIR JSON is likely the best central, canonical data storage format, but that LinkML would be best once that feature is added. See feature request issue: https://github.com/linkml/linkml/issues/905 @Sigfried FYI. Would have loved you here for this discussion but we can talk about it once you get back from vacation.
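
For reference, a minimal sketch of what one FHIR JSON ValueSet file could look like when built from Python. The helper name `minimal_value_set`, the example code system URL, and the codes are all hypothetical placeholders; only the element names (`resourceType`, `status`, `compose.include`, `system`, `concept`, `code`) come from the FHIR ValueSet resource.

```python
import json


def minimal_value_set(dih_id, name, system, codes):
    """Return a small FHIR ValueSet resource as a plain dict."""
    return {
        "resourceType": "ValueSet",
        "id": dih_id,
        "name": name,
        "status": "draft",  # FHIR allows draft | active | retired | unknown
        "compose": {
            "include": [
                {"system": system, "concept": [{"code": c} for c in codes]}
            ]
        },
    }


# All argument values below are made-up placeholders for illustration.
value_set = minimal_value_set(
    "12345", "MyConceptSet", "http://example.org/codesystem", ["A1", "B2"]
)
print(json.dumps(value_set, indent=2))
```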

joeflack4 commented 1 year ago

@Sigfried Just dumping this here. Davera linked this on TIMS Slack. It's about a standard for a "FHIR Terminology Repository". I'm thinking we might want our subset of FHIR content to conform to this.

https://docs.aidbox.app/terminology/fhir-terminology-repository/ftr-specification

hugwuoke commented 1 year ago

@joeflack4 is there anything here that we would like to revisit?

joeflack4 commented 1 year ago

Hmm, I'm gonna re-open this but leave it off the project board. I'd actually prefer it in todo as low urgency, but I think Siggie wants to minimize the project board more.

There are a few different things here: