nhsengland / open-health-statistics

Statistics on open source healthcare repositories
https://nhsengland.github.io/open-health-statistics/
MIT License
28 stars 10 forks

Health Sector Git Repo Topic Ontology #35

Open SamHollings opened 1 year ago

SamHollings commented 1 year ago

Health Sector Git Repo Topic Ontology

!!! tip "TLDR"

??? question "Why should we care?"

??? success "Pre-requisites"

A key aim of RAP is not only to automate our pipelines but also to re-use useful code in other work. This relies on us publishing the code as publicly as possible, and then making it easy to find these useful bits of code. Topics in GitHub can help with this; however, we will get the most benefit from topics by using a common topic vocabulary to describe our GitHub code repos.

The topic ontology described in this guide will ensure our code can be searched by:

!!! warning

The Differences between "topics" and "tags"

In GitHub, tags and topics are different:
    - **Topics** are labels applied to whole repos which describe them, like keywords. Each repo can have up to twenty, and GitHub is good at searching and sorting results by topics.
    - **Tags** are labels applied to specific commits within a git repo, and it's how releases are made, e.g. v0.1.0 might be a tag applied to a specific commit locking in that this commit is Version 0.1.0.

Topics

Our aim with topics is to allow people to find code which might be useful to them, so they can reuse it. With this in mind, they usually want to know what kind of data the code was used on, which language it was written in, whether it uses compatible data structures (e.g. pandas or pyspark), and how recently it was made or updated (people are less trusting of ancient, dead code).

When applying topics to your code:

| Priority | Category | Description | Example topics |
|---|---|---|---|
| 1 | Domain area / Datasets / Data source | People will want to know what data these techniques have been applied to, if any. This might inspire them to do something similar, or highlight areas for collaboration. | secondary-care, primary-care, hospital-episode-statistics, gpdpr, civil-registration-of-deaths, gdppr, artificial (perhaps if it was using artificial data) |
| 1 | Technique | People will want to know what kinds of data processing, analyses, etc. were done. This might be quite broad, as it should cover the sorts of reusable code chunks people might want to look at. | clustering, forecasting, classification, regression, statistical-disclosure-control, deduplication, entity-resolution, record-linkage, summarisation, data-cleansing, data-validation, hyperparameter-tuning, artificial-data-generation, etc. |
| 2 | Technology | If I want to re-use someone's Python or R code, and they made it using a different data structure to me, that might cause problems, hence it's important to describe them. | dplyr, sparklyr, pandas, pyspark, polars, sqlalchemy, sqlalchemy-orm, numpy, sklearn, tensorflow, pytorch, scipy, etc. |
| 2 | Language | People often want to know if the code is using a language they know/use. Though GitHub can sometimes correctly identify the language used in the repo, it can struggle if you have a lot of documentation or use certain languages (such as SQL). | python, r, sql |
| 2 | Maturity | People might want to know if a codebase is made to a high standard, or by people who are just starting out. | baseline-rap, silver-rap, gold-rap |
| 2 | Opt-out of re-use | A topic for those people who want to publish their code, but make it clear that it is not optimised for re-use. | not-optimised-for-reuse |
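
The table above can be captured as a small lookup, so that a repo's topics can be checked before they are applied. The following is a minimal Python sketch and not part of the guide itself: the `ONTOLOGY` dict holds only a subset of the example topics, the helper name `check_topics` is hypothetical, and treating the two priority 1 categories (domain and technique) as required is an assumption based on the priorities in the table.

```python
# Illustrative subset of the ontology table; extend with your own agreed topics.
ONTOLOGY = {
    "domain": {"secondary-care", "primary-care", "hospital-episode-statistics"},
    "technique": {"clustering", "forecasting", "record-linkage", "data-validation"},
    "technology": {"dplyr", "pandas", "pyspark", "polars"},
    "language": {"python", "r", "sql"},
    "maturity": {"baseline-rap", "silver-rap", "gold-rap"},
    "opt-out": {"not-optimised-for-reuse"},
}

# Assumption: the priority 1 categories are treated as required.
REQUIRED = ("domain", "technique")
MAX_TOPICS = 20  # the per-repo topic limit mentioned earlier in this guide


def check_topics(topics):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(topics) > MAX_TOPICS:
        problems.append(f"too many topics: {len(topics)} > {MAX_TOPICS}")
    known = set().union(*ONTOLOGY.values())
    for topic in topics:
        if topic not in known:
            problems.append(f"unknown topic: {topic}")
    for category in REQUIRED:
        if not ONTOLOGY[category] & set(topics):
            problems.append(f"missing a {category} topic")
    return problems
```

For example, `check_topics(["primary-care", "clustering", "python"])` returns an empty list, while a repo tagged only with `python` would be reported as missing a domain topic and a technique topic.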

Using topics to find useful repos (and code)

You can search for repos by topic within GitHub using the search bar (e.g., as seen here, with tips on GitHub search syntax here), or you can use this helpful website, which gathers the repos and topics from the various NHS organisations on GitHub.
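
As a rough illustration of searching by topic, the sketch below builds a GitHub search query from ontology topics. It relies on GitHub's `topic:` and `org:` search qualifiers and the REST search endpoint `https://api.github.com/search/repositories`. The helper names `build_topic_query` and `search_url` are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_topic_query(topics, org=None):
    """Combine topics (and optionally an organisation) into one search query."""
    parts = [f"topic:{t}" for t in topics]
    if org:
        parts.append(f"org:{org}")
    return " ".join(parts)


def search_url(topics, org=None):
    """Return the URL that would search repositories for the given topics."""
    query = build_topic_query(topics, org)
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query})}"


# The query string can also be pasted straight into the GitHub search bar.
print(build_topic_query(["record-linkage", "pyspark"], org="nhsengland"))
```

Combining a domain or technique topic with a technology topic in this way should narrow results to code that is both relevant and written with tools you already use.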

SamHollings commented 1 year ago

Feedback from Jonny: Consider removing the Meta-tag. We will already filter by organisation anyway, and the maturity tag basically fulfils the same role.

JRPearson500 commented 1 year ago

Suggestion to have an opt-out (black list) topic rather than a white-list meta tag

JonathanHope42 commented 1 year ago

> Suggestion to have an opt-out (black list) topic rather than a white-list meta tag

maybe "not-optimised-for-reuse"

JRPearson500 commented 1 year ago

Technique and Domain suggested as the key areas that need topics mandating.

lilianavalles commented 11 months ago

I'd find useful the release status, WIP, done and ready, active (continuously being improved), and inactive (WIP but with no plans to keep working on it).

GiuliaMantovani1 commented 11 months ago

Some of these might be overkill, but just for consideration... Specific types of algorithms? For example, in Data Linkage you could have FS but also other types of Bayesian algorithms. Database used? Development tools? In an ideal world this would not impact how the code is written, but sometimes it does... can probably be understood from level of RAP?

SamHollings commented 11 months ago

@lilianavalles

> I'd find useful the release status, WIP, done and ready, active (continuously being improved), and inactive (WIP but with no plans to keep working on it).

Some interesting suggestions. A question I have... would they help you find useful code? For example, you can see if a repo is still "active" by whether it has been updated recently, so perhaps a "topic" for this doesn't add so much (though it would make it slightly faster to see). They would indicate the code was potentially higher quality... but I suppose so would the "gold, silver etc." RAP topics...

@GiuliaMantovani1

> Some of these might be overkill, but just for consideration... Specific types of algorithms? For example, in Data Linkage you could have FS but also other types of Bayesian algorithms. Database used? Development tools? In an ideal world this would not impact how the code is written, but sometimes it does... can probably be understood from level of RAP?

I think specific types of algorithms might be good, but I wonder if by making the topics too granular we reduce their usefulness? It's difficult to know where to set the threshold though, so if you think it would benefit people to add those in, we can try it. I think the database used, e.g. Databricks, Chroma, Postgres, SQL Server, probably is important, under technology, as it really will affect how the functions are written and so whether they're useful to you. Development tools, do you mean Jupyter, VS Code, etc.? I think if they are very... proprietary, such as Databricks, and ipynb notebooks more generally, it might be good to add a "notebook" topic. But VS Code / PyCharm probably shouldn't change how the code is... I'd assume!

SamHollings commented 10 months ago

I'm going to move this to the RAP website and then we can continue to develop it there.