Pre-submission query: R Interface to the SNOMED CT terminology service

peterdutey commented 1 year ago

Submitting Author Name: Peter Dutey-Magni
Submitting Author Github Handle: @peterdutey
Other Package Authors Github handles: @AnikaC-git
Repository: https://github.com/ramses-antibiotics/snomedizer/
Submission type: Pre-submission
Language: en

Paste the full DESCRIPTION file inside a code block below:

```
Package: snomedizer
Type: Package
Title: R Interface to the SNOMED CT Terminology Server REST API
Version: 0.3.0
Date: 2022-07-08
Authors@R: c(
    person(given = "Peter",
           family = "Dutey-Magni",
           role = c("aut", "cre", "res"),
           email = "p.dutey-magni@ucl.ac.uk",
           comment = c(ORCID = "0000-0002-8942-9836")),
    person(given = "Anika",
           family = "Cawthorn",
           role = c("rev", "res"),
           email = "a.cawthorn@ucl.ac.uk",
           comment = c(ORCID = "0000-0002-2438-7495")),
    person("University College London", role = c("cph")))
Description: Interrogate the SNOMED CT clinical ontology using the
   SNOMED International Terminology Server REST API <https://github.com/IHTSDO/snowstorm>.
URL: https://github.com/ramses-antibiotics/snomedizer
BugReports: https://github.com/ramses-antibiotics/snomedizer/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Depends:
    R (>= 3.0.0)
Imports:
    jsonlite,
    purrr,
    dplyr,
    httr,
    methods,
    progress,
    Rdpack (>= 0.7)
Suggests:
    testthat,
    tidyr,
    knitr,
    magrittr,
    rmarkdown,
    covr,
    pkgdown (>= 2.0.0),
    R.utils,
    oysteR
RoxygenNote: 7.1.1
VignetteBuilder: knitr
RdMacros: Rdpack
Config/testthat/edition: 2
```

Scope

Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

Data Lifecycle Packages
- [X] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [X] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [X] text analysis
  
Statistical Packages
- [ ] Bayesian and Monte Carlo Routines
- [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
- [ ] Machine Learning
- [ ] Regression and Supervised Learning
- [ ] Exploratory Data Analysis (EDA) and Summary Statistics
- [ ] Spatial Analyses
- [ ] Time Series Analyses

Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

This package is designed to democratise access to the global standard healthcare terminology resource, SNOMED CT. It provides an R interface to the Snowstorm terminology server, a Java/Elasticsearch open-source application that is maintained and supported by SNOMED International on the basis of well-established technical standards.

If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package?

Not applicable.

Who is the target audience and what are scientific applications of this package?

This package is aimed at healthcare analysts and health service researchers who are primarily using R and dplyr for their work. The package will have important uses in biomedical research to allow users easy access to healthcare terminology and ontological reasoning. It does not assume prior knowledge of ontological reasoning or full-text search engines.

Draft manuscript introducing the package

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

No.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Yes. The documentation informs users that sensitive personal information should not be processed with this package unless behind a firewall.

Any other questions or issues we should be aware of?:

Unit testing and vignette building rely on a public remote API (https://snowstorm.ihtsdotools.org/snowstorm/snomed-ct/swagger-ui.html or https://browser.ihtsdotools.org/snowstorm/snomed-ct/swagger-ui.html).

SNOMED International (IHTSDO) and the Snowstorm developers are informed of this project.

Many thanks in advance for your feedback! Peter Dutey

emilyriederer commented 1 year ago

Hi @peterdutey ! Thank you for submitting your package to rOpenSci. As we consider fit, I have a few follow-up questions for you.

Could you please elaborate on how you see this package fitting the "scientific software wrappers" and "text analysis" categories? I currently understand the functionality to focus on retrieving data from the SNOMED API and (optionally) structuring it into a dataframe, which I would judge to be aligned with the "data retrieval" category. Further descriptions of the categories and some linked examples may be found here.
We have recently been discussing internally when and how API wrapper packages add the most incremental value to the research community and just released a new blog post on this issue. I see that your manuscript also addresses this, but I'd appreciate any additional thoughts you can share on how the wrapped API eases the user experience. Do you see the benefits more on the technical side (e.g. formulating the request, pagination, etc.) or in encoding domain-specific context (e.g. making endpoints more discoverable and documented)?

Thanks!

peterdutey commented 1 year ago

Hi @emilyriederer,

Thank you for the prompt response and please find answers below.

If we must select a single category, then it would arguably be data retrieval. The other two may, however, also apply. Text analysis: the package can be used to do some very basic named entity recognition, particularly as and when the Snowstorm team release a new feature we requested, which would allow the user to run Elastic multi-term fuzzy queries via the REST API. Scientific software wrapper: we fulfil it by offering an interface to Snowstorm that is fit for purpose for clinical researchers, as they are mostly familiar with R.
The motivation behind this package is to remove some very stubborn obstacles to the use of SNOMED CT, which are both human and technical.

Most clinical research is done by junior doctors or other junior researchers. They have to juggle clinics/lab work and come from fields other than computer science, meaning they probably do not know what an ontology or a REST API is, or how to use one. Yet they have a basic use for both. We thus need to make it easy for them to perform basic analysis in the limited time they have available.
Insufficient visibility. Most people in the target community have heard of SNOMED. But they think it’s just a list of terms. Most are unaware that it is underpinned by a fully-fledged ontology, because they’ve never seen it in action. There’s a lack of tools out there that ‘surface’ the capabilities of SNOMED in the first place, and invite people to invest in learning more.
Related to the above, widening the use of Snowstorm is overdue, and that presupposes a human interface designed around the needs of a new user group. Very few people have heard of Snowstorm, surprisingly even in the informatics community, despite it being the reference terminology service. The Snowstorm API consists of a large number of operations which are mostly undocumented on the Swagger interface, possibly because Snowstorm was designed as a backend to other applications, rather than a service for non-specialist end users. As a result, we find that the non-specialist user struggles to work out what each operation does and which one to use. The research we’ve done has led to boiling the Snowstorm features down to just six wrapper functions (thanks to function overloading). These are much more intuitive and feature original documentation that does not exist anywhere else (to date).

To reference your blog, we are not in a situation where a like-for-like replica of the Snowstorm API in R would have addressed the community problem: it needed simplifying and designing with a specific community in mind, namely clinical researchers and the professionals enabling the research (healthcare analysts). We need this package to create new opportunities to teach/self-learn SNOMED CT, the Expression Constraint Language, and the basics of reasoning in ontologies.

There are also future plans for embedding additional knowledge within snomedizer. The first one coming up is incorporating datasets providing lookups to other vocabularies used for medical products in large research databases such as the UK Biobank and the Clinical Practice Research Datalink. This will come with a special vignette/tutorial on medicines.

I hope the above makes sense, but please do not hesitate to seek clarification and ask more questions. Many thanks in advance, Peter, on behalf of the Snomedizer team

emilyriederer commented 1 year ago

Thanks @peterdutey ! I really appreciate the detailed and thoughtful reply. All of the additional context is very informative and, as someone outside of this domain, really helps me understand the value of this package. I think this is definitely in scope.

I did check in with the team. Since all of the features that you mention for "text analysis" and "scientific software" are coming from the API, we do consider these to be part of the "data retrieval" category, whereas the others would relate more to text analysis functionality built within the package and/or non-API wrappers (e.g. of a command line tool).

I'll close this issue for now, but please proceed to a full submission at your convenience!

ropensci / software-review

Pre-submission query: R Interface to the SNOMED CT terminology service #548
