r-quantities / substances

Substance-aware unit conversions
MIT License

New package concept: parametric conversions #1

Open billdenney opened 8 months ago

billdenney commented 8 months ago

Related to r-quantities/units#134 (and others)

I'm not sure how best to discuss this. In the end, I don't think that it will be part of the units library, but I would like to engage @Enchufa2, @edzer, and others to get this solution right. I hope that you think it's okay to have (or at least start) the discussion here.

The issue of units that are not directly convertible comes up often, as evidenced in r-quantities/units#134 and several other issues linking there. It signals a need for a method of keeping different unit conversions separated. A typical example is the mass-to-moles conversion that happens in many fields. For the data I work with (often laboratory measurements of blood tests), other types of conversions can exist, like activity-to-molar conversions (e.g. an activity of 1 mole per hour of X means that the concentration of Y is Z moles/L).

To accomplish this, I think that the best method would be the creation of a new package that would enable the following:

Are there other features that should be supported?

edzer commented 8 months ago

Cc @henningte

Enchufa2 commented 8 months ago

Thanks for spawning this, @billdenney. Currently,

So if loading several units systems at the same time is not a requirement, I think that the current set of features of the units package is enough to implement such a set of requirements. Otherwise, a bit more work will be needed.

But regardless of this, a separate package to implement such XMLs would be a nice starting point. We can host it here, of course.

billdenney commented 8 months ago

Thanks for the quick thoughts.

I think that loading multiple unit systems at the same time is a requirement (at least it is for my use case of laboratory measurements). My initial thought of the implementation would look something like the following:

Beneath the surface, I was not thinking of trying to load XML. The way that I understand both UDUNITS and units, that would not allow multiple simultaneous systems to be loaded. So, we would not be able to have both "carbon dioxide" and "carbon" loaded at the same time. For my clinical laboratory example, I have a project now that needs to have 515 different systems (that is the largest example that I've ever had, but about 100 different systems at a time is normal).

My thought was that the unit system table would be a data.frame that looks something like the following:

| system | unit_common | unit_alternate | slope | intercept |
|---|---|---|---|---|
| carbon_dioxide | mol | g | 44.01 | 0 |
| carbon | mol | g | 12.01 | 0 |
| gamma_glutamyl_transpeptidase | U | ukat | 0.01667 | 0 |
| sodium | mol | g | 22.989769 | 0 |
| sodium | mol | Eq | 1 | 0 |
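
To make the table concrete, here is a minimal base-R sketch of how such rows could drive a conversion. It assumes the reading `alternate = common * slope + intercept`, which the thread does not state explicitly, and `convert_within_system()` is a made-up name for illustration, not an existing function in units or any other package.

```r
# Conversion table with the same columns as above.
# Assumed meaning: value_alternate = value_common * slope + intercept
conversions <- data.frame(
  system = c("carbon_dioxide", "carbon", "gamma_glutamyl_transpeptidase",
             "sodium", "sodium"),
  unit_common    = c("mol", "mol", "U", "mol", "mol"),
  unit_alternate = c("g", "g", "ukat", "g", "Eq"),
  slope          = c(44.01, 12.01, 0.01667, 22.989769, 1),
  intercept      = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert between the common unit and one alternate
# unit of a single system, in either direction.
convert_within_system <- function(value, system, from, to, tbl = conversions) {
  fwd <- tbl$system == system & tbl$unit_common == from & tbl$unit_alternate == to
  rev <- tbl$system == system & tbl$unit_common == to & tbl$unit_alternate == from
  if (sum(fwd) == 1) {
    value * tbl$slope[fwd] + tbl$intercept[fwd]
  } else if (sum(rev) == 1) {
    (value - tbl$intercept[rev]) / tbl$slope[rev]
  } else {
    stop("no unique conversion from ", from, " to ", to, " for ", system)
  }
}

co2_g <- convert_within_system(2, "carbon_dioxide", from = "mol", to = "g")
c_mol <- convert_within_system(24.02, "carbon", from = "g", to = "mol")
```

Because the lookup is keyed on the system as well as the units, the two sodium rows (g and Eq) and the two "mol to g" rows for carbon dioxide and carbon never collide.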

There are a couple of notable problems that will immediately show up:

Thanks for the offer to host it within r-quantities!

edzer commented 8 months ago

Calling such packages units.xxx suggests they are about units. I think they are not, but rather about conversion constants that depend on the characteristics of the matter the unit refers to. Thinking for instance along the lines of BIPM SI, it seems there are no real "units for domain xxx". Could you think of (a) more descriptive package name(s)?

billdenney commented 8 months ago

That's a good point that they are not units like the BIPM SI units. They are more accurately described as unit conversions. I suggested the names of units.xxx to clearly link to this package.

When I just did a brief look at the list of R packages, I think that looking to highlight "unit conversion" would likely get people to the right place. So, perhaps unitconv.xxx for the specific packages and unitconv.systems for the support package? (I don't have a strong opinion here; my only goal is to make it easy for people to find the packages.)

edzer commented 8 months ago

I think that is a good idea; the link to "units" will be clear from the package dependency and from its source repository being in this GH organisation.

Enchufa2 commented 8 months ago

The basic support for several units systems must be provided by the units package. Otherwise, such a "systems" package would basically need to reimplement units. So "systems" may or may not be required.

To try to shape what's needed, I would need a short but comprehensive example of the set of conversions that would be defined for a couple of analytes as well as the set of operations (within and across analytes).

Enchufa2 commented 8 months ago

Because, @billdenney , when you say "515 units systems", you really mean 515 conversions between some form of unitless parametric quantity, such as moles, to some unit such as grams, right? Because that's not really a units system. ;)

Enchufa2 commented 8 months ago

Also (I'm re-reading the previous discussion in r-quantities/units#134): it would be helpful to know your current workflow and what changed or what prompted you to raise this proposal. Because (I'm just thinking out loud here) if your current workflow works but it requires e.g. complicated parsing, that could just be abstracted in a separate package. But if you hit some fundamental limitation, then it would be helpful to know it.

billdenney commented 8 months ago

@Enchufa2, I was thinking that the "systems" package would do two main things (described below) to accomplish the list of goals above, and thereby prevent the rework needed to implement the jumps between analytes.

I have been using the word "systems" because they are multiple disconnected sets of conversions, each one a "system". It seems like the word "system" is causing issues for this discussion, and I'm happy to choose a different word, but I don't know what that better word would be.

My specific, typical workflow is that I receive data with three columns (among many more) where one column is the analyte (e.g. "LDL cholesterol" or "sodium"), one column is the measurement value as a number, and one column is the units. Some of the units may be the same when analytes are different (e.g. "150 mg/dL LDL cholesterol" and "130 mg/dL sodium", both have units of "mg/dL"). I then need to convert both to standard units of "mmol/L", but the conversion for each is different:

My current workflow is always one-off. Nothing has really changed other than the fact that I had some more thoughts about how to generalize the solution better than I previously had considered. My workflow to standardize the units for a measurement is that I look at the dataset and make individual case_when() calls like the following to set the values and units:

library(tidyverse)
library(assertr)

# `data` is the received dataset with (at least) the columns analyte, value, unit
data %>%
  mutate(
    # Convert the value first, while `unit` still holds the original unit
    value =
      case_when(
        analyte == "sodium" & unit == "mg/dL" ~ value / 2.2989769,
        analyte == "LDL cholesterol" & unit == "mg/dL" ~ value / 38.66976,
        TRUE ~ value
      ),
    unit =
      case_when(
        analyte == "sodium" & unit == "mg/dL" ~ "mmol/L",
        analyte == "LDL cholesterol" & unit == "mg/dL" ~ "mmol/L",
        TRUE ~ unit
      )
  ) %>%
  group_by(analyte) %>%
  mutate(
    unit_count = length(unique(unit))
  ) %>%
  ungroup() %>%
  # Fail if any analyte still has more than one unit
  verify(unit_count == 1)
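
As a rough illustration of how this one-off pattern might be generalized, the sketch below replaces the per-analyte `case_when()` branches with a lookup table, using base R only. The measurement values are made up, the divisors are the ones from the calls above, and the `factors` table layout is an assumption for illustration rather than a proposed interface.

```r
# Made-up measurements in the three-column shape described above
labs <- data.frame(
  analyte = c("sodium", "LDL cholesterol", "sodium"),
  value   = c(322, 150, 140),
  unit    = c("mg/dL", "mg/dL", "mmol/L"),
  stringsAsFactors = FALSE
)

# One row per (analyte, source unit); divisors copied from the case_when() calls
factors <- data.frame(
  analyte  = c("sodium", "LDL cholesterol"),
  unit     = c("mg/dL", "mg/dL"),
  divisor  = c(2.2989769, 38.66976),
  new_unit = c("mmol/L", "mmol/L"),
  stringsAsFactors = FALSE
)

# Match each measurement to its conversion row; NA means "leave unchanged"
idx <- match(paste(labs$analyte, labs$unit, sep = "|"),
             paste(factors$analyte, factors$unit, sep = "|"))
hit <- !is.na(idx)
labs$value[hit] <- labs$value[hit] / factors$divisor[idx[hit]]
labs$unit[hit]  <- factors$new_unit[idx[hit]]

# Same check as verify(unit_count == 1): one unit per analyte after conversion
units_per_analyte <- tapply(labs$unit, labs$analyte, function(u) length(unique(u)))
stopifnot(all(units_per_analyte == 1))
```

Adding an analyte then means adding a row to `factors` instead of editing two `case_when()` calls in step.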

I then relatively often combine one measurement type with another. For example, I may have the concentration of sodium (mg/mL) in urine and the total urine volume for the day (mL/24 h), multiply them together, and then convert the units to mmol/day. And I may have many analytes in the urine where I want to do this (sodium, potassium, glucose, albumin [a protein], etc.).
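
A small worked version of that calculation in base R may help. The concentrations and urine volume are made-up numbers, and the molar masses are standard values (in g/mol, numerically equal to mg/mmol); nothing here is an existing package function.

```r
# Made-up urine concentrations (mg/mL) and 24-hour urine volume (mL/day)
conc_mg_per_mL <- c(sodium = 2.3, potassium = 1.5)
urine_volume_mL_per_day <- 1500

# Molar masses in g/mol (numerically the same as mg/mmol)
molar_mass <- c(sodium = 22.989769, potassium = 39.0983)

# (mg/mL) * (mL/day) = mg/day
excretion_mg_per_day <- conc_mg_per_mL * urine_volume_mL_per_day

# (mg/day) / (mg/mmol) = mmol/day; the divisor depends on the analyte
excretion_mmol_per_day <- excretion_mg_per_day / molar_mass[names(conc_mg_per_mL)]
```

The multiplication is the same for every analyte; only the last step needs to know which substance the number refers to, which is the part a substance-aware conversion would take over.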

The unitconv.systems (or whatever it is called) would:

  1. Enable creation of subclasses for each "system" (or whatever we want to call it) so that "LDL cholesterol" does not accidentally become "sodium". There would be support functions for creating a class called "unitconv.systems LDL cholesterol" and "unitconv.systems sodium", etc. (The class names would look messy, similar to those shown here and unlike typical class names, because they would store the analyte name without modification to prevent accidental collisions.) These subclasses would be storable in a single vector, similar to the mixed_units class in the units package.
  2. Find the way to jump across SI unit boundaries (or any defined unit boundaries, e.g. the "ukat" example above) to allow conversion from mass to moles, for example. And also, find the way to jump between analytes (the "carbon dioxide" to "carbon" example above.) This would happen by the (simplified) directed acyclic graph search described above.
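
To illustrate the second point, here is a minimal base-R sketch of a path search over conversions. It is a plain breadth-first search over a graph whose edges are usable in both directions, which is a simplification swapped in for the graph search mentioned in the thread, and it ignores intercepts. The node labels (`"substance:unit"`) and the function name `find_factor()` are assumptions made for this example only.

```r
# Each row means: value_in_to = value_in_from * factor (intercepts omitted)
edges <- data.frame(
  from   = c("carbon_dioxide:g",   "carbon_dioxide:mol", "carbon:mol"),
  to     = c("carbon_dioxide:mol", "carbon:mol",         "carbon:g"),
  factor = c(1 / 44.01,            1,                    12.01),
  stringsAsFactors = FALSE
)

# Make every edge usable in the reverse direction as well
edges <- rbind(
  edges,
  data.frame(from = edges$to, to = edges$from, factor = 1 / edges$factor,
             stringsAsFactors = FALSE)
)

# Breadth-first search; returns the cumulative factor along the first path found
find_factor <- function(start, goal, edges) {
  visited <- start
  queue <- list(list(node = start, factor = 1))
  while (length(queue) > 0) {
    current <- queue[[1]]
    queue <- queue[-1]
    if (current$node == goal) return(current$factor)
    nxt <- edges[edges$from == current$node & !(edges$to %in% visited), ]
    for (i in seq_len(nrow(nxt))) {
      visited <- c(visited, nxt$to[i])
      queue[[length(queue) + 1]] <- list(
        node = nxt$to[i],
        factor = current$factor * nxt$factor[i]
      )
    }
  }
  stop("no conversion path from ", start, " to ", goal)
}

# Mass of carbon contained in 44.01 g of carbon dioxide
carbon_g <- 44.01 * find_factor("carbon_dioxide:g", "carbon:g", edges)
```

The search crosses two kinds of boundary in one pass: mass to moles within a substance, and moles of one substance to moles of another. Substances with no connecting edge simply produce an error instead of a silent conversion.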

I was suggesting that these two features would exist in a separate package because they could be shared across many inherited packages (e.g. unitconv.clinical, unitconv.chem, etc.).

(I think that I covered everything you just asked for, but if I missed something, please let me know.)

Enchufa2 commented 8 months ago

I think that the main obstacle to this discussion is that we are merging low-level concepts and implementation details with requirements, and as a result we are going in circles here. Please forget for now about systems, boundaries, vectors, subclasses, and acyclic graphs, and let's talk about the workflow, the high-level interface. The code above is what you do now, so let's define what a better workflow should look like. Then we can assess what's available and what's missing, and what would be the best implementation.

Also: I've changed the title because, if we want to generalize this, I think we should be talking about parametric conversions instead of systems. Correct me please if I'm wrong, but every single conversion you are dealing with involves some kind of parametric unit, such as the mole, that requires a different parametrization (e.g. mol/g) for different substances.

billdenney commented 8 months ago

Good point about starting with the requirements. Thanks.

My high-level workflow is:

Does that clarify the workflow sufficiently?

And yes, "parametric units" are what I'm talking about throughout this discussion. Thanks for helping clarify the terminology.

Enchufa2 commented 8 months ago

Thanks for the clear specification of the workflow. Let's say that such columns are analyte, value, unit. What would you expect your code to look like? Maybe something along the following lines? (Don't take the new function names too seriously for now, I'm brainstorming).

library(<new package>)

set_substance_conversions(<data frame of parametric conversions>)

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(new_unit = <specify the destination unit>) |>
  mutate(dst = set_substance_units(analyte, src, new_unit))

EDIT: A more specific proposal, maybe more in-line with the units workflow:

library(substances)

load_substances_df(<data frame of parametric conversions>)

df |>
  mutate(src = set_substances(mixed_units(value, unit), analyte)) |>
  mutate(new_unit = <specify the destination unit>) |>
  mutate(dst = set_units(src, new_unit, analyte)) # analyte here is optional

billdenney commented 8 months ago

(Thank you for the edit to use the units workflow, when possible. I was thinking that we should use units generic functions whenever feasible, too.)

I would hope that the code would be a little simpler than what you suggest:

library(<new package>)

# This would not be required if using the specific package (the xxx.clinical package would already
# have these conversions built-in), but it would be required if using the general package
set_substance_conversions(<data frame of parametric conversions>)

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(new_unit = <specify the destination unit>) |>
  mutate(dst = set_units(src, new_unit))

I dropped "analyte" from the second call (which is now set_units() rather than set_substance_units()) because it would already be contained within src.

Another use would be standardize_substance_units() which would choose the unit that is indicated as "typical" during the call to set_substance_conversions(). So, a simpler workflow could be:

library(<new package>)

set_substance_conversions(<data frame of parametric conversions>)

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(dst = standardize_substance_units(src))

The methods would also need accessors to the attributes:

df |>
  mutate(src = set_substance_units(analyte, value, unit)) |>
  mutate(dst = standardize_substance_units(src)) |>
  mutate(
    dst_value = as.numeric(dst),
    dst_unit = as.character(units(dst))
  )

The above does not cover the conversion between analytes (e.g. "1 mole carbon dioxide" = "1 mole carbon" and "1 mole carbon dioxide" = "2 moles oxygen"), but that is a much bigger lift to get right and maybe it should not be included at this time. This suggestion is not simply interface-bloat; I do use that type of conversion between analytes.

My specific use case for needing to convert between analytes is a medicine and its metabolite. I need to calculate the amount of a medicine that comes out in urine. I receive data like: "10 mg simvastatin" was dosed; we measured "10 ng/mL beta-hydroxy simvastatin" in 500 mL of urine. What fraction of the dose came out in urine as "beta-hydroxy simvastatin"? The process is

I think that would need another method like

set_between_substance_conversions(<data frame of parametric conversions between substances>)
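
For reference, the simvastatin question can be worked by hand in base R. This is only a sketch of the arithmetic: the molar masses are approximate values, "beta-hydroxy simvastatin" is assumed here to be the beta-hydroxy acid form, and 1 mole of metabolite is assumed to correspond to 1 mole of parent drug.

```r
# Approximate molar masses in g/mol (illustrative assumptions)
mw_simvastatin <- 418.57
mw_metabolite  <- 436.58  # assumed: simvastatin beta-hydroxy acid

# Dose: 10 mg simvastatin, converted to moles
dose_mol <- 10e-3 / mw_simvastatin

# Urine: 10 ng/mL in 500 mL gives 5000 ng = 5e-6 g of metabolite
metabolite_g   <- 10e-9 * 500
metabolite_mol <- metabolite_g / mw_metabolite

# Cross-substance step: assumed 1 mol metabolite per 1 mol simvastatin
fraction_of_dose <- metabolite_mol / dose_mol
```

Under these assumptions the fraction is a little under 0.05% of the dose. Note that the mass ratio alone (5 ug / 10 mg) would overstate it, because the metabolite is heavier per mole than the parent.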

Enchufa2 commented 8 months ago

We have a nice initial specification, so I have transferred the issue to a separate repo (thoughts on the name?). The MVP would be conversions within the same substance, so I'll try to address that first. We can iterate and address cross-substance conversions later.

billdenney commented 8 months ago

I like the name. Let's keep it! :) (I'm also happy to entertain other names; I can't think of a better one right now.)

I agree that cross-substance conversions can come later. I wanted to make sure that they were considered throughout the process so that we don't end up making an API that can't work with the concept.

The other thing that we should ensure is that we keep the naming as consistent as feasible. In the drafting discussion, we used several terms. I think that the terms we settled on are:

billdenney commented 8 months ago

I'm tagging several people who had similar questions to the proposal above in case they have additional ideas that may help:

ilikegitlab commented 8 months ago

So now that I'm Cc'd I feel obliged to say something about my workflow: I'm mainly working with gas at the moment. Units are mol fractions, concentrations and partial pressures. So not only do I want to go from mols to grams, but maybe also to partial pressure (which luckily is often the same at sea level), and for water we can also go from mol fraction to a percentage (relative humidity), and for isotopes we use permil (which surprisingly is not even defined in units). I have no problem doing conversions manually, but the consequence is that I need to have mol per mol in my data or things become confusing. More annoyingly, the mol of air is not even well defined, because sometimes I used "different" air, with a different gram/mol.
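
A short base-R sketch of the gas case, assuming ideal-gas behaviour so that partial pressure is mole fraction times total pressure. The 420 umol/mol figure is an illustrative input and the function name is made up.

```r
# Partial pressure from mole fraction (Dalton's law, ideal-gas assumption)
partial_pressure <- function(mole_fraction, total_pressure_Pa = 101325) {
  mole_fraction * total_pressure_Pa
}

co2_umol_per_mol <- 420                               # illustrative mole fraction
co2_Pa   <- partial_pressure(co2_umol_per_mol * 1e-6) # partial pressure in Pa
co2_uatm <- co2_Pa / 101325 * 1e6                     # same quantity in micro-atmospheres

# At a total pressure of exactly 1 atm the two numbers coincide
same_at_sea_level <- isTRUE(all.equal(co2_uatm, co2_umol_per_mol))

# At a lower total pressure (e.g. at altitude) they no longer do
co2_uatm_altitude <- partial_pressure(co2_umol_per_mol * 1e-6, 80000) / 101325 * 1e6
```

This is why "mol per mol" and "partial pressure" look interchangeable at sea level but are still different quantities: the conversion carries a hidden parameter, the total pressure.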

I'm not sure I would really need another package. What would it bring me other than a simpler way of converting (it's not that often that I need to do it)? I also had a dataset recently where I had m^2/m^2, but still two very different areas, so the problem is not exclusive to substances.

Enchufa2 commented 8 months ago

My aim here is to specifically ~solve~ work around the problem of "counts of things" that have a translation to SI units (typically mass, but could be e.g. electric charge). The relationship of those parametric units, as defined by Johansson, to the SI units depends on the things being considered, and this is why I have called this substances for now.

Fractions of the same units are a different beast... When you define a parametric conversion (the aim here), you are referring to a single thing (e.g. g/mol or mol/g of atoms of oxygen). But when you define a fraction like g/g, you are referring to two different things. And unfortunately this is much much harder to handle based on a system like UDUNITS (or any units systems out there that I know for that matter, because not even the SI takes these things into consideration). I'll keep that in mind, but in principle, this is not the goal of this package proposal.

BTW, @ilikegitlab, could you please comment further on that example of m^2/m^2?

henningte commented 8 months ago

I think it is a good idea to collect conversions for some often-used parametric units in their own package. I'm not yet sure if I fully understand the scope and the sketched implementation of the planned package, and the main reason certainly is that I'm not too familiar with formal definitions of unit systems or parametric units.

In any case, I think that defining a naming scheme for the substances and limiting the scope of considered substances are important tasks because otherwise things may get too complicated if one considers the diversity of chemical substances alone (e.g. the same compound (same chemical formula and bonds) but with different charges, etc.). This is another reason why I'm unsure about the scope.

One problem that came to my mind was how to avoid automatic conversion if one installs conversions like mols of CO2 to mols of O and mols of H2O to mols of O. Wouldn't it in such a case be possible to create ambiguities, e.g. that it is possible to compute 2 mols of CO2 + 8 mols of H2O = 12 mols of O, which may be desired behavior in some cases, but not in others?

I noticed that @billdenney said this may be too complicated to consider in a first sketch of the package, but perhaps similar things could happen in other contexts (e.g. conversion of mols of hydrated compounds to mols of water).

I'm not sure at all how likely such things are, but if they are, it may be better to force explicit unit conversion. For example, one could do something like this (i.e., provide a table/list conversion_constants in the planned package from which one can access conversion constants explicitly, but these constants are not installed via the 'units' package):

library(units)
#> udunits database from C:/Users/henni/AppData/Local/R/win-library/4.3/units/share/udunits/udunits2.xml

# units which need to be installed
install_unit("mol_CO2_")
install_unit("mol_water_") # I got an error that the unit is not defined with install_unit("mol_H2O_")
install_unit("mol_O")

# example for a conversion table holding conversion constants
conversion_constants <- 
  list(
    CO2 =
      list(O = units::set_units(2, mol_O/mol_CO2_)),
    H2O =
      list(O = units::set_units(1, mol_O/mol_water_))
  )

# Then nonsense is avoided by default:
units::set_units(1, mol_CO2_) + units::set_units(1, mol_water_)
#> Error: cannot convert mol_water_ into mol_CO2_

# But you can make the conversions explicitly (this can certainly be simplified)
units::set_units(1, mol_CO2_) * conversion_constants$CO2$O + units::set_units(1, mol_water_) * conversion_constants$H2O$O
#> 3 [mol_O]

Created on 2023-10-17 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16 ucrt)
#>  os       Windows 11 x64 (build 22621)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2023-10-17
#>  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
#>  htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.1)
#>  knitr         1.43    2023-05-25 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
#>  Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.1)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.24    2023-08-14 [1] CRAN (R 4.3.1)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
#>  units       * 0.8-3   2023-09-06 [1] Github (billdenney/units@d57f54d)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.1)
#>  xfun          0.40    2023-08-09 [1] CRAN (R 4.3.1)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#>
#>  [1] C:/Users/henni/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.1/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

Just a first thought from my side, in case it is useful.

ilikegitlab commented 8 months ago

@Enchufa2: The aims sound sensible, but I wonder: in the case of grams of oxygen per m^3 of air, or mols of sugar per kg of water (osmolality), would I not also essentially be referring to two different things?

As for m^2/m^2 (or g/kg), this comes up in allometric relationships describing ratios of body parts of plants and animals. You are right that they are two different things. Although it may make sense to simplify it to [1], in practice this breaks current math with udunits because:

(g leaf)/(g plant) * (m^2 leaf)/(g leaf) = (m^2 leaf)/(g plant)

I agree one could go through the trouble of redefining a gleaf and gplant unit, but care should be taken not to make things too rigid or complex because then many people may just drop units at the earliest convenience (I admit I found myself wanting to write a dispense_units(math, reapply="units") method at some point, which I still have managed to avoid!)