r-quantities / constants

Reference on Constants, Units and Uncertainty
https://r-quantities.github.io/constants

Update package to CODATA 2018 values? #6

Closed: solarchemist closed this issue 3 years ago

solarchemist commented 3 years ago

The CODATA values are adjusted on a roughly four-year cycle, and the latest revision (2018 CODATA) represented an unusually large adjustment due to the revised SI. It appears that the publication of the customary review article(s) in the primary literature that finalise such an adjustment is slightly delayed this time (according to this CODATA blog post, review articles on the 2018 CODATA adjustments were supposed to be published by October 2020). I guess we can forgive the CODATA task group for being slightly late in a year like this.

Now the issue that arises, I think, is that the NIST reference website already lists the 2018 CODATA values. For example, NIST lists the elementary charge as 1.602176634 x 10^-19 C (with zero uncertainty, as it is now exactly defined), whereas the current version (v0.0.2) of this package shows

> constants::codata %>% filter(symbol == "e")
           quantity symbol            value unit rel_uncertainty            type
1 elementary charge      e 1.6021766208e-19    C         6.1e-09 electromagnetic

So a user that compares the values from this package to the NIST reference will find that many diverge, and be understandably confused.
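
For a quick sense of the scale of the divergence (plain arithmetic with the two values quoted above, nothing package-specific):

e_2014 <- 1.6021766208e-19  # value shipped in constants v0.0.2
e_2018 <- 1.602176634e-19   # exact value on the NIST website (revised SI)
(e_2018 - e_2014) / e_2018
#> ~8.2e-09, i.e. larger than the 6.1e-09 relative uncertainty quoted above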

The commit history indicates that this package has not experienced a CODATA adjustment before, so I am curious how you would like to approach this issue. Only update the package once the review articles are published (and risk confusing some users in the meantime)? Or update the package in sync with the web-based NIST reference?

In either case, would you care to elaborate on how such an update would be carried out? At a cursory glance through the source I did not find any function that fetches/scrapes data from an external source, so I assume the data is collated/tabulated and put into the package in some fashion.

I have a few data R packages under my belt (none on CRAN though), so I've a basic understanding of the structure of this package. I would be happy to help with such an update, given some pointers.

Enchufa2 commented 3 years ago

Thanks for looking into this.

  1. Yes, I would definitely like to update the package with the latest CODATA edition. I wasn't aware that they updated the website a year ago. We can update the package again later when the publication comes out.
  2. I did this once, already two years ago, when I built this package for the first time, so I don't remember the details. I have to check my notes and come up with a plan.
solarchemist commented 3 years ago

You're quite welcome.

That sounds like a plan :-)

Enchufa2 commented 3 years ago

There's a new branch called codata2018 with the new values. The first time I did this, it was part automated, part by hand. Quite a mess, not replicable. This time, I made a script to scrape the values. I could use another set of eyes: it would help if you could install the package from that branch and check the constants::codata table.
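
As a rough illustration only (this is not the script in the branch): one could pull the full 2018 table from NIST's plain-text listing. The sketch below assumes the allascii.txt file and its layout (a header block ending in a dashed line, columns separated by two or more spaces, spaces used as digit separators inside numbers).

url <- "https://physics.nist.gov/cuu/Constants/Table/allascii.txt"
raw <- readLines(url)
# the header block ends with a line of dashes; keep everything after it
body <- raw[(grep("^-{5,}", raw)[1] + 1):length(raw)]
# columns (quantity, value, uncertainty, unit) are separated by 2+ spaces
fields <- strsplit(trimws(body), "\\s{2,}")
codata2018 <- data.frame(
  quantity    = vapply(fields, `[`, "", 1),
  value       = vapply(fields, `[`, "", 2),
  uncertainty = vapply(fields, `[`, "", 3),
  unit        = vapply(fields, `[`, "", 4),
  stringsAsFactors = FALSE
)
# numbers use spaces as digit separators, and exactly-defined values may
# carry a trailing "...", e.g. "8.617 333 262... e-5"
codata2018$value <- as.numeric(gsub("\\.\\.\\.| ", "", codata2018$value))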

I noticed that they have correlations for each pair of constants too, so given that the errors package supports correlations, I'm scraping those too. This will take a while, though.
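
As a rough illustration of what the correlations enable downstream (not code from the branch): the correl() setter in the errors package registers a correlation between two quantities so that it is taken into account during propagation. The numbers below are made up.

library(errors)
x <- set_errors(1.23, 0.01)
y <- set_errors(4.56, 0.02)
correl(x, y) <- 0.9   # register the correlation between the two quantities
z <- x / y            # propagation now accounts for the covariance
errors(z)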

solarchemist commented 3 years ago

In the interest of giving you feedback quickly, I've primarily inspected the constants::codata table. I've compared the dataset in the codata2018 branch to master, and would like to note the following.

  • The order of the columns in this branch (c("symbol", "quantity", "type", "value", "uncertainty", "unit")) is not the same as in master (c("quantity", "symbol", "value", "unit", "rel_uncertainty", "type")). I think sticking to a defined order across package versions is better, but this is a minor point.

  • The uncertainty column has changed name. This is a potentially breaking change. I actually prefer absolute uncertainty (easier to understand at a glance), so I vote for it - go ahead and drop rel_uncertainty in favour of uncertainty.

  • The symbol for many constants has changed, which is confusing. I suppose it's NIST that has changed them (the extraction of the symbol string from the URL in extract_symbols() appears to work as intended). I hope that this column stays the same over time, because having a unique identifier for each constant is quite valuable.

Finally, there's something going on with the type column. First, a bunch of constants have NA values, which is probably not intended (there are no NAs in the master branch type column). Second, the different kinds of types have changed markedly in this branch compared to master. Is that due to NIST changing theirs, or something else?

For reference, regarding the second point, this is master:

> codata %>% pull(type) %>% unique()
 [1] "universal"                  "electromagnetic"            "atomic-nuclear-general"     "atomic-nuclear-electroweak" "atomic-nuclear-electron"   
 [6] "atomic-nuclear-muon"        "atomic-nuclear-tau"         "atomic-nuclear-proton"      "atomic-nuclear-neutron"     "atomic-nuclear-deuteron"   
[11] "atomic-nuclear-triton"      "atomic-nuclear-helion"      "atomic-nuclear-alpha"       "physicochemical"            "adopted"    

... and this is this branch:

> codata %>% pull(type) %>% unique()
 [1] "Atomic and nuclear"               "X-ray values"                     "Physico-chemical"                
 [4] "Non-SI units"                     "Physico-chemical, Adopted values" "Electromagnetic"                 
 [7] "Universal"                        "Electromagnetic, Adopted values"  "Adopted values"                  
[10] "Universal, Adopted values"        "Universal, Non-SI units"          NA  
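
A quick way to list the affected entries, assuming the branch's column names (output omitted here):

library(dplyr)
constants::codata %>% filter(is.na(type)) %>% select(symbol, quantity)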

Thanks for making the download/scraping logic available like this; I certainly learned a lot by reading through it.

Lastly, I suppose it's prudent to keep in mind that the 2018 "adjustment" is a really major one - the number of constants has gone from 237 to 354!

Enchufa2 commented 3 years ago

Thanks. Note that I've increased the major version number of the package, which signals breaking changes. They are required to make future updates seamless. Comments inline:

In the interest of giving you feedback quickly, I've primarily inspected the constants::codata table. I've compared the dataset in the codata2018 branch to master, and would like to note the following.

  • The order of the columns in this branch (c("symbol", "quantity", "type", "value", "uncertainty", "unit")) is not the same as in master (c("quantity", "symbol", "value", "unit", "rel_uncertainty", "type")). I think sticking to a defined order across package versions is better, but this is a minor point.

The new column order is intended. It makes more sense to me, so I'm taking advantage of the general overhaul to introduce this minor change too (nobody should depend on the column order anyway).

  • The uncertainty column has changed name. This is a potentially breaking change. I actually prefer absolute uncertainty (easier to understand at a glance), so I vote for it - go ahead and drop rel_uncertainty in favour of uncertainty.

For the 2014 values, the information was scraped from the PDF of one of the papers, which is suboptimal and error-prone. The paper had only the relative uncertainty, and so the package ended up with that column. I much prefer the absolute one. The new scraping method takes all the info (value, unit, uncertainty) more elegantly in one go from the corresponding webpage.
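
For anyone who depended on the old column, it should be straightforward to recompute it from the new one; a minimal sketch, assuming the new column names:

library(dplyr)
constants::codata %>% mutate(rel_uncertainty = uncertainty / abs(value))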

  • The symbol for many constants has changed, which is confusing. I suppose it's NIST that has changed them (the extraction of the symbol string from the URL in extract_symbols() appears to work as intended). I hope that this column stays the same over time, because having a unique identifier for each constant is quite valuable.

The previous symbol column was hand-crafted. At the time, I didn't notice that NIST has a unique ASCII identifier for each constant on the webpage. So I decided to drop the old names and stick to the ones given by NIST, which will most probably be stable over time, making future updates straightforward. This is the biggest breaking change, but I believe it is for the better.
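
For illustration, the identifier is just the query string of each constant's page, e.g. https://physics.nist.gov/cgi-bin/cuu/Value?me for the electron mass. A sketch of the extraction (not the actual extract_symbols()):

sub(".*\\?", "", "https://physics.nist.gov/cgi-bin/cuu/Value?me")
#> [1] "me"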

Finally, there's something going on with the type column. First, a bunch of constants have NA values, which is probably not intended (there are no NAs in the master branch type column). Second, the different kinds of types have changed markedly in this branch compared to master. Is that due to NIST changing theirs, or something else?

[...]

Lastly, I suppose it's prudent to keep in mind that the 2018 "adjustment" is a really major one - the number of constants has gone from 237 to 354!

The old categories were scraped from the paper too. Now these names are taken from the web, which is more robust. There are more constants listed on the website than in the paper I used last time, and some of them are not listed under any category, hence 1) the NAs and 2) the fact that we have more constants now.

Enchufa2 commented 3 years ago

One question. There are cases in which the NIST symbol matches the name of a function in base R, so attaching those symbols would create conflicts. In particular, there are two cases: c and sigma. Therefore, we need to add something to those symbols to avoid these collisions. Any preference? Currently I'm appending a zero (c0 and sigma0), but then the original NIST symbol is no longer recoverable. I could append, e.g., a dot instead (c. and sigma.). Thoughts?
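
For reference, the two functions being shadowed (strictly speaking, sigma() lives in stats, which is attached by default):

base::c(1, 2, 3)                           # concatenation
stats::sigma(lm(mpg ~ wt, data = mtcars))  # residual standard deviation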

solarchemist commented 3 years ago

Thanks for the thorough answers. It's apparent you had already considered the points I raised, so thanks for your patience too!

Regarding your question, my knee-jerk reaction is why are name clashes an issue? The constants are not exposed in the R parent environment after loading the library, are they?

In any case, perhaps we could add a hash symbol, like c#? Then the NIST symbol is still recoverable via the URLs, e.g. both https://physics.nist.gov/cgi-bin/cuu/Value?c and https://physics.nist.gov/cgi-bin/cuu/Value?c# resolve correctly. Just an idea, not thought through or anything. Perhaps there are other characters that could be used in the same way that are more suitable.

Enchufa2 commented 3 years ago

Attaching the symbols, or a subset of them, is a comfortable way of working with them, and I don't want collisions with base R, especially when one of the clashes involves the function c(), which is quite important in R.

The hash symbol starts a comment, so c# is the same as c. Besides, the speed of light in vacuum is commonly known as c0, so I think I'll stick to this.

Enchufa2 commented 3 years ago

On CRAN now.
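
A quick usage sketch with the released version, assuming the syms list exported by the package and the renamed identifiers discussed above (me is NIST's identifier for the electron mass):

library(constants)
syms$c0                # speed of light in vacuum (exact in the revised SI)
with(syms, me * c0^2)  # e.g. the electron rest energy in joules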

Enchufa2 commented 3 years ago

BTW, thanks for the post, nice one!

solarchemist commented 3 years ago

Thank you, I appreciate that! I hope it spurs others to use and/or contribute to r-quantities. I think quantity analysis built into R/tidyverse in this manner is a major quality improvement to any scientist's toolset, and well worth the small extra effort.