vincentarelbundock / Rdatasets

A collection of datasets originally distributed in R packages
https://vincentarelbundock.github.io/Rdatasets

Potential esoph data inaccuracy. #3

Closed AJFOWLER closed 3 years ago

AJFOWLER commented 3 years ago

Hello!

I've been working through some tutorials that use the data provided by Tuyns et al. [1] The data are described in some detail in both this Stata bulletin on the population attributable fraction and these lecture slides.

In these descriptions, the sum of patients should be 975 (200 cases and 775 controls).

However, the esoph data give different totals:

library(datasets)
dat <- esoph
sum(dat$ncontrols)  # = 975
sum(dat$ncases)     # = 200

Together these give a total of 1175 records, not 975. I think an error has slipped in whereby ncontrols actually holds the total number of records (cases plus controls). If esoph$ncases is subtracted from esoph$ncontrols, the result is 775 controls, which matches the descriptions above.
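The arithmetic can be checked from the totals quoted in this post alone. A minimal sketch, using only those reported sums rather than recomputing them from the data; the variable names are illustrative and not part of the datasets package:

```r
# Totals as reported in this post (not recomputed from the data here)
reported_ncontrols <- 975  # sum(esoph$ncontrols) as quoted above
reported_ncases    <- 200  # sum(esoph$ncases) as quoted above
published_controls <- 775  # controls in the published descriptions

# If ncontrols held true controls, the study would have this many subjects
implied_total <- reported_ncontrols + reported_ncases
implied_total  # 1175

# If ncontrols is really cases + controls, subtracting cases recovers controls
implied_controls <- reported_ncontrols - reported_ncases
implied_controls == published_controls  # TRUE
```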

One thing to add: I am 99% sure that the esoph data come from the Tuyns paper, for a couple of reasons, but I can't find the exact reference because I don't have access to the Breslow book cited in the R documentation.

The reasons are:

  1. The number of combinations is exactly the same (88 combinations of age, alcohol status, tobacco status)

  2. The département of Ille-et-Vilaine is mentioned in the R documentation of esoph

  3. Counts and features otherwise match precisely, and odds calculated on the corrected data match those provided by tutorials.
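The first reason can be checked with a few lines of R. This is a rough sketch that assumes the stratifying columns are named agegp, alcgp and tobgp, as in the esoph documentation; the resulting count can then be compared with the 88 combinations mentioned above:

```r
library(datasets)

# Count the distinct age/alcohol/tobacco combinations present in the data
strata <- unique(esoph[, c("agegp", "alcgp", "tobgp")])
nrow(strata)

# One row per combination means no stratum is split or duplicated
nrow(strata) == nrow(esoph)
```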

A solution would be to keep the current ncontrols values as a total column and subtract ncases from them to get the true controls, giving three columns: total, cases, and controls.
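As a sketch of what that change might look like, here is the same transformation on a small toy data frame. The counts are made up for illustration, and ntotal is a hypothetical column name, not something that exists in esoph:

```r
# Toy rows with made-up counts; column names mirror esoph, values do not
toy <- data.frame(
  ncases    = c(1, 0, 4),
  ncontrols = c(10, 5, 9)  # assumed here to hold cases + controls
)

# Keep the original column as the total, then derive the true controls
toy$ntotal    <- toy$ncontrols
toy$ncontrols <- toy$ntotal - toy$ncases

# Every row should now satisfy cases + controls == total
stopifnot(all(toy$ncases + toy$ncontrols == toy$ntotal))
toy
```

Working on a toy frame keeps the example independent of whichever version of the data is installed.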

Happy to open a PR to do this if helpful. I couldn't find a GitHub repo for the core R datasets code, so I thought it best to open this issue here first, as I see you have that dataset included.

[1] Tuyns, A. J., G. Pequignot, and O. M. Jensen. 1977. Le cancer de l’oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d’alcool et de tabac. Bulletin du Cancer 64: 45–60.

vincentarelbundock commented 3 years ago

Unfortunately, Rdatasets is a completely independent effort. I have no special way of communicating with the original data/package maintainers. I don't even know who maintains the esoph data. Your best bet might be to look for a way to open a ticket on R-forge somewhere, or to post on one of the R devel mailing lists.

Sorry.

AJFOWLER commented 3 years ago

Thanks Vincent, apologies for bothering you with this.

vincentarelbundock commented 3 years ago

No worries. I'm just sorry I can't do much more. Rdatasets is really just mooching off other people's great work ;)