vincentarelbundock / Rdatasets

A collection of datasets originally distributed in R packages
https://vincentarelbundock.github.io/Rdatasets

Minor discrepancies between subfolder csvs and master sheet #30

Closed: ArthurSpirling closed this issue 1 year ago

ArthurSpirling commented 1 year ago

Hello @vincentarelbundock -- thanks so much for providing these data.

I did a very quick scan through the data and its documentation. In particular, I was looking for discrepancies between the main sheet and the names of the datasets themselves (as in name.csv) stored in the subfolders.

Here are some I found that exist as CSVs in the subfolders but are not documented on the sheet. This was very rough and ready, and I might have missed something, but just in case it's helpful for your sweeps --

"aldh2" "apoeapoc" "bomregions2011" "bomregions2012" "bomsoi2001" "cf"
"cnv" "crohn" "Damian" "fa" "fsnps" "head.injury"
"hla" "inf1" "jma.cojo" "l51" "lukas" "mao"
"meyer" "mfblong" "mr" "nep499" "PD"

For example, bomregions2012.csv appears in the DAAG subfolder, but not on that master sheet. And indeed, it has documentation here.
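
A scan like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script actually used: it assumes the CSVs live in one subfolder per package (`<csv_root>/<package>/<name>.csv`) and that the master sheet is itself a CSV with a dataset-name column called `Item`. The function name `undocumented_datasets` is hypothetical.

```python
import csv
from pathlib import Path


def undocumented_datasets(csv_root, index_path, item_column="Item"):
    """Return sorted dataset names that exist as <name>.csv under csv_root
    but are not listed in the index sheet's item column.

    The folder layout and the column name are assumptions for illustration.
    """
    # Names found on disk, e.g. <csv_root>/DAAG/bomregions2012.csv -> "bomregions2012".
    # Path.stem drops only the final ".csv", so "head.injury.csv" -> "head.injury".
    on_disk = {p.stem for p in Path(csv_root).glob("*/*.csv")}

    # Names advertised on the master sheet.
    with open(index_path, newline="", encoding="utf-8") as fh:
        indexed = {row[item_column] for row in csv.DictReader(fh)}

    return sorted(on_disk - indexed)
```

Note that this compares names only, so two packages shipping a dataset with the same name would be treated as one entry.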

Again, thanks for all this work!

ArthurSpirling commented 1 year ago

Ah, also, there are entries for both hdma and hmda, both from Ecdat, with seemingly identical descriptions (?) and docs.

ArthurSpirling commented 1 year ago

Update: DAAG contains both a head.injury.csv and a headInjury.csv, which may be identical (not sure).

vincentarelbundock commented 1 year ago

Thanks for the report. Glad the website is useful!

I looked at a few of these and my best guess is this:

  1. My script never calls git rm on anything, so datasets stay there forever. This is important in case someone links to the URL in one of their scripts.
  2. However, the main sheet index is created every time I run the script, and that's based on what is currently available in the packages. I think that also makes sense: If a package maintainer removes a dataset, I may still want to keep permanent links to protect users, but it's probably "polite" to not advertise the dataset anymore.
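
The two behaviours above can be sketched as follows. This is a minimal Python illustration, not the actual script: the function name `sync`, the `current` input format, and the `Package`/`Item` index columns are all assumptions made for the example.

```python
import csv
from pathlib import Path


def sync(current, csv_root, index_path):
    """Write current datasets to disk and rebuild the index.

    `current` maps (package, item) -> CSV text for every dataset the
    packages ship right now (a hypothetical input format).
    """
    root = Path(csv_root)
    for (package, item), text in current.items():
        target = root / package / f"{item}.csv"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")  # add or overwrite only
    # There is deliberately no deletion step: files written by earlier
    # runs stay in place, so existing links to them keep working.

    # The index is rebuilt from scratch and lists only current datasets.
    with open(index_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Package", "Item"])
        writer.writerows(sorted(current))
```

Running this twice, with a dataset dropped from `current` on the second run, leaves the old CSV on disk but omits it from the index, which is the kind of discrepancy reported above.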

The few datasets I checked didn't seem to be available in their packages anymore. And in the head.injury case, the DAAG changelog says it was a duplicate and was removed:

https://github.com/cran/DAAG/blob/master/NEWS#L27

Again, I didn't check them all, but my provisional conclusion is that things are probably fine as-is. Makes sense?

ArthurSpirling commented 1 year ago

Sounds good, thanks very much.