Distinct dataset documentation #17

Open fvcortes opened 1 year ago

fvcortes commented 1 year ago

The documentation shown for the 'housing' dataset doesn't match the rows and columns that are actually imported.

How to reproduce:

>>> from pydataset import data
>>> df = data('housing')
>>> df

       id    y  time  sec
1       1  1.0     0    1
2       1  2.0     6    1
3       1  2.0    12    1
4       1  2.0    24    1
5       2  1.0     0    1
...   ...  ...   ...  ...
1444  361  NaN    24    0
1445  362  1.0     0    0
1446  362  1.0     6    0
1447  362  1.0    12    0
1448  362  1.0    24    0

[1448 rows x 4 columns]

>>> data('housing', show_doc='True')


PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

Frequency Table from a Copenhagen Housing Conditions Survey


The housing data frame has 72 rows and 5 variables.





Satisfaction of householders with their present housing circumstances, (High, Medium or Low, ordered factor).


Perceived degree of influence householders have on the management of the property (High, Medium, Low).


Type of rental accommodation, (Tower, Atrium, Apartment, Terrace).


Contact residents are afforded with other residents, (Low, High).


Frequencies: the numbers of residents in each class.


Madsen, M. (1976) Statistical analysis of multiple contingency tables. Two examples. Scand. J. Statist. 3, 97–106.

Cox, D. R. and Snell, E. J. (1984) Applied Statistics, Principles and Examples. Chapman & Hall.


Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.


options(contrasts = c("contr.treatment", "contr.poly"))
# Surrogate Poisson models
house.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family = poisson,
                  data = housing)
summary(house.glm0, cor = FALSE)
addterm(house.glm0, ~. + Sat:(Infl+Type+Cont), test = "Chisq")
house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))
summary(house.glm1, cor = FALSE)
1 - pchisq(deviance(house.glm1), house.glm1$df.residual)
dropterm(house.glm1, test = "Chisq")
addterm(house.glm1, ~. + Sat:(Infl+Type+Cont)^2, test  =  "Chisq")
hnames <- lapply(housing[, -5], levels) # omit Freq
newData <- expand.grid(hnames)
newData$Sat <- ordered(newData$Sat)
house.pm <- predict(house.glm1, newData,
                    type = "response")  # poisson means
house.pm <- matrix(house.pm, ncol = 3, byrow = TRUE,
                   dimnames = list(NULL, hnames[[1]]))
house.pr <- house.pm/drop(house.pm %*% rep(1, 3))
cbind(expand.grid(hnames[-1]), round(house.pr, 2))
# Iterative proportional scaling
loglm(Freq ~ Infl*Type*Cont + Sat*(Infl+Type+Cont), data = housing)
# multinomial model
(house.mult<- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                       data = housing))
house.mult2 <- multinom(Sat ~ Infl*Type*Cont, weights = Freq,
                        data = housing)
anova(house.mult, house.mult2)
house.pm <- predict(house.mult, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pm, 2))
# proportional odds model
house.cpr <- apply(house.pr, 1, cumsum)
logit <- function(x) log(x/(1-x))
house.ld <- logit(house.cpr[2, ]) - logit(house.cpr[1, ])
(ratio <- sort(drop(house.ld)))
(house.plr <- polr(Sat ~ Infl + Type + Cont,
                   data = housing, weights = Freq))
house.pr1 <- predict(house.plr, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pr1, 2))
Fr <- matrix(housing$Freq, ncol  =  3, byrow = TRUE)
house.plr2 <- stepAIC(house.plr, ~.^2)

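I can't find any description of the dataset that is actually imported (1448 rows, columns id, y, time and sec). The documentation shown describes a different dataset with 72 rows and 5 variables. I suggest updating the documentation so that it describes the dataset that is actually loaded.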
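
In case it helps with triage, here is a minimal sketch of how the mismatch could be checked automatically. It only parses the "has N rows and M variables" sentence from the documentation text and compares it with a DataFrame shape. The helper names `documented_shape` and `shape_matches` are hypothetical and not part of pydataset, and the documentation text is passed in as a plain string, so how it is obtained from pydataset is left open.

```python
import re


def documented_shape(doc_text):
    """Return the (rows, columns) stated in an R-style help text, or None.

    Hypothetical helper: it looks for a sentence such as
    "The housing data frame has 72 rows and 5 variables."
    """
    match = re.search(r"has (\d+) rows and (\d+) (?:variables|columns)", doc_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def shape_matches(doc_text, actual_shape):
    """True only if the documented shape exists and equals the actual shape."""
    expected = documented_shape(doc_text)
    return expected is not None and expected == tuple(actual_shape)


# Values taken from the report above: the documentation sentence and df.shape.
doc = "The housing data frame has 72 rows and 5 variables."
print(documented_shape(doc))          # (72, 5)
print(shape_matches(doc, (1448, 4)))  # False
```

Running a check like this over every bundled dataset might reveal whether other entries are paired with the wrong documentation as well.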