statistikat / simPop

Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information

Error in pnM[, numbers[[i]]] : indice fuori limite (subscript out of bounds) #4

Closed Angelo3452 closed 4 years ago

Angelo3452 commented 4 years ago

Dear all, I am a PhD candidate currently working on synthesizing a population for my case study. To do so I am using the simPop package, following the steps reported in "Simulation of Synthetic Complex Data: The R Package simPop". I can successfully define the first round of weights based on the gender distribution per district, but as soon as I try to create the household structure with the simStructure function I get the following error:

Error in pnM[, numbers[[i]]] : indice fuori limite
## indice fuori limite = subscript out of bounds
Inoltre: Warning message:
## Inoltre = In addition
In matrix(pn, nrow = as.numeric(grid[i, 1])) :
  data length [243] is not a sub-multiple or multiple of the number of rows [2]

I am attaching the two datasets I am using as input (anonymized, with only the information relevant for the example) as xlsx files, since it seems I cannot upload .csv. I also cannot upload the .RData that led me to the issue, but these are the functions I used:

set.seed(1234)
library("simPop")
data("Survey3", package="simPop")
inp<-specifyInput(Survey3, hhid = "ï..InterviewNumber", hhsize = "HouseholdSize", strata = "Districts",

I remain available for any questions you might have. In the meantime, thank you for your time. Best

CensusFreq3.xlsx

matthias-da commented 4 years ago

Hi

As I wrote to you, we need a reproducible example. Yours isn't reproducible: it contains at least 4 errors and does not include the part that imports the data. This is a reproducible example:

library(readxl)
Survey3 <- data.frame(read_excel("~/Downloads/Survey3.xlsx"))
CensusFreq3 <- data.frame(read_excel("~/Downloads/CensusFreq3.xlsx"))
set.seed(1234)
library("simPop")

Survey3$Districts <- factor(Survey3$Districts)
inp <- specifyInput(Survey3, 
                    hhid = "InterviewNumber", 
                    hhsize = "HouseholdSize", 
                    strata = "Districts",
                    weight = "Weight")

addWeights(inp) <- calibSample(inp, CensusFreq3)
synthP <- simStructure(data = inp,
                       method = "direct",
                       basicHHvars = c("Age", "Gender", "Districts"))

However, a problem already appears at household size 2; see the debugging result:

debugonce("simStructure")

synthP <- simStructure(data = inp,
                       method = "direct",
                       basicHHvars = c("Age", "Gender", "Districts"))

# in debug mode, line 86:

i <- 2
pn <- which(hid %in% hidH[split[[i]]])
pnM <- matrix(pn, nrow = as.numeric(grid[i, 1])) # warning: why two rows?
c(pnM[, numbers[[i]]]) # error: far fewer columns than needed

I think Alex or Bernhard wrote this part of the code?
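The warning and the error from the debugging session can be reproduced in isolation with base R. This is only a minimal sketch under assumptions: the values of `pn` and `wanted` are made up for illustration, and `wanted` merely stands in for `numbers[[i]]`, whose real contents are not shown in this thread.

```r
# Hypothetical person indices: 3 persons were found, although households
# of size 2 should always contribute an even number of persons.
pn <- c(11L, 12L, 13L)

# matrix() recycles pn to fill a 2 x 2 matrix and warns about the length mismatch.
warn_msg <- tryCatch(matrix(pn, nrow = 2),
                     warning = function(w) conditionMessage(w))
pnM <- suppressWarnings(matrix(pn, nrow = 2))
dim(pnM)

# Hypothetical stand-in for numbers[[i]]: it asks for a column that does not exist.
wanted <- c(1, 3)
err_msg <- tryCatch(pnM[, wanted],
                    error = function(e) conditionMessage(e))

warn_msg  # the "data length ... number of rows" warning
err_msg   # the "subscript out of bounds" error (text depends on the locale)
```

If the number of persons is not a multiple of the household size, the matrix has fewer complete columns than expected, so column indices computed from the expected number of households can fall outside it.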

I think it's a minor problem that only occurs when hhsize is provided; hhsize should not be important information anyway, because hhid is specified. Thus this code runs without error:

library(readxl)
Survey3 <- data.frame(read_excel("~/Downloads/Survey3.xlsx"))
CensusFreq3 <- data.frame(read_excel("~/Downloads/CensusFreq3.xlsx"))

set.seed(1234)
library("simPop")

Survey3$Districts <- factor(Survey3$Districts)

inp <- specifyInput(Survey3, 
                    hhid = "InterviewNumber", 
                    # hhsize = "HouseholdSize",
                    strata = "Districts",
                    weight = "Weight")

addWeights(inp) <- calibSample(inp, CensusFreq3)
synthP <- simStructure(data = inp,
                       method = "direct",
                       basicHHvars = c("Age", "Gender", "Districts"))

I will keep the problem in mind and look at it, but it stays on the list as a "minor" priority to solve.

matthias-da commented 4 years ago

Solved.

Angelo3452 provided an inconsistent data set.

See

head(Survey3, 2)

  InterviewNumber Districts HouseholdSize Age Gender Weight
1           92378         1             1  67      1      1
2          348385         2             2  57      2      1

and

Survey3[Survey3$InterviewNumber == 348385, ]

  InterviewNumber Districts HouseholdSize Age Gender Weight
2          348385         2             2  57      2      1

There must be 2 rows, not 1, for a household of size 2.
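This kind of inconsistency can be detected before calling simPop by comparing the number of rows per household ID with the stated household size. A minimal base R sketch, assuming the column names from the example; the first two households mirror the rows shown above, while household 777 is hypothetical and added only as a consistent case:

```r
# Toy data: households 92378 and 348385 mirror the rows shown above;
# household 777 is a hypothetical, consistent household of size 2.
survey <- data.frame(
  InterviewNumber = c(92378L, 348385L, 777L, 777L),
  HouseholdSize   = c(1, 2, 2, 2)
)

# Number of rows actually present for each household, aligned to the rows.
observed <- ave(seq_along(survey$InterviewNumber),
                survey$InterviewNumber,
                FUN = length)

# Household IDs whose stated size differs from the number of rows.
inconsistent <- unique(survey$InterviewNumber[observed != survey$HouseholdSize])
inconsistent
```

Here only household 348385 is flagged, because it claims two members but has a single row.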

We are closing this issue.

miilljana commented 2 years ago

Hi

I am getting the same error as @Angelo3452: Error in pnM[, numbers[[i]]] : indice fuori limite ## indice fuori limite = subscript out of bounds

The microdata I am using contain information only about the reference person of each household, including the hhsize; there is no information about the other members of a household. With this kind of data, applying the simStructure method directly to replicate households is impossible.

Another option is to replicate the individuals manually (first integerization of the weights, then expansion of the microdata to match the districts' constraints).
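One possible reading of this manual approach, sketched in base R. Everything here is an assumption made for illustration: the toy data, the column names, and the helper `integerise_weights`, which uses a simple largest-remainder rounding rather than any method prescribed by simPop. The sketch does not calibrate to district constraints and only replicates reference persons; the remaining household members would still have to be generated separately.

```r
# Hypothetical microdata: one row per household reference person.
ref <- data.frame(
  hhid   = 1:4,
  hhsize = c(1, 2, 3, 2),
  weight = c(1.6, 2.3, 0.7, 1.4)
)

# Hypothetical helper: round weights to integers while preserving the
# (rounded) total, giving the extra units to the largest fractional parts.
integerise_weights <- function(w) {
  base <- floor(w)
  remaining <- round(sum(w)) - sum(base)
  frac <- w - base
  top <- order(frac, decreasing = TRUE)[seq_len(remaining)]
  base[top] <- base[top] + 1
  as.integer(base)
}

int_w <- integerise_weights(ref$weight)

# Expansion: repeat each reference-person row according to its integer weight.
households <- ref[rep(seq_len(nrow(ref)), times = int_w), ]
households$new_hhid <- seq_len(nrow(households))
rownames(households) <- NULL
households
```

With these toy weights, the total of 6 is preserved and the expanded table has one row per replicated household.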

I would be grateful for any suggestions.