statistikat / simPop

Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information

multinom method #21

Closed Kyoshido closed 2 years ago

Kyoshido commented 2 years ago

Hello,

I was experimenting with the multinom method for simulating household size and got this error:

Error in nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE, :
too many (1404) weights

Following the answer at https://stackoverflow.com/questions/36303404/too-many-weights-in-multinomial-logistic-regression-and-the-code-is-running-for, I added the argument MaxNWts, which in the nnet package sets the maximum allowed number of weights.
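For readers who have not hit this limit before, here is a minimal sketch of how it is reached and how `MaxNWts` lifts it. It assumes the `nnet` package is installed. The data are synthetic and the names `toy`, `hsize` and `region` are hypothetical, not taken from simPop. With 18 response classes and a 60-level predictor, the model should need more weights than the default limit of 1000.

```r
library(nnet)  # multinom() passes extra arguments such as MaxNWts on to nnet()

set.seed(42)
n <- 2000
# Hypothetical toy data: 18 household-size classes, one 60-level predictor.
toy <- data.frame(
  hsize  = factor(sample(1:18, n, replace = TRUE)),
  region = factor(sample(sprintf("r%02d", 1:60), n, replace = TRUE))
)

# With the default limit the fit is expected to stop with "too many ... weights".
err <- tryCatch(
  multinom(hsize ~ region, data = toy, trace = FALSE),
  error = function(e) conditionMessage(e)
)
print(err)

# Raising the limit lets the fit proceed.
# maxit is kept small here only to keep the demonstration quick.
fit <- multinom(hsize ~ region, data = toy,
                MaxNWts = 5000, maxit = 10, trace = FALSE)
```

The value 5000 is arbitrary; any limit above the number of weights reported in the error message should do.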

Then another error appeared in the lapply call where we draw from the original sample for each stratum:

Error in sample.int(length(x), size, replace, prob) :
incorrect number of probabilities

It is raised in this function:

  numbers <- lapply(1:ncomb, function(i) {
    n <- households[grid[i, 1], grid[i, 2]]
    w <- wH[split[[i]]]
    p <- w / sum(w)  # probability weights

    spSample(n, p)
  })

The reason for this error is that my data contain only these household sizes:

> levels(as.factor(hsize))
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "19" "20"

You can see that sizes 17 and 18 do not occur.

Everything goes well until this line:

hsizePH <- unlist(lapply(ls, function(l) spSample(NH[l], probs[l,])))

because it changes the household sizes:

> levels(as.factor(hsizePH))
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18"

This happens because spSample(NH[l], probs[l,]) is a wrapper around sample(length(p), size=n, replace=TRUE, prob=p), and the problem lies in p <- probs[l,]:

> probs[l,]
           1            2            3            4            5            6            7 
0.2764668051 0.2697735219 0.1996712258 0.1852551015 0.0162689795 0.0489499869 0.0010677580 
           8            9           10           11           12           13           14 
0.0006122110 0.0002974661 0.0002422427 0.0002950224 0.0001680283 0.0001476288 0.0001612824 
          15           16           19           20 
0.0001576700 0.0001626591 0.0001522689 0.0001501416 

But the sample is drawn from household sizes 1:18, because length(p) is 18, whereas the largest household size in names(p) is 20.
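The mismatch can be reproduced in isolation with base R. In this sketch the probabilities are illustrative (uniform) rather than the values printed above; only the names of `p` matter.

```r
# Probabilities named by the household sizes that occur in the data
# (illustrative values; sizes 17 and 18 are absent).
sizes <- c(1:16, 19, 20)
p <- rep(1 / length(sizes), length(sizes))
names(p) <- sizes

set.seed(1)
draws <- sample(length(p), size = 10000, replace = TRUE, prob = p)

range(draws)               # draws are positions in p, not the sizes in names(p)
any(draws %in% c(19, 20))  # FALSE: sizes 19 and 20 can never be drawn
```

Because `sample(length(p), ...)` returns positions, the probabilities meant for sizes 19 and 20 end up attached to the values 17 and 18.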

So I solved it by replacing length(p) with as.numeric(names(p)). This ensures that only household sizes actually present in the original dataset are sampled.

> levels(as.factor(hsizePH))
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "19" "20"
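A minimal sketch of a sampler along these lines follows. `sample_sizes` is a hypothetical helper, not the simPop function, and it is a variant of the replacement described above: it draws positions first and then maps them to `as.numeric(names(p))`. That choice is an assumption made here because `sample(x, ...)` treats a single number `x` as `1:x`, which would misbehave for a stratum with only one possible size.

```r
# Hypothetical helper: draw n household sizes, using the names of p as values.
sample_sizes <- function(n, p) {
  values <- as.numeric(names(p))
  # Draw positions in p, then map each position to its household size.
  idx <- sample.int(length(p), size = n, replace = TRUE, prob = p)
  values[idx]
}

# Illustrative probabilities; sizes 17 and 18 are absent.
sizes <- c(1:16, 19, 20)
p <- setNames(rep(1 / length(sizes), length(sizes)), sizes)

set.seed(1)
draws <- sample_sizes(10000, p)
all(draws %in% sizes)        # TRUE: only sizes named in p are drawn
any(draws %in% c(17, 18))    # FALSE

# A stratum with a single possible size still works.
sample_sizes(3, c("5" = 1))  # 5 5 5
```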

After that, the code runs without errors. What do you think about it?

Kyoshido commented 2 years ago

The same problem also exists for the distribution method. The same fix can be applied there as well.

matthias-da commented 2 years ago

Hi Jiri

Sorry for the slow answers. I am mostly on holiday before a change in affiliation ;-)

Thanks a lot for the modifications. We are really happy about this.

It seems there is now a warning because of a mismatch between the documentation and the function arguments. I will fix this before releasing a new version to CRAN.