ices-tools-dev / RDBEScore

Provides functions to work with the Regional Database & Estimation System (RDBES)
https://ices-tools-dev.github.io/RDBEScore/
GNU General Public License v3.0

Variance calculation functions using "Multiple count" estimator #13

Closed: davidcurrie2001 closed this issue 2 years ago

davidcurrie2001 commented 2 years ago

In the data quality part of the FishNCo project we are working on variance calculations that can be applied to a multi-regional sampling programme.

The intention is to implement the "Multiple count" estimator in R, using the RDBES format.

rix133 commented 2 years ago

I'm also looking into this. It's probably best if the discussion happens here rather than on the e-mail list.

rix133 commented 2 years ago

I think the issue is in the diagonals. If I use `if(k==l){part1 <- (Enk - Enk*Enl) / (Enk)}` I get realistic values; see below:

  varEstimate <- NA
  for (k in 1:nrow(y)){
    Enk <- mcGetExpectedValueK(y, k, selectionMethod)
    part2 <- y[k, 'studyVariable'] / Enk
    for (l in 1:nrow(y)){
      Enl <- mcGetExpectedValueK(y, l, selectionMethod)
      Enknl <- mcGetExpectedValueKL(y, k, l, selectionMethod)
      # Off-diagonal terms use the joint expectation E(nk*nl)
      if (k != l){part1 <- (Enknl - (Enk * Enl)) / Enknl}
      # Diagonal terms (k == l) use E(nk) instead
      if (k == l){part1 <- (Enk - Enk * Enl) / Enk}
      part3 <- y[l, 'studyVariable'] / Enl
      varEstimate_kl <- part1 * part2 * part3
      if (is.na(varEstimate)){
        varEstimate <- varEstimate_kl
      } else {
        varEstimate <- varEstimate + varEstimate_kl
      }
    }
  }
  varEstimate

The reason I thought of this is that the estimators in the samplingbook package have a different Enknl on the matrix diagonals.

rix133 commented 2 years ago

I think the reason is that, since "π_kl is the probability of including both unit k and l in the sample", the joint inclusion probability is simply n/N when k and l are the same. It is the product n/N * (n-1)/(N-1) only when k != l.
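
For illustration, here's a rough sketch of what I mean for SRSWOR, building the full matrix of second-order expectations with the diagonal set to n/N (the function name `srswor_pi_kl` and the equal probabilities are just my assumptions for the example):

```r
# Sketch: second-order inclusion probabilities for SRSWOR (equal probabilities).
# Diagonal entries are the first-order probability n/N, since "both k and l
# in the sample" reduces to "k in the sample" when k == l.
srswor_pi_kl <- function(n, N) {
  pkl <- matrix((n / N) * ((n - 1) / (N - 1)), nrow = N, ncol = N)
  diag(pkl) <- n / N  # pi_kk = pi_k, not the off-diagonal product
  pkl
}

srswor_pi_kl(2, 4)  # diagonal 0.5, off-diagonal 1/6
```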

davidcurrie2001 commented 2 years ago

Yes, I think you're right about the diagonals. Looking at equation (4) in http://www.asasrms.org/Proceedings/y2009/Files/302918.pdf (which is referenced by Annica), it sums over all i,j where j != i.

I've just implemented their version of the generalised variance estimator (equation 4) and it is giving positive variance values:

  varEstimate <- NA
  for (k in 1:nrow(y)){
    for (l in 1:nrow(y)){
      # The k == l terms are excluded from this double sum
      if (k != l){
        Enk <- mcGetExpectedValueK(y, k, selectionMethod)
        Enl <- mcGetExpectedValueK(y, l, selectionMethod)
        Enknl <- mcGetExpectedValueKL(y, k, l, selectionMethod)
        part1 <- (Enk * Enl - Enknl) / Enknl
        part2a <- y[k, 'studyVariable'] / Enk
        part2b <- y[l, 'studyVariable'] / Enl
        part2 <- (part2a - part2b)^2
        varEstimate_kl <- part1 * part2
        if (is.na(varEstimate)){
          varEstimate <- varEstimate_kl
        } else {
          varEstimate <- varEstimate + varEstimate_kl
        }
      }
    }
  }
  # Each pair is counted twice in the double sum
  varEstimate <- varEstimate / 2
  varEstimate

I still don't know whether it gives the correct answer though...
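
One sanity check I can think of: under SRSWOR with equal probabilities, the generalised estimator should reduce to the textbook formula N^2 * (1 - n/N) * s^2 / n, which is easy to compute independently (the sample values and population size here are made up):

```r
# Independent reference value for an SRSWOR sample (illustrative numbers)
y <- c(4, 7, 1, 9)  # sampled study variable values
n <- length(y)
N <- 20             # assumed population size
N^2 * (1 - n / N) * var(y) / n  # textbook SRSWOR variance estimate: 980
```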

rix133 commented 2 years ago

I think having zero variance when all items are included is a good test (like you demonstrated before); otherwise you're right, the correct answer is the key. BTW, I just committed my implementation of estimMC to the dev branch.

davidcurrie2001 commented 2 years ago

I've just checked, and for data sampled without replacement (WOR) I get the same variance estimate from your implementation as from my latest implementation - that's encouraging. I get different answers for sampling with replacement (WR), though - I think your probability matrix might be wrong in that situation. It looks like you have used the following for all cases:

sampled / total * (sampled - 1) / (total - 1)

but I think that's only true for WOR - for WR it should be:

sampled * (sampled - 1) / total^2

I still get a different answer when I make that change though :-S
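
For what it's worth, here are the two off-diagonal formulas side by side (the helper name `jointExpectation` is just for illustration, not estimMC's actual interface; equal selection probabilities assumed):

```r
# Off-diagonal joint expectation E(nk * nl), k != l, equal probabilities
jointExpectation <- function(n, N, withReplacement = FALSE) {
  if (withReplacement) {
    n * (n - 1) / N^2               # WR: ordered pairs of distinct draws
  } else {
    (n / N) * ((n - 1) / (N - 1))   # WOR: include k, then l given k
  }
}

jointExpectation(2, 4)                         # WOR: 1/6
jointExpectation(2, 4, withReplacement = TRUE) # WR:  1/8
```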

davidcurrie2001 commented 2 years ago

For WR, the diagonal of the matrix also won't be equal to the first-order probability, because selecting the same unit twice is allowed and the probability of that happening can be calculated (unlike for WOR).
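
A quick way to see this: with replacement, the count n_k for a unit follows a Binomial(n, 1/N) distribution, so the diagonal E(n_k^2) = Var(n_k) + E(n_k)^2 can be checked by simulation (equal selection probabilities assumed; numbers are illustrative):

```r
# Monte Carlo check of the WR diagonal term E(n_k^2) vs the binomial formula
set.seed(1)
n <- 5; N <- 10; p <- 1 / N
counts <- replicate(1e5, {
  draws <- sample.int(N, n, replace = TRUE)  # one WR sample of size n
  sum(draws == 1)                            # n_1: how often unit 1 was drawn
})
mean(counts^2)                 # simulated E(n_1^2), close to the next line
n * p * (1 - p) + (n * p)^2    # binomial formula: 0.7
```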

rix133 commented 2 years ago

Ok, this is a good point!

rix133 commented 2 years ago

I pushed a new version where I think estimMC now has the correct variance for SRSWR (I hope that the π is correct now as well :) An interesting point to observe is that for HH the variance will never be 0, as there is no guarantee that all elements are in the sample for any number of draws.

davidcurrie2001 commented 2 years ago

When I use your new code I get exactly the same totals and variances for WOR and WR with my test data - we've achieved consistency at least :-) The script I used is at MCtest01.R. Good point about the variance for WR - I'd naively assumed it tends to 0 as the number of selections increases, but I haven't checked.

davidcurrie2001 commented 2 years ago

I did just spot a difference. If you only have a single sample (n=1) then I don't believe it should be possible to calculate variance. When I run your code with n=1 for WR I get NaN, but when I run it for WOR I get a numerical value returned for variance. Can you check that?

rix133 commented 2 years ago

You mean if n=1 and N>n? Yes - if you look at the theory Annica presented, π_k can be calculated and π_kl is 0, since n(n-1) is 0 and 0/N is 0. Indeed, the variance formula contains division by π_kl. But I think you must consider that

π_kl is the probability of including both unit k and l in the sample

I think that then (when k=l) the joint inclusion probability is simply n/N, which is the case for WOR sampling if n=1, and then there is no division by 0. I guess we need a statistician to sort out who has the correct theory :)
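
Numerically the edge case looks like this (WOR, equal probabilities; the numbers are just an example):

```r
# n = 1, WOR: first-order probability exists, joint (k != l) probability is 0
n <- 1; N <- 5
pi_k  <- n / N                        # 0.2: any single unit can be included
pi_kl <- n * (n - 1) / (N * (N - 1))  # 0: two distinct units can never co-occur
c(pi_k = pi_k, pi_kl = pi_kl)         # dividing by pi_kl is the problem
```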

rix133 commented 2 years ago

I did not write this before, but yes, I agree that calculating variance from a single sample seems intuitively impossible.

annicadg commented 2 years ago

Hi, not sure if I can help but I will do my best. First, the general variance estimation formula can be split into two parts; the first part is the sum over the cases where k and l are equal. I emailed you this formula today and will also update the document with it. Second, there are two possible variance estimators: the Horvitz-Thompson and the Yates-Grundy (the latter is what you found in the paper by Chromy). Both are unbiased, but they only give the same result in special cases. The Yates-Grundy estimator can only be used for fixed-size sampling designs, but aside from that limitation it is often preferred. I gave the Horvitz-Thompson estimator since it is the most general. A simple test of whether you have the calculations right is perhaps to do them both the general way (inserting the suitable inclusion/drawing probabilities in the general formulas) and the simple way, using the simplified formulas for a specific design - for instance simple random sampling with/without replacement.
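
As a sketch of that suggested test (my own illustrative code, using equal probabilities under SRSWOR, where the two general estimators happen to coincide):

```r
# Horvitz-Thompson variance estimator (general form, SRSWOR probabilities)
htVar <- function(y, N) {
  n <- length(y); pik <- n / N
  pikl <- n * (n - 1) / (N * (N - 1))
  v <- 0
  for (k in seq_along(y)) for (l in seq_along(y)) {
    pkl <- if (k == l) pik else pikl  # diagonal uses the first-order probability
    v <- v + (pkl - pik * pik) / pkl * (y[k] / pik) * (y[l] / pik)
  }
  v
}

# Yates-Grundy variance estimator (fixed-size designs only)
ygVar <- function(y, N) {
  n <- length(y); pik <- n / N
  pikl <- n * (n - 1) / (N * (N - 1))
  v <- 0
  for (k in seq_along(y)) for (l in seq_along(y)) {
    if (k != l) {
      v <- v + (pik * pik - pikl) / pikl * (y[k] / pik - y[l] / pik)^2
    }
  }
  v / 2
}

y <- c(3, 8, 5)
htVar(y, 12)  # 228, same as ygVar(y, 12) and N^2*(1 - n/N)*var(y)/n
```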

annicadg commented 2 years ago

Hi again, I included a short section (section 4) about simple random sampling in the document, see the attached file. For this simple design the general formulas simplify a lot; I mark the simplified formulas in yellow. If you get the same results with the general formulas as with the simplified formulas, it suggests that the calculations might be correct. A Generalized Horvitz-Thompson Estimator v2.docx
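
In R, the textbook simplified SRS estimators for a total look like this (I'm going from the standard forms here, not the exact notation in the document, so treat `srsVar` as an illustrative sketch):

```r
# Simplified variance estimators for a total under simple random sampling
srsVar <- function(y, N, withReplacement = FALSE) {
  n <- length(y)
  if (withReplacement) {
    N^2 * var(y) / n                # SRSWR: no finite population correction
  } else {
    N^2 * (1 - n / N) * var(y) / n  # SRSWOR: with finite population correction
  }
}

srsVar(c(2, 6, 4), 10)                          # WOR
srsVar(c(2, 6, 4), 10, withReplacement = TRUE)  # WR (larger: no fpc)
```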

rix133 commented 2 years ago

Thank you @annicadg for the update and simplifications. The only thing not working for me in the general case is the SRSWR variance (in the function estimMC). A Generalized Horvitz-Thompson Estimator v2.docx does not specify how to define pk; the previous version states that pk = 1/N, but I don't get the correct variance if I add this to the E(nk) definition (I do get the correct total, though). For the variance I also tried a different approach, as E(nk nl) needs the joint probability of selecting pk twice, but with no luck.

annicadg commented 2 years ago

Hi, great that most things work now. Regarding the SRSWR, it is still meant to be pk = 1/N. Not sure what is wrong, could you please give some more details? How far off is your variance estimate?

annicadg commented 2 years ago

Hi again, I went through the document again and found a typo (an n missing) in the formula marked in green. Also tried to clarify some formulas. Not sure if this is of any help. A Generalized Horvitz-Thompson Estimator v3.docx

rix133 commented 2 years ago

Hi @annicadg, I still don't get it. If pk = 1/N, what is pl? I thought that pk = pl even if k = l (in SRSWR each draw is independent of the others and all have equal probability). But in this case, if I substitute into the general formula (with E(nknl) = n(n-1)/N^2 and E(nk) = E(nl) = n/N), then n(n-1)/N^2 - n/N * n/N is negative, since n > n-1. What am I missing - isn't E(nk) = E(nl) in SRSWR?
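
To make the sign issue concrete with some example numbers:

```r
# E(nk*nl) - E(nk)*E(nl) under the formulas above, with n = 3, N = 10
n <- 3; N <- 10
n * (n - 1) / N^2 - (n / N) * (n / N)  # negative, because n - 1 < n
```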

davidcurrie2001 commented 2 years ago

I couldn't get the variance from the generalised HT variance estimator to match the variance calculated from the SRSWR equations. However, I then tried implementing the generalised Yates-Grundy variance formula for HH (from section 3 of http://www.asasrms.org/Proceedings/y2009/Files/302918.pdf) instead - these values do match those from the SRSWR equation, and all the estimMC tests pass. You can find my alterations to estimMC.R in a new branch: https://github.com/ices-tools-dev/icesRDBES/blob/temp/R/estimMC.R

davidcurrie2001 commented 2 years ago

Discussed at the meeting today - @annicadg will check the MC variance equation in the WR case to see why it's not giving the same results as the SRSWR simplified equation or the Yates-Grundy HH formula.

davidcurrie2001 commented 2 years ago

@annicadg suggested using the Sen-Yates-Grundy equation for the variance calculation - note that this is only valid for designs with fixed (not random) sample sizes. This has been implemented, and all tests for WR and WOR are now passing :-) - the function also reproduces the values of two worked examples. Next steps are to write the code to account for 1) unequal probability #19, and 2) multi-stage sampling #15
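
For reference, a minimal sketch of the Sen-Yates-Grundy estimator in its general form (`sygVar`, `pik`, and `pikl` are illustrative names, not estimMC's actual interface; the example uses SRSWOR probabilities):

```r
# Sen-Yates-Grundy variance estimator: valid only for fixed sample sizes
sygVar <- function(y, pik, pikl) {
  v <- 0
  for (k in seq_along(y)) for (l in seq_along(y)) {
    if (k != l) {
      v <- v + (pik[k] * pik[l] - pikl[k, l]) / pikl[k, l] *
        (y[k] / pik[k] - y[l] / pik[l])^2
    }
  }
  v / 2  # halve because each pair appears twice in the double sum
}

# SRSWOR example: n = 2 drawn from N = 5, equal probabilities
n <- 2; N <- 5
pik  <- rep(n / N, n)
pikl <- matrix(n * (n - 1) / (N * (N - 1)), n, n)
sygVar(c(3, 7), pik, pikl)  # 60, matching N^2*(1 - n/N)*var(c(3, 7))/n
```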