mrc-ide / leapfrog

Multistate population projection model for demographic estimation.
Other
2 stars 5 forks source link

review DemProj discrepancies #7

Closed jeffeaton closed 2 years ago

jeffeaton commented 2 years ago

There are a couple of discrepancies with demographic projection only simulation:

The demographic projection in leapfrog simulation matches EPP-ASM up to age 79. I was surprised that there was a difference between EPP-ASM and DemProj because I recall checking this quite carefully; but nonetheless.

_Note, leapfrog does not match EPP-ASM for age 80+ because EPP-ASM did not implement the survivorship for the age 79 / age 80+ groups correctly. This is fixed in leapfrog._


Difference in births per year

Here's some code to reproduce the difference in births per year calculated from Leapfrog and Spectrum.

devtools::load_all()

pjnz0 <- "tests/testdata/spectrum/v6.13/bwa_demproj-only-no-mig_spectrum-v6.13_2022-02-12.PJNZ"

demp0 <- prepare_leapfrog_demp(pjnz0)
hivp0 <- prepare_leapfrog_projp(pjnz0)

specres0 <- eppasm::read_hivproj_output(pjnz0)

lsim0 <- leapfrogR(demp0, hivp0)

plot(1971:2030, lsim0$births[-1] - specres0$births[-1],
     type = "l", main = "Difference in births: leapfrog - spectrum")
abline(h = 0, lty = 2)

births-difference


Difference in net migration age 0-5

In PJNZ file with net migration, but no AIM simulation, there is a difference in the population in age 0-5 years between the Spectrum simulation and Leapfrog simulation.

John and Rob mentioned that Spectrum disaggregates the age 0-4 net migration slightly differently.

devtools::load_all()

pjnz1 <- "tests/testdata/spectrum/v6.13/bwa_demproj-only_spectrum-v6.13_2022-02-12.PJNZ"

demp1 <- prepare_leapfrog_demp(pjnz1)
hivp1 <- prepare_leapfrog_projp(pjnz1)

specres1 <- eppasm::read_hivproj_output(pjnz1)

lsim1 <- leapfrogR(demp1, hivp1)

Difference between Leapfrog and Spectrum simulation in year 2:

> lsim1$totpop1[1:7,,2]
         [,1]     [,2]
[1,] 13664.08 13338.50
[2,] 12590.07 12423.31
[3,] 12325.64 12120.56
[4,] 11921.11 11723.03
[5,] 11499.93 11320.52
[6,] 11073.00 10921.57
[7,] 10660.10 10536.57

> specres1$totpop[1:7,,2]
   sex
age     male   female
  0 13666.28 13340.02
  1 12593.31 12425.50
  2 12326.37 12120.95
  3 11919.68 11722.01
  4 11496.88 11318.50
  5 11071.14 10920.36
  6 10660.10 10536.57

## difference
> lsim1$totpop1[1:8,,2] - specres1$totpop[1:8,,2]
   sex
age          male        female
  0 -2.1933509689 -1.5170047071
  1 -3.2470203550 -2.1956692707
  2 -0.7289412044 -0.3934629810
  3  1.4278751853  1.0162819558
  4  3.0577740778  2.0115519861
  5  1.8623348231  1.2033616710
  6  0.0001464848 -0.0001566303
  7 -0.0003429374  0.0001933984
jeffeaton commented 2 years ago

Feedback from @rlglaubius on DemProj discrepancies:

Difference in births per year

When Spectrum calculates births, it divides the PASFR(t,a) by the sum over all ages. I think Leapfrog does not include the normalization term. This is an issue because (e.g.) PASFRs in the example Botswana file sum to 100.0003 in 1971, so there are slightly more births in Leapfrog than Spectrum that year.

Next, I removed the PASFR normalization term from Spectrum. The graph is still spikey, but on a much more satisfying scale: image002

I believe the remaining difference is because births are truncated to five decimal places in the .DP file, whereas R has access to Leapfrog outputs without truncation.

Difference in net migration 0-5

Disaggregating ages 0-4: Below, I’ve copied the code Spectrum uses to disaggregate net migration and base-year population from five-year age groups to single ages. Inputs p1, p2, …, p5 correspond to net migration counts for ages 0-4, 5-9, …, 20-24; a0, a1, …, a4 are the resulting single-age values. If we were using Beers’ method, the code would be:

procedure SplitFirstGroupOld(p1, p2, p3, p4, p5 : double; var a0, a1, a2, a3, a4 : double);
begin
  a0 := 0.3333*p1 - 0.1636*p2 - 0.0210*p3 + 0.0796*p4 - 0.0283*p5;
  a1 := 0.2595*p1 - 0.0780*p2 + 0.0130*p3 + 0.0100*p4 - 0.0045*p5;
  a2 := 0.1924*p1 + 0.0064*p2 + 0.0184*p3 - 0.0256*p4 + 0.0084*p5;
  a3 := 0.1329*p1 + 0.0844*p2 + 0.0054*p3 - 0.0356*p4 + 0.0129*p5;
  a4 := 0.0819*p1 + 0.1508*p2 - 0.0158*p3 - 0.0284*p4 + 0.0115*p5;
end;

Instead, we are using base-year Sx values for ages 0…4, as returned by DP.GetSurvRate (DP.GetSurvRate(t,s,0) is survival after birth, so DP.GetSurvRate(t,s,a) for a>0 corresponds year-to-year survival for people aged a-1 at start-of-year):

procedure SplitFirstGroup(GB : TGBVariables; DP:TDPData;p1, p2, p3, p4, p5 : double; sex : byte; var a0, a1, a2, a3, a4 : double);
Var
  aSum : double;
begin
  if GB.GetInCalcStateMode(DP.proj) then // RG: Jeff, you can ignore this block, since it’s for projections starting after 1985
  begin
    a0 := DP.GetSurvRate1970(sex,1) * 2 * 1000;
    a1 := DP.GetSurvRate1970(sex,2) * a0;
    a2 := DP.GetSurvRate1970(sex,3) * a1;
    a3 := DP.GetSurvRate1970(sex,4) * a2;
    a4 := DP.GetSurvRate1970(sex,5) * a3;
  end
  else // DP_FIRST_INDEX is the first year of the projection
  begin
    a0 := DP.GetSurvRate(DP_FIRST_INDEX,sex,1) * 2 * 1000;
    a1 := DP.GetSurvRate(DP_FIRST_INDEX,sex,2) * a0;
    a2 := DP.GetSurvRate(DP_FIRST_INDEX,sex,3) * a1;
    a3 := DP.GetSurvRate(DP_FIRST_INDEX,sex,4) * a2;
    a4 := DP.GetSurvRate(DP_FIRST_INDEX,sex,5) * a3;
  end;

  aSum := a0 + a1 + a2 + a3 + a4;

  if aSum > 0 then
  begin
    a0 := a0 / aSum * p1;
    a1 := a1 / aSum * p1;
    a2 := a2 / aSum * p1;
    a3 := a3 / aSum * p1;
    a4 := a4 / aSum * p1;
  end
  else
  begin
    a0 := 0.3333*p1 - 0.1636*p2 - 0.0210*p3 + 0.0796*p4 - 0.0283*p5;
    a1 := 0.2595*p1 - 0.0780*p2 + 0.0130*p3 + 0.0100*p4 - 0.0045*p5;
    a2 := 0.1924*p1 + 0.0064*p2 + 0.0184*p3 - 0.0256*p4 + 0.0084*p5;
    a3 := 0.1329*p1 + 0.0844*p2 + 0.0054*p3 - 0.0356*p4 + 0.0129*p5;
    a4 := 0.0819*p1 + 0.1508*p2 - 0.0158*p3 - 0.0284*p4 + 0.0115*p5;
  end;
end;

Now, this only comes up because Spectrum (1) reads 1x1 net migration from a UPD file, (2) aggregates migrants to five-year age groups, (3) factors migrants into overall net migrant numbers and age patterns for the DemProj input editor, then (4) disaggregates those inputs back to single ages during projection. Maybe now is the time to consider just using the WPP 1x1 estimates directly instead of going through this lossy aggregation/disaggregation process? Changing input editors and maintaining backwards compatibility with existing files seem like the two biggest hurdles.

jeffeaton commented 2 years ago

Births discrepancy is resolved in 0ac9ea9 by normalizing ASFR input as described by @rlglaubius in previous comment. Test on DemProj births and population comparison made more stringent.

jeffeaton commented 2 years ago

Under 5 net migration discrepancy is resolved in f510ebd and included in test.

In the DemProj only projection with net migration, the discrepancy in 80+ is slightly larger than other ages (up to about 0.5):

devtools::load_all()
pjnz1 <- "tests/testdata/spectrum/v6.13/bwa_demproj-only_spectrum-v6.13_2022-02-12.PJNZ"

demp1 <- prepare_leapfrog_demp(pjnz1)
hivp1 <- prepare_leapfrog_projp(pjnz1)

specres1 <- eppasm::read_hivproj_output(pjnz1)

lsim1 <- leapfrogR(demp1, hivp1)

diff80plus <- lsim1$totpop1[81,,] - specres1$totpop[81,,]

png("~/Downloads/diff80pl.png", h = 4.5, w = 6, units = "in", res = 180)
matplot(1970:2030, t(diff80plus), type = "l", lty = 1,
        main = "Difference in 80+ population: leapfrog - spectrum")
abline(h = 0, lty = 2)
legend("topleft", c("male", "female"), col = 1:2, lty = 1)
dev.off()

diff80pl

This might come from subtle difference in how the adjustment to the age group where net migrations is calculated or how migrations and survivorship are done in the open age group.

I am inclined to call this 'good enough' though and close this. What do you think @rlglaubius?

rlglaubius commented 2 years ago

The difference apparently has to do with the survival ratios applied to net migrants in the open-ended age group. Leapfrog applies the 80+ survival ratio, whereas Spectrum uses the average of 79 and 80+ survival ratios weighted by population sizes. To test, I made the following changes to leapfrog-raw.h:

--- a/inst/include/leapfrog-raw.h
+++ b/inst/include/leapfrog-raw.h
@@ -97,10 +97,15 @@ template <typename Type, int NG, int pAG, int pIDX_FERT, int pAG_FERT>
       totpop1(pAG-1, g, t) += totpop1(pAG-1, g, t-1) - natdeaths_open_age;

       // net migration
-      for(int a = 1; a < pAG; a++) {
+      for(int a = 1; a < pAG - 1; a++) {
         migrate_ag(a, g) = netmigr(a, g, t) * (1.0 + sx(a, g, t)) * 0.5 / totpop1(a, g, t);
         totpop1(a, g, t) *= 1.0 + migrate_ag(a, g);
       }
+
+      int a = pAG - 1;
+      Type numer = totpop1(a, g, t-1) * (1.0 + sx(a+1, g, t)) + totpop1(a-1, g, t-1) * (1.0 + sx(a, g, t));
+      Type denom = totpop1(a, g, t-1) + totpop1(a-1, g, t-1);
+      totpop1(a, g, t) += netmigr(a, g, t) * 0.5 * numer / denom;
     }

     // fertility

After making this change, the 80+ population difference is much smaller: diff80pl

My code changes are more verbose than necessary, and could probably be optimized by using some products of the preceding calculations.

jeffeaton commented 2 years ago

Hi @rlglaubius -- thanks, that's awesome.

And yes, you're right that it can be expressed in terms of the previous calculations for totpop1 and natdeaths in the open age group:

https://github.com/mrc-ide/leapfrog/blob/0c985624f87ee9f3691b2354398b6e8f85d621a2/inst/include/leapfrog-raw.h#L197-L205

The calculation could be unified without needing to handle the open age group as a special case if we used that same expression here: https://github.com/mrc-ide/leapfrog/blob/0c985624f87ee9f3691b2354398b6e8f85d621a2/inst/include/leapfrog-raw.h#L193

But I've just left it with the special case as maybe that is useful for clarity.

Closing this issue, and moving on to AIM components!