Issue with TaylorEstimator("proportion") by domain on certain data

kevintcaron commented 4 months ago

Hi @MamadouSDiallo,

I am attempting to calculate smoking prevalence ('_RFSMOK3') by race ('_RACE') with BRFSS data. This works fine for the 2018 BRFSS data, but with the 2019 data I get the error below. I thought it might be a divide-by-zero error, but there appear to be plenty of respondents in each racial group. My code is below the error; running it requires downloading the 2019 BRFSS data (or the 2018 data to see it work). Do you know what is causing this issue?

ERROR:

...\Lib\site-packages\samplics\estimation\expansion.py:161: RuntimeWarning: invalid value encountered in scalar divide
  return float(np.sum(samp_weight * y) / np.sum(samp_weight))
...\Lib\site-packages\samplics\estimation\expansion.py:304: RuntimeWarning: invalid value encountered in divide
  location_weights = np.sum(y_weighted, axis=0) / scale_weights
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[37], line 27
     26 stratified_proportion_smoking = TaylorEstimator("proportion")
---> 27 stratified_proportion_smoking.estimate(
     28     y=df[smoker_var],
     29     samp_weight=df[weight],
     30     stratum=df["_STSTR"],
     31     psu=df["_PSU"],
     32     domain=df[domain],
     33     single_psu=SinglePSUEst.skip,
     34     remove_nan=True)
     36 df_smoking = stratified_proportion_smoking.to_dataframe()

File ~\.conda\envs\ntcp_env\Lib\site-packages\samplics\estimation\expansion.py:833, in TaylorEstimator.estimate(self, y, samp_weight, x, stratum, psu, ssu, domain, by, fpc, deff, coef_var, single_psu, strata_comb, as_factor, remove_nan)
    830 self.as_factor = as_factor
    832 if _by.shape in ((), (0,)):
--> 833     self._estimate(
    834         y=_y,
    835         samp_weight=_samp_weight,
    836         x=_x,
    837         stratum=_stratum,
    838         psu=_psu,
    839         ssu=_ssu,
    840         domain=_domain,
    841         fpc=self.fpc,
    842         deff=deff,
    843         coef_var=coef_var,
    844         skipped_strata=skipped_strata,
    845         as_factor=as_factor,
    846         remove_nan=remove_nan,
    847     )
    848 else:
    849     for b in self.by:

File ~\.conda\envs\ntcp_env\Lib\site-packages\samplics\estimation\expansion.py:637, in TaylorEstimator._estimate(self, y, samp_weight, x, stratum, psu, ssu, domain, fpc, deff, coef_var, skipped_strata, as_factor, remove_nan)
    635 coef_var = {}
    636 for level in self.variance[key]:
--> 637     point_est1 = self.point_est[key]
    638     variance1 = self.variance[key]
    639     if isinstance(point_est1, dict) and isinstance(variance1, dict):

KeyError: nan

CODE:

import pandas as pd
import samplics
import numpy as np
from samplics.estimation import TaylorEstimator
from samplics.utils import SinglePSUEst

df_2019 = pd.read_sas(r'LLCP2019.XPT')
smoker_var = '_RFSMOK3'
weight = '_LLCPWT'
domain = '_RACE'
df = df_2019.copy()

# RECODE smoker variable
# Replace 1 with 0
df.loc[df[smoker_var] == 1, smoker_var] = 0
# Replace 2 with 1
df.loc[df[smoker_var] == 2, smoker_var] = 1
# Replace 9 with nan
df[smoker_var] = df[smoker_var].replace(9, np.nan)

stratified_proportion_smoking = TaylorEstimator("proportion")
stratified_proportion_smoking.estimate(
    y=df[smoker_var],
    samp_weight=df[weight],
    stratum=df["_STSTR"],
    psu=df["_PSU"],
    domain=df[domain],
    single_psu=SinglePSUEst.skip,
    remove_nan=True)

df_smoking = stratified_proportion_smoking.to_dataframe()

MamadouSDiallo commented 4 months ago

Hi @kevintcaron,

With recent versions of samplics, you must use an enum to specify the population parameter. For a proportion, that is PopParam.prop; please see my edits to your code below. There were also issues in samplics caused by missing values in the domain variable.

Please update to the latest version and retry. Let me know if you are still having issues.

import numpy as np
import pandas as pd

from samplics.estimation import TaylorEstimator
from samplics.utils import PopParam, SinglePSUEst

df_2019 = pd.read_sas(r"./issues/number56/LLCP2019.XPT")
smoker_var = "_RFSMOK3"
weight = "_LLCPWT"
domain = "_RACE"
df = df_2019.copy()

# RECODE smoker variable
# Replace 1 with 0
df.loc[df[smoker_var] == 1, smoker_var] = 0
# Replace 2 with 1
df.loc[df[smoker_var] == 2, smoker_var] = 1
# Replace 9 with nan
df[smoker_var] = df[smoker_var].replace(9, np.nan)

stratified_proportion_smoking = TaylorEstimator(PopParam.prop)
stratified_proportion_smoking.estimate(
    y=df[smoker_var],
    samp_weight=df[weight],
    stratum=df["_STSTR"],
    psu=df["_PSU"],
    domain=df[domain],
    single_psu=SinglePSUEst.skip,
    remove_nan=True,
)

df_smoking = stratified_proportion_smoking.to_dataframe()
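
For anyone who hits the same `KeyError: nan` on an older version, a quick look at the domain column can confirm whether missing values are involved. The sketch below is only an illustration: it uses a small made-up DataFrame in place of the BRFSS file, the helper name `drop_missing_domain` is hypothetical, and it does not call samplics or reproduce its internals.

```python
import numpy as np
import pandas as pd


def drop_missing_domain(df: pd.DataFrame, domain: str) -> pd.DataFrame:
    """Return a copy of df without the rows whose domain value is missing.

    Hypothetical helper for illustration; not part of samplics.
    """
    return df.dropna(subset=[domain]).copy()


# Made-up stand-in for the BRFSS columns used in this thread.
toy = pd.DataFrame(
    {
        "_RACE": [1.0, 2.0, np.nan, 1.0, np.nan],
        "_RFSMOK3": [0.0, 1.0, 1.0, np.nan, 0.0],
        "_LLCPWT": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Count the missing domain values before estimating.
n_missing = int(toy["_RACE"].isna().sum())

# NaN never compares equal to itself, so a NaN domain level matches no rows.
rows_matching_nan = int((toy["_RACE"] == np.nan).sum())

# Keep only the rows with a known domain value.
clean = drop_missing_domain(toy, "_RACE")

print(n_missing, rows_matching_nan, len(clean))
```

If `n_missing` is nonzero on the real data, missing domain values are a likely suspect. Dropping rows also changes the set of observations that feed the variance estimate, so treat this as a diagnostic rather than a fix; updating samplics, as suggested above, is the simpler route.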

kevintcaron commented 4 months ago

Thank you for the quick response! This solved my issue. Please let me know if there is anything I can do to support continued development!

samplics-org / samplics, issue #56