samplics-org / samplics

Select, weight and analyze complex sample data
https://samplics-org.github.io/samplics/
MIT License
57 stars 11 forks source link

CrossTabulation Error on certain data #58

Open kevintcaron opened 2 months ago

kevintcaron commented 2 months ago

Hi @MamadouSDiallo,

Sorry to add another issue, but hopefully this is easier to resolve. I am attempting to tabulate smoking prevalence ('_RFSMOK3') by imputed race ('_IMPRACE') with BRFSS data. This works fine for all territories including Puerto Rico from 2018-2022. However, when using the 2021 BRFSS data for Puerto Rico I am getting the error below. I have also provided my code below the error, but it will require download of the 2021 BRFSS data (or other years for it to work). Do you know what is causing this issue?

ERROR:

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
Cell In[247], line 12
      8 # df = recode_smoker(df)
      9 # df = recode_smokeless(df)
     11 stratified_count = CrossTabulation(param=PopParam.count)
---> 12 stratified_count.tabulate(
     13     vars=df[[variable, '_IMPRACE']], 
     14     samp_weight=df['_LLCPWT'],
     15     stratum=df['_STSTR'],
     16     psu=df['_PSU'],
     17     single_psu=SinglePSUEst.skip,
     18     remove_nan=True)   
     20 df_out = stratified_count.to_dataframe()
     21 df_out

File [~\.conda\envs\ntcp_env\Lib\site-packages\samplics\categorical\tabulation.py:598](http://localhost:8888/lab/tree/~/.conda/envs/ntcp_env/Lib/site-packages/samplics/categorical/tabulation.py#line=597), in CrossTabulation.tabulate(self, vars, varnames, samp_weight, stratum, psu, ssu, fpc, deff, coef_var, single_psu, strata_comb, remove_nan)
    593 # TODO:
    594 # Replace the inversion of cov_prop and cov_prop_srs below by the a multiplication
    595 # we have that inv(x' V x) = inv(x' L L' x) = z'z where z = inv(L) x
    596 # L is the Cholesky factor i.e. L = np.linalg.cholesky(V)
    597 x1_t = np.transpose(x1)
--> 598 x2_tilde = x2 - x1 @ np.linalg.inv(x1_t @ cov_prop_srs @ x1) @ (
    599     x1_t @ cov_prop_srs @ x2
    600 )
    602 delta_est = np.linalg.inv(np.transpose(x2_tilde) @ cov_prop_srs @ x2_tilde) @ (
    603     np.transpose(x2_tilde) @ cov_prop @ x2_tilde  # TODO: is it cov_prop_srs
    604 )
    606 tbl_keys = list(tbl_est.point_est.keys())

File [~\.conda\envs\ntcp_env\Lib\site-packages\numpy\linalg\_linalg.py:608](http://localhost:8888/lab/tree/~/.conda/envs/ntcp_env/Lib/site-packages/numpy/linalg/_linalg.py#line=607), in inv(a)
    605 signature = 'D->D' if isComplexType(t) else 'd->d'
    606 with errstate(call=_raise_linalgerror_singular, invalid='call',
    607               over='ignore', divide='ignore', under='ignore'):
--> 608     ainv = _umath_linalg.inv(a, signature=signature)
    609 return wrap(ainv.astype(result_t, copy=False))

File [~\.conda\envs\ntcp_env\Lib\site-packages\numpy\linalg\_linalg.py:104](http://localhost:8888/lab/tree/~/.conda/envs/ntcp_env/Lib/site-packages/numpy/linalg/_linalg.py#line=103), in _raise_linalgerror_singular(err, flag)
    103 def _raise_linalgerror_singular(err, flag):
--> 104     raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix

CODE:

import pandas as pd
import samplics
import numpy as np
from samplics.estimation import TaylorEstimator, ReplicateEstimator
from samplics.categorical import Tabulation, CrossTabulation
from samplics.utils import PopParam, SinglePSUEst

df_2021 = pd.read_sas(r'LLCP2021.XPT')
df = df_2021.copy()
df = df[df['_STATE'] == 72]
recipient = 'Puerto Rico'
year = 2021
demo = _imprace.copy()
variable = '_RFSMOK3'

# df = recode_smoker(df)
# df = recode_smokeless(df)

stratified_count = CrossTabulation(param=PopParam.count)
stratified_count.tabulate(
    vars=df[[variable, '_IMPRACE']], 
    samp_weight=df['_LLCPWT'],
    stratum=df['_STSTR'],
    psu=df['_PSU'],
    single_psu=SinglePSUEst.skip,
    remove_nan=True)   

df_out = stratified_count.to_dataframe()
MamadouSDiallo commented 1 month ago

Hi @kevintcaron

The error is due to the fact that one cell (_RFSMOK3=2, _LLCPWT=4) has only one observation. I am debating how to handle this situation.

Best regards