"cannot reshape array" error message with crosstabs containing 0-value cells (samplics 0.3.12 and 0.3.13)

jeanbaptisteb commented 2 years ago

Hi, and thanks for this very useful library!

I encountered a problem when trying to perform a crosstab analysis with it, and it took me some time to understand that the problem comes from contingency tables containing zeros.

It's simpler to explain with an example. Consider the following made-up weighted dataset, with zero individuals at the intersection "Man/Other nationality":

import pandas
from samplics.categorical import Tabulation, CrossTabulation

df = pandas.DataFrame(
    data=[["Woman", "European"]] * 100
    + [["Woman", "American"]] * 35
    + [["Woman", "Other"]] * 93
    + [["Man", "European"]] * 150
    + [["Man", "American"]] * 77,
    columns=["Gender", "Nationality"],
)
# The weight pattern repeats 91 times to cover all 455 rows.
df["weights"] = [1, 0.3, 8, 3, 0.7] * 91
# Preview the data (pandas.concat replaces DataFrame.append, which was removed in pandas 2.0).
print(pandas.concat([df.head(3), df.tail(3)]))

Gender	Nationality	weights
0	Woman	European	1.0
1	Woman	European	0.3
2	Woman	European	8.0
...	...	...	...
452	Man	American	8.0
453	Man	American	3.0
454	Man	American	0.7

If you crosstab the data, you'll notice there are no men of "Other" nationality:

pandas.crosstab(df["Nationality"],
                df["Gender"])

Gender	Man	Woman
Nationality
American	77	35
European	150	100
Other	0	93

If I try to use samplics with it:

crosstab_samplics = CrossTabulation("count")
crosstab_samplics.tabulate(
    vars=df[["Gender", "Nationality"]],
    samp_weight=df["weights"],
    remove_nan=True,
)

it throws the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\me\env\lib\site-packages\samplics\categorical\tabulation.py", line 430, in tabulate
    - cell_est.reshape(vars_levels.shape[0], 1)
ValueError: cannot reshape array of size 5 into shape (6,1)

The problem disappears if I slightly change the dataset so that the crosstab no longer contains any zeros. I observed this problem with various datasets, so I'm almost certain it comes from these zero cells.
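One plausible reading of the message, sketched here without looking at the samplics source: Gender has 2 levels and Nationality has 3, so a full table has 6 cells, but only 5 combinations actually occur in the data. The snippet below rebuilds the dataset with plain pandas and NumPy and reproduces the same kind of ValueError. The variable names are illustrative and are not taken from the library.

```python
import itertools

import pandas as pd

# Same made-up dataset as above (weights are not needed for this check).
df = pd.DataFrame(
    [["Woman", "European"]] * 100
    + [["Woman", "American"]] * 35
    + [["Woman", "Other"]] * 93
    + [["Man", "European"]] * 150
    + [["Man", "American"]] * 77,
    columns=["Gender", "Nationality"],
)

# One entry per combination that actually occurs in the data.
observed = df.groupby(["Gender", "Nationality"]).size()
# One entry per combination that a full 2 x 3 table would contain.
all_cells = list(
    itertools.product(df["Gender"].unique(), df["Nationality"].unique())
)

n_observed = len(observed)
n_expected = len(all_cells)
print(n_observed, "observed cells,", n_expected, "expected cells")

# Reshaping the observed estimates to the expected number of rows fails.
try:
    observed.to_numpy().reshape(n_expected, 1)
    reshape_error = None
except ValueError as exc:
    reshape_error = str(exc)
print(reshape_error)
```

If this reading is right, any code path that sizes its arrays from the product of the levels while computing estimates only for observed cells would fail in the same way.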

Is it supposed to throw an error like that?

I'm not certain whether 1) the error simply comes from the fact that it's statistically incorrect to perform a weighted analysis on cross-tabulations containing zeros, or 2) it's a case that the library doesn't handle for the moment. In scenario 1), it might be useful to throw a more specific error message.

Anyway, here is my configuration, if it can help:

Windows 10, Python 3.10.2 (tags/v3.10.2:a58ebcc), [MSC v.1929 64 bit (AMD64)]
the code is executed in a virtual environment
the following possibly relevant packages are installed:
- numpy 1.22.3,
- pandas 1.4.1,
- statsmodels 0.13.2,
- matplotlib 3.5.1,
- scipy 1.8.0

NB: for some reason, pip install samplics --upgrade won't upgrade samplics from 0.3.12 to 0.3.13, so I had to install the newest version directly from the repo with pip install git+https://github.com/samplics-org/samplics.git . But anyway, the "cannot reshape array" message I encountered occurs in both versions, 0.3.12 and 0.3.13.
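Since the behaviour differs between releases, it can help to confirm which version is actually installed in the active environment before re-running the example. A minimal sketch using only the standard library; installed_version is a hypothetical helper name, not part of samplics.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version pip installed into this environment, or None.
print("samplics:", installed_version("samplics"))
```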

Thanks again for this library!

agnesmw commented 2 years ago

Hi,

First, I also want to thank you for a useful library! I have encountered the problem above as well, and concluded that it occurs when some cell in the crosstabulation is zero.

I would find it very useful if the other estimates could still be returned in this situation, for example by returning nan for the combinations of values that have no observations while still computing the point estimate and variance for the other combinations.
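To make the request concrete, here is a rough sketch of the point-estimate half of it in plain pandas. It is not how samplics computes anything and it does not estimate variances; weighted_cell_proportions is a hypothetical helper, and the small data frame is made up.

```python
import pandas as pd


def weighted_cell_proportions(df, row, col, weight):
    """Weighted proportion per (row, col) cell, with NaN for cells that have no observations."""
    totals = df.groupby([row, col])[weight].sum()
    proportions = totals / totals.sum()
    # Reindex on every combination of levels; unobserved cells become NaN.
    full_index = pd.MultiIndex.from_product(
        [sorted(df[row].unique()), sorted(df[col].unique())],
        names=[row, col],
    )
    return proportions.reindex(full_index)


example = pd.DataFrame(
    {
        "Gender": ["Woman", "Woman", "Woman", "Man", "Man"],
        "Nationality": ["European", "American", "Other", "European", "American"],
        "weights": [1.0, 0.3, 8.0, 3.0, 0.7],
    }
)
result = weighted_cell_proportions(example, "Gender", "Nationality", "weights")
print(result)
```

The "Man"/"Other" cell comes back as NaN while the five observed cells keep their estimates.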

Do you have any plans to implement such functionality?

Thank you in advance!

MamadouSDiallo commented 2 years ago

Hi,

Thank you for sharing these bugs. Indeed I am working on a solution to handle this issue. It should be included in the next minor release coming out soon.

With regards

MamadouSDiallo commented 2 years ago

Hi,

I have updated the package; please try and let me know if it fixes your issues.

Unfortunately, the documentation is not up to date. But basically, I refactored Tabulation() and CrossTabulation() to be more robust to empty cells.

Secondly, I added functionality to estimate() from TaylorEstimator() to handle singletons (strata with a single PSU). The options are defined by an Enum class called SinglePSUEst; see the parameter single_psu. The Enum takes these four options: error, skip, certainty, and combine.

class SinglePSUEst(Enum):
    """Estimation options for strata with singleton PSU"""

    error = "Raise Error when one PSU in a stratum"
    skip = "Set variance to zero and skip stratum with one PSU"
    certainty = "Use SSUs or lowest units to estimate the variance"
    combine = "Combine the strata with the singleton psu to another stratum"

Hence,

single_psu=SinglePSUEst.error: let it crash. This is good for a first run, to identify your singletons.
single_psu=SinglePSUEst.skip: skip the singletons and set the variance to 0.
single_psu=SinglePSUEst.certainty: treat the singletons as certainties and use the SSUs (if provided) or the individual records to estimate the variance.
single_psu=SinglePSUEst.combine: combine the singleton strata with other strata. You will have to specify strata_comb, a dictionary that maps the old strata to the new strata, i.e. {old_stratum1: new_stratum1, old_stratum2: new_stratum2, ...}.
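To illustrate what a singleton is and what a strata_comb mapping expresses, here is a small pandas sketch on a made-up design. It only mimics the idea outside the library; the column names and the data are assumptions, and samplics may apply the mapping differently internally.

```python
import pandas as pd

# Made-up design: stratum "B" contains a single PSU.
design = pd.DataFrame(
    {
        "stratum": ["A", "A", "B", "C", "C", "C"],
        "psu": [1, 2, 3, 4, 5, 5],
    }
)

# Singletons are the strata with exactly one distinct PSU.
psu_per_stratum = design.groupby("stratum")["psu"].nunique()
singletons = psu_per_stratum[psu_per_stratum == 1].index.tolist()
print("singleton strata:", singletons)

# {old_stratum: new_stratum}; strata not listed keep their label.
strata_comb = {"B": "A"}
design["stratum_combined"] = design["stratum"].map(
    lambda s: strata_comb.get(s, s)
)

# After combining, no stratum is left with a single PSU.
remaining = design.groupby("stratum_combined")["psu"].nunique()
remaining_singletons = remaining[remaining == 1].index.tolist()
print("remaining singleton strata:", remaining_singletons)
```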

I plan to add more imputation-like options down the road, but I hope these options will take care of most situations.

pip install --upgrade samplics does work for me. Let me know if you still have this issue.

BTW, I am testing the use of Enum as opposed to strings. This is more robust from a programming point of view, but strings may be more convenient for statisticians and data analysts. If you have an opinion, do not hesitate to share.
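One way to offer both, sketched here as an assumption rather than as how samplics does it: accept either an Enum member or its name as a string, and normalise the value once at the function boundary. resolve_single_psu is a hypothetical helper; the Enum body is copied from the definition above.

```python
from enum import Enum


class SinglePSUEst(Enum):
    """Estimation options for strata with singleton PSU"""

    error = "Raise Error when one PSU in a stratum"
    skip = "Set variance to zero and skip stratum with one PSU"
    certainty = "Use SSUs or lowest units to estimate the variance"
    combine = "Combine the strata with the singleton psu to another stratum"


def resolve_single_psu(option):
    """Accept a SinglePSUEst member or its name as a string (hypothetical helper)."""
    if isinstance(option, SinglePSUEst):
        return option
    try:
        # Enum lookup by member name, e.g. SinglePSUEst["skip"].
        return SinglePSUEst[option]
    except KeyError:
        valid = ", ".join(member.name for member in SinglePSUEst)
        raise ValueError(
            f"unknown single_psu option {option!r}; expected one of: {valid}"
        ) from None


print(resolve_single_psu("skip"))
print(resolve_single_psu(SinglePSUEst.combine))
```

With this kind of shim, analysts can keep typing 'skip' while the internals work with Enum members, and a typo produces a message that lists the valid names.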

With regards

jeanbaptisteb commented 2 years ago

@MamadouSDiallo Thank you so much, that's great news! I'll test it in a couple of weeks; I'm a bit overworked for the moment, but I just created a reminder in my calendar.

agnesmw commented 2 years ago

Thank you @MamadouSDiallo, I really appreciate being able to communicate with you here. I was on vacation last week, hence my answer is a bit late.

I have now tried the updated version of the package. If there are responses in all cells of the crosstabulation, it works as before. However, if any cell in the crosstabulation of the variables of interest (considering only responding units) is zero, I still get an error message, though a different one. Here is the error message:

  File "C:\Users.........\2475912047.py", line 2, in
    crosstab_temp.tabulate(

  File "C:\ProgramData\Anaconda3\envs.........\lib\site-packages\samplics\categorical\tabulation.py", line 503, in tabulate
    delta_est = np.linalg.inv(np.transpose(x2_tilde) @ cov_prop_srs @ x2_tilde) @ (

  File "<__array_function__ internals>", line 180, in inv

  File "C:\ProgramData\Anaconda3\envs............\lib\site-packages\numpy\linalg\linalg.py", line 545, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)

  File "C:\ProgramData\Anaconda3\envs............\lib\site-packages\numpy\linalg\linalg.py", line 88, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix

Do you have any idea how this can be fixed? Let me know if you want me to explain the error further.

Best regards

MamadouSDiallo commented 2 years ago

Hi

Would it be possible to create a small dummy dataset that reproduces the error?

Best regards

agnesmw commented 2 years ago

Hi,

Here are two simple examples that reproduce two different errors I get when running tabulate() on my original data. In both cases there is one cell containing a zero value. In the first example, the error seems to arise from the fact that no observation from group "two" answered "1" to question 1. In the second example, the error seems to arise from estimating the delta_est matrix.

import pandas as pd
from samplics.categorical import CrossTabulation

# Example 1, returns error "KeyError: '1__by__two'"
dummy = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2], 
         'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
         'nr_weight': [200, 123, 0, 0, 234, 123, 234, 0, 0], 
         'respondent': ['respondent', 'respondent', 
                        'non-respondent', 'non-respondent', 
                        'respondent', 'respondent',
                        'respondent', 'non-respondent',
                        'non-respondent']}

df_dummy = pd.DataFrame.from_dict(dummy)

pd.crosstab(df_dummy['q1'], df_dummy['group'])

crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=df_dummy[['q1', 'group']],
    samp_weight=df_dummy['nr_weight'],
    remove_nan=True,
    single_psu='skip',
)
print(crosstab_temp)

# Example 2, returns error "LinAlgError: Singular matrix"
dummy2 = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2], 
         'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'one'],
         'nr_weight': [200, 123, 0, 0, 234, 123, 234, 0, 123], 
         'respondent': ['respondent', 'respondent', 
                        'non-respondent', 'non-respondent', 
                        'respondent', 'respondent',
                        'respondent', 'non-respondent',
                        'respondent']}

df_dummy2 = pd.DataFrame.from_dict(dummy2)

pd.crosstab(df_dummy2['q1'], df_dummy2['group'])

crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=df_dummy2[['q1', 'group']],
    samp_weight=df_dummy2['nr_weight'],
    remove_nan=True,
    single_psu='skip',
)
print(crosstab_temp)
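The empty cells in data like this can be listed before calling tabulate(), which makes it easier to tell which combination is likely to trigger an error. The sketch below uses only pandas and the standard library; find_empty_cells is a hypothetical helper, and treating zero-weight rows as absent is an assumption about what matters for the estimation, not something confirmed from the samplics source.

```python
import itertools

import pandas as pd


def find_empty_cells(df, row, col, weight=None):
    """List (row, col) level combinations with no observations.

    If `weight` is given, rows with a weight of zero or less are ignored first.
    """
    data = df if weight is None else df[df[weight] > 0]
    observed = set(zip(data[row].tolist(), data[col].tolist()))
    levels = itertools.product(
        sorted(df[row].unique().tolist()), sorted(df[col].unique().tolist())
    )
    return [cell for cell in levels if cell not in observed]


# Data from Example 1 above (the 'respondent' column is not needed here).
example_1 = pd.DataFrame(
    {
        "q1": [1, 2, 2, 1, 2, 1, 2, 1, 2],
        "group": ["one", "one", "two", "one", "two", "one", "two", "one", "two"],
        "nr_weight": [200, 123, 0, 0, 234, 123, 234, 0, 0],
    }
)

empty_unweighted = find_empty_cells(example_1, "q1", "group")
empty_weighted = find_empty_cells(example_1, "q1", "group", weight="nr_weight")
print(empty_unweighted)
print(empty_weighted)
```

For Example 1 both calls report the single cell (1, 'two'), which matches the key named in the KeyError.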

Can you see any explanation for these errors, and can they be fixed somehow?

Thank you in advance!

MamadouSDiallo commented 2 years ago

Hi, let me know if 0.3.42 resolves this issue. Best

agnesmw commented 2 years ago

Hi,

It resolves the first issue (with relevant warnings). However, it does not solve the second error. I would be very happy if you could solve it or find some explanation in the data above why it happens.

Regards

agnesmw commented 1 year ago

Hi,

Any updates on the above?

Best

MamadouSDiallo commented 1 year ago

Hi

I made corrections back in November. Have you tested it since? Please do and let me know.

Best

agnesmw commented 1 year ago

Hi,

Thanks for your reply. The "LinAlgError: Singular matrix" in the second example remains. I've also encountered it with other data.

I would be very grateful if you could take a look at this "LinAlgError: Singular matrix" error and possible solutions to avoid it.

Best

MamadouSDiallo commented 1 year ago

I tested some of the data above and I am not getting this error, see code below.

What version of samplics are you using? Are you using the current one?

Can you paste the code here?

In [2]: # Example 2, returns error "LinAlgError: Singular matrix"
   ...: dummy2 = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2],
   ...:          'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'one'],
   ...:          'nr_weight': [200, 123, 0, 0, 234, 123, 234, 0, 123],
   ...:          'respondent': ['respondent', 'respondent',
   ...:                         'non-respondent', 'non-respondent',
   ...:                         'respondent', 'respondent',
   ...:                         'respondent', 'non-respondent',
   ...:                         'respondent']}
   ...: 
   ...: df_dummy2 = pd.DataFrame.from_dict(dummy2)
   ...: 
   ...: pd.crosstab(df_dummy2['q1'], df_dummy2['group'])
   ...: 
   ...: crosstab_temp = CrossTabulation("proportion")
   ...: crosstab_temp.tabulate(
   ...: vars=df_dummy2[['q1', 'group']],
   ...: samp_weight=df_dummy2['nr_weight'],
   ...: remove_nan=True,
   ...: single_psu = 'skip')
   ...: print(crosstab_temp)

Cross-tabulation of q1 and group
 Number of strata: 1
 Number of PSUs: 9
 Number of observations: 9
 Degrees of freedom: 8.00

 q1 group  proportion  stderror  lower_ci  upper_ci
 1   one    0.311475  0.203781  0.048134  0.801861
 1   two    0.000000  0.000000  0.000000  0.000000
 2   one    0.237223  0.167661  0.035412  0.724862
 2   two    0.451302  0.229535  0.088433  0.874582

Pearson (with Rao-Scott adjustment):
    Unadjusted - chi2(1): 3.3487 with p-value of 0.0673
    Adjusted - F(1.00, 8.00): 2.7502  with p-value of 0.1358

  Likelihood ratio (with Rao-Scott adjustment):
    Unadjusted - chi2(1): 4.4098 with p-value of 0.0357
    Adjusted - F(1.00, 8.00): 3.6216  with p-value of 0.0935

Thanks

agnesmw commented 1 year ago

Hi,

I updated to 0.4.1 today and now original example 2 works just as you've shown above. However, if I modify the example a bit, such that there is no respondent with q1==2 (I have such situations in my actual data, where no respondents picked a certain answer), the error remains. See the code for updated example 2 below.

# Example 2.2, returns error "LinAlgError: Singular matrix"
dummy2 = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2], 
         'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'one'],
         'nr_weight': [200, 0, 0, 234, 0, 234, 0, 123, 0], 
         'respondent': ['respondent', 'non-respondent', 
                        'non-respondent', 'respondent', 
                        'non-respondent', 'respondent',
                        'non-respondent', 'respondent',
                        'non-respondent']}

df_dummy2 = pd.DataFrame.from_dict(dummy2)

pd.crosstab(df_dummy2['q1'], df_dummy2['group'])

crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=df_dummy2[['q1', 'group']],
    samp_weight=df_dummy2['nr_weight'],
    remove_nan=True,
    single_psu='skip',
)
print(crosstab_temp)

Best

MamadouSDiallo commented 1 year ago

Thanks for noticing this edge case. In this case, there are no stats to compute and the frequency is 100% (only one category). But I will update the code to return a nicer message to the user and not fail.
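A small NumPy sketch of why this case can end in a singular matrix. It assumes that cov_prop_srs behaves like the multinomial covariance diag(p) - p pᵀ (up to a 1/n factor) and uses a single 2x2 interaction contrast in place of x2_tilde; both are stand-ins for illustration, not the library's code. When all of the weight sits in one cell, that covariance is all zeros, the quadratic form is exactly 0, and np.linalg.inv raises LinAlgError. The hypothetical helper safe_quadratic_inverse returns NaN instead of failing.

```python
import numpy as np


def srs_proportion_cov(p):
    """Multinomial covariance of cell proportions, up to a 1/n factor (assumed form)."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)


def safe_quadratic_inverse(x, cov):
    """Invert x' cov x, returning NaN entries when the matrix is rank deficient."""
    m = x.T @ cov @ x
    if np.linalg.matrix_rank(m) < m.shape[0]:
        return np.full_like(m, np.nan)
    return np.linalg.inv(m)


# Interaction contrast for a 2 x 2 table; a stand-in for x2_tilde.
x = np.array([[1.0], [-1.0], [-1.0], [1.0]])

p_one_cell = [1.0, 0.0, 0.0, 0.0]  # all weight in a single cell
p_spread = [0.4, 0.1, 0.2, 0.3]    # made-up proportions, every cell populated

degenerate = safe_quadratic_inverse(x, srs_proportion_cov(p_one_cell))
regular = safe_quadratic_inverse(x, srs_proportion_cov(p_spread))
print(degenerate)
print(regular)
```

Returning NaN for the test statistics in the degenerate case is in the same spirit as reporting that there is nothing to test when only one category is present.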

MamadouSDiallo commented 1 year ago

Here is the output using 0.4.3

Cross-tabulation of q1 and group
 Number of strata: 1
 Number of PSUs: 4
 Number of observations: 4
 Degrees of freedom: 3.00

 q1 group  proportion  stderror  lower_ci  upper_ci
 1   one         1.0       0.0       1.0       1.0

Pearson (with Rao-Scott adjustment):
    Unadjusted - chi2(0): 0.0000 with p-value of nan
    Adjusted - F(0.00, 0.00): 0.0000  with p-value of nan

  Likelihood ratio (with Rao-Scott adjustment):
    Unadjusted - chi2(0): 0.0000 with p-value of nan
    Adjusted - F(0.00, 0.00): 0.0000  with p-value of nan

agnesmw commented 1 year ago

Thank you!

I'm still having issues with "LinAlgError: Singular matrix" in my data though. Below you see the crosstab of a question with three options divided by education level with seven levels.

Running

crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=resultat[['f2', 'utbniv']],
    samp_weight=resultat['nr_weight'],
    remove_nan=True,
)

with this data returns the Singular Matrix error. Do you have any idea why?

Best

MamadouSDiallo commented 1 year ago

Did you upgrade to the latest version: 0.4.3 ?

agnesmw commented 1 year ago

Yes I did! So example 2.2 gave the output you showed above, but for my data the error remains.

MamadouSDiallo commented 1 year ago

I do not see why it failed. Does it still fail when you remove the weight or use 1?

agnesmw commented 1 year ago

Unfortunately it still fails when I remove the weight or use 1.

MamadouSDiallo commented 1 year ago

Can you upgrade to 0.4.4, try it with your data and let me know?

agnesmw commented 1 year ago

Thank you, it seems to have solved the issue! :)

For some combinations of background variables and questions I now get other errors, regarding the dimensions. For example this combination

gives the following error.

Traceback (most recent call last):

  File "C:\Users\widag\AppData\Local\Temp\ipykernel_67008\2246695084.py", line 2, in <module>
    crosstab_temp.tabulate(

  File "C:\ProgramData\Anaconda3\envs\Migrering\lib\site-packages\samplics\categorical\tabulation.py", line 573, in tabulate
    x2_tilde = x2 - x1 @ np.linalg.inv(x1_t @ cov_prop_srs @ x1) @ (x1_t @ cov_prop_srs @ x2)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 19 is different from 20)

And this combination gives the error below

Traceback (most recent call last):

  File "C:\Users\widag\AppData\Local\Temp\ipykernel_67008\106989882.py", line 2, in <module>
    crosstab_temp.tabulate(

  File "C:\ProgramData\Anaconda3\envs\Migrering\lib\site-packages\samplics\categorical\tabulation.py", line 568, in tabulate
    cov_prop_srs = cov_prop_srs[nonnull_rows][:, nonnull_rows]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 46 but corresponding boolean dimension is 50
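Both messages read as if one array was sized from every possible combination of levels while another was sized from the observed cells only (19 versus 20, 46 versus 50). That is a guess from the messages alone. The NumPy sketch below reproduces the second kind of error with made-up numbers and shows one way to keep a mask and a matrix aligned; none of the names come from samplics.

```python
import numpy as np

# One estimate per possible cell; the second cell is empty (placeholder values).
estimates_all = np.array([0.3, 0.0, 0.25, 0.45])
mask_all = estimates_all > 0  # boolean mask of length 4

# Covariance computed for the three observed cells only (placeholder values).
cov_observed = np.diag([0.04, 0.03, 0.05])

# A length-4 mask cannot index a 3 x 3 matrix.
try:
    cov_observed[mask_all]
    index_error = None
except IndexError as exc:
    index_error = str(exc)
print(index_error)

# Aligned version: place the observed block into a matrix over all cells,
# then select rows and columns with the same mask.
cov_all = np.zeros((estimates_all.size, estimates_all.size))
observed_positions = np.flatnonzero(mask_all)
cov_all[np.ix_(observed_positions, observed_positions)] = cov_observed
cov_selected = cov_all[np.ix_(mask_all, mask_all)]
print(cov_selected)
```

A dataset in which some level combinations are missing, and the number of missing cells equals the gap in the error message, would be a natural candidate for a reproducible example.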

Any thoughts on these?

Thank you again for fixing the LinAlg error!

MamadouSDiallo commented 1 year ago

How about now, using 0.4.5?

agnesmw commented 1 year ago

Unfortunately 0.4.5 yields the same errors as above.

MamadouSDiallo commented 1 year ago

Ok. I will look into it more to see if I can reproduce the error. Thanks for sharing these bugs.

agnesmw commented 1 year ago

Hi,

Do you have any estimated time plan for a fix of these issues?

Best

MamadouSDiallo commented 1 year ago

Hi @agnesmw, I was not able to reproduce the error. Can you share some data that reproduces it?

samplics-org / samplics

"cannot reshape array" error message with crosstabs containing 0-value cells (samplics 0.3.12 and 0.3.13) #37