Closed jeanbaptisteb closed 4 months ago
Hi,
First, I also want to thank you for a useful library! I have also encountered the problem above and concluded that it occurs because some cell in the crosstabulation is zero.
I would find it very useful if it would be possible to implement some functionality such that the other estimates could still be returned. For example just returning nan for the combination of values that does not have any observations and still estimate the point estimate and variance for the other combinations.
Do you have any thoughts of implementing such functionality?
Thank you in advance!
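A rough sketch of the behaviour requested above, written with plain pandas rather than samplics. `weighted_cell_proportions` is a hypothetical helper, not a samplics function, and it only covers the point estimates (no variances): every combination of levels is kept, and a combination with no observations comes back as NaN instead of stopping the whole computation.

```python
import numpy as np
import pandas as pd


def weighted_cell_proportions(df, row, col, weight):
    """Weighted proportion per (row, col) cell; unobserved cells become NaN.

    Hypothetical helper for illustration only, not part of samplics.
    """
    # Sum of weights in every observed cell.
    totals = df.groupby([row, col])[weight].sum()
    props = totals / totals.sum()
    # Full grid of level combinations, including those never observed.
    full_index = pd.MultiIndex.from_product(
        [sorted(df[row].unique()), sorted(df[col].unique())],
        names=[row, col],
    )
    return props.reindex(full_index)


# Made-up data: nobody in group "two" answered 1.
example = pd.DataFrame(
    {
        "q1": [1, 2, 2, 1],
        "group": ["one", "one", "two", "one"],
        "nr_weight": [200, 123, 234, 123],
    }
)
cell_props = weighted_cell_proportions(example, "q1", "group", "nr_weight")
print(cell_props)
```

The empty cell (1, "two") is reported as NaN while the three observed cells still sum to 1.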
Hi,
Thank you for sharing these bugs. Indeed I am working on a solution to handle this issue. It should be included in the next minor release coming out soon.
With regards
Hi,
I have updated the package; please try and let me know if it fixes your issues.
Unfortunately, the documentation is not up to date. But basically, I refactored Tabulation() and CrossTabulation() to be more robust to empty cells.
Secondly, I added functionalities to estimate() from TaylorEstimator() to handle singletons. The options are guided by an ENUM class called SinglePSUEst; see parameter single_psu. The Enum takes these four options: error, skip, certainty, and combine.
class SinglePSUEst(Enum):
    """Estimation options for strata with singleton PSU"""

    error = "Raise Error when one PSU in a stratum"
    skip = "Set variance to zero and skip stratum with one PSU"
    certainty = "Use SSUs or lowest units to estimate the variance"
    combine = "Combine the strata with the singleton psu to another stratum"
Hence,
I plan to add more imputation-like options down the road, but I hope these options will take care of most situations.
pip install --upgrade samplics
does work for me. Let me know if you still have this issue.
BTW, I am testing the use of ENUM as opposed to strings. This is more robust from a programming point of view but strings may be more convenient for statisticians and data analysts. If you have an opinion, do not hesitate to share.
With regards
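On the Enum versus string question, one possible middle ground is to accept both. The sketch below repeats the SinglePSUEst class quoted above so that it runs on its own; `coerce_single_psu` is a hypothetical helper and not part of samplics. It relies only on the standard-library behaviour that `SinglePSUEst["skip"]` looks a member up by name.

```python
from enum import Enum


class SinglePSUEst(Enum):
    """Copy of the options quoted above, repeated to keep the sketch self-contained."""

    error = "Raise Error when one PSU in a stratum"
    skip = "Set variance to zero and skip stratum with one PSU"
    certainty = "Use SSUs or lowest units to estimate the variance"
    combine = "Combine the strata with the singleton psu to another stratum"


def coerce_single_psu(value):
    """Hypothetical helper: accept an Enum member or the name of one."""
    if isinstance(value, SinglePSUEst):
        return value
    try:
        # Name-based lookup, e.g. "skip" -> SinglePSUEst.skip
        return SinglePSUEst[value]
    except KeyError:
        valid = ", ".join(member.name for member in SinglePSUEst)
        raise ValueError(f"single_psu must be one of: {valid}") from None


print(coerce_single_psu("skip"))
print(coerce_single_psu(SinglePSUEst.combine))
```

With this kind of wrapper, analysts could keep typing strings while the internals work with Enum members, and a misspelled option fails early with the list of valid names.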
@MamadouSDiallo Thank you so much, that's great news! I'll test it in a couple of weeks; I'm a bit overworked for the moment, but I just created a reminder in my calendar.
Thank you @MamadouSDiallo, I really appreciate being able to communicate with you here. I was on vacation last week, hence my answer is a bit late.
I have now tried the updated version of the package. If there are responses in all cells of the crosstabulation, it works as before. However, if any cell in the crosstabulation of the variables of interest (considering only responding units) is zero, I still get an error message, although a different one. Here is the error message:
  File "C:\Users.........\2475912047.py", line 2, in
  File "C:\ProgramData\Anaconda3\envs.........\lib\site-packages\samplics\categorical\tabulation.py", line 503, in tabulate
    delta_est = np.linalg.inv(np.transpose(x2_tilde) @ cov_prop_srs @ x2_tilde) @ (
  File "<__array_function__ internals>", line 180, in inv
  File "C:\ProgramData\Anaconda3\envs............\lib\site-packages\numpy\linalg\linalg.py", line 545, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
  File "C:\ProgramData\Anaconda3\envs............\lib\site-packages\numpy\linalg\linalg.py", line 88, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
Do you have any idea how this can be fixed? Let me know if you want me to explain the error further.
Best regards
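A small NumPy sketch of one way an empty cell can lead to "Singular matrix". This is an assumption about the mechanism, not a reading of the samplics source: it uses the textbook multinomial covariance diag(p) - p p^T of the cell proportions (last cell dropped), and `srs_covariance` is a hypothetical helper. A proportion of exactly zero gives a row and a column of zeros, so the matrix loses rank and `np.linalg.inv` refuses to invert it.

```python
import numpy as np


def srs_covariance(props):
    """Multinomial covariance diag(p) - p p^T (hypothetical helper, illustration only)."""
    p = np.asarray(props, dtype=float)
    return np.diag(p) - np.outer(p, p)


# Three of the four cells of a 2x2 table; the fourth is implied by sum-to-one.
all_observed = srs_covariance([0.30, 0.10, 0.25])
one_empty = srs_covariance([0.30, 0.00, 0.25])

rank_all_observed = np.linalg.matrix_rank(all_observed)
rank_one_empty = np.linalg.matrix_rank(one_empty)

try:
    np.linalg.inv(one_empty)
    inverted = True
except np.linalg.LinAlgError:
    inverted = False

# Dropping the empty cell before inverting leaves an invertible block.
observed = np.array([0.30, 0.00, 0.25]) > 0
reduced = one_empty[np.ix_(observed, observed)]
reduced_inv = np.linalg.inv(reduced)

print(rank_all_observed, rank_one_empty, inverted, reduced_inv.shape)
```

Removing the rows and columns of empty cells before the inversion is only one option; whether it is statistically appropriate for a given test is a separate question.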
Hi
Will it be possible to create a small dummy dataset to reproduce the error?
Best regards
Hi,
Here are two simple examples that reproduce two different errors I receive when running tabulate() on my original data. In both cases there is one cell containing a zero value. In the first example the error seems to arise because no observation from group "two" answered "1" to question 1. In the second example the error seems to arise when estimating the delta_est matrix.
import pandas as pd
from samplics.categorical import CrossTabulation

# Example 1, returns error "KeyError: '1__by__two'"
dummy = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2],
         'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
         'nr_weight': [200, 123, 0, 0, 234, 123, 234, 0, 0],
         'respondent': ['respondent', 'respondent',
                        'non-respondent', 'non-respondent',
                        'respondent', 'respondent',
                        'respondent', 'non-respondent',
                        'non-respondent']}

df_dummy = pd.DataFrame.from_dict(dummy)

pd.crosstab(df_dummy['q1'], df_dummy['group'])

crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=df_dummy[['q1', 'group']],
    samp_weight=df_dummy['nr_weight'],
    remove_nan=True,
    single_psu='skip')
print(crosstab_temp)

# Example 2, returns error "LinAlgError: Singular matrix"
dummy2 = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2],
          'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'one'],
          'nr_weight': [200, 123, 0, 0, 234, 123, 234, 0, 123],
          'respondent': ['respondent', 'respondent',
                         'non-respondent', 'non-respondent',
                         'respondent', 'respondent',
                         'respondent', 'non-respondent',
                         'respondent']}

df_dummy2 = pd.DataFrame.from_dict(dummy2)

pd.crosstab(df_dummy2['q1'], df_dummy2['group'])

crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=df_dummy2[['q1', 'group']],
    samp_weight=df_dummy2['nr_weight'],
    remove_nan=True,
    single_psu='skip')
print(crosstab_temp)
Can you see any explanation for these errors, and can they be fixed somehow?
Thank you in advance!
Hi Let me know if 0.3.42 resolves this issue. Best
Hi,
It resolves the first issue (with relevant warnings). However, it does not solve the second error. I would be very happy if you could solve it or find some explanation in the data above why it happens.
Regards
Hi,
Any updates on the above?
Best
Hi
I made corrections back in November. Have you tested it since? Please do and let me know.
Best
Hi,
Thanks for your reply. The "LinAlgError: Singular matrix" in the second example remains. I've also encountered it using other data.
I would be very grateful if you could take a look at this "LinAlgError: Singular matrix" error and possible solutions to avoid it.
Best
I tested some of the data above and I am not getting this error, see code below.
What version of samplics are you using? Are you using the current one?
Can you paste the code here?
In [2]: # Example 2, returns error "LinAlgError: Singular matrix"
...: dummy2 = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2],
...: 'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'one'],
...: 'nr_weight': [200, 123, 0, 0, 234, 123, 234, 0, 123],
...: 'respondent': ['respondent', 'respondent',
...: 'non-respondent', 'non-respondent',
...: 'respondent', 'respondent',
...: 'respondent', 'non-respondent',
...: 'respondent']}
...:
...: df_dummy2 = pd.DataFrame.from_dict(dummy2)
...:
...: pd.crosstab(df_dummy2['q1'], df_dummy2['group'])
...:
...: crosstab_temp = CrossTabulation("proportion")
...: crosstab_temp.tabulate(
...: vars=df_dummy2[['q1', 'group']],
...: samp_weight=df_dummy2['nr_weight'],
...: remove_nan=True,
...: single_psu = 'skip')
...: print(crosstab_temp)
Cross-tabulation of q1 and group
Number of strata: 1
Number of PSUs: 9
Number of observations: 9
Degrees of freedom: 8.00
q1  group  proportion  stderror  lower_ci  upper_ci
 1    one    0.311475  0.203781  0.048134  0.801861
 1    two    0.000000  0.000000  0.000000  0.000000
 2    one    0.237223  0.167661  0.035412  0.724862
 2    two    0.451302  0.229535  0.088433  0.874582
Pearson (with Rao-Scott adjustment):
Unadjusted - chi2(1): 3.3487 with p-value of 0.0673
Adjusted - F(1.00, 8.00): 2.7502 with p-value of 0.1358
Likelihood ratio (with Rao-Scott adjustment):
Unadjusted - chi2(1): 4.4098 with p-value of 0.0357
Adjusted - F(1.00, 8.00): 3.6216 with p-value of 0.0935
Thanks
Hi,
I updated to 0.4.1 today and now original example 2 works just as you've shown above. However, if I modify the example a bit, such that there is no respondent with q1==2 (I have such situations in my actual data, where no respondents picked a certain answer), the error remains. See the code for updated example 2 below.
# Example 2.2, returns error "LinAlgError: Singular matrix"
dummy2 = {'q1': [1, 2, 2, 1, 2, 1, 2, 1, 2],
          'group': ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'one', 'one'],
          'nr_weight': [200, 0, 0, 234, 0, 234, 0, 123, 0],
          'respondent': ['respondent', 'non-respondent',
                         'non-respondent', 'respondent',
                         'non-respondent', 'respondent',
                         'non-respondent', 'respondent',
                         'non-respondent']}

df_dummy2 = pd.DataFrame.from_dict(dummy2)

pd.crosstab(df_dummy2['q1'], df_dummy2['group'])

crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=df_dummy2[['q1', 'group']],
    samp_weight=df_dummy2['nr_weight'],
    remove_nan=True,
    single_psu='skip')
print(crosstab_temp)
Best
Thanks for noticing this edge case. In this case, there are no stats to compute and the frequency is 100% (only one category). But I will update the code to return a nicer message to the user and not fail.
Here is the output using 0.4.3
Cross-tabulation of q1 and group
Number of strata: 1
Number of PSUs: 4
Number of observations: 4
Degrees of freedom: 3.00
q1 group proportion stderror lower_ci upper_ci
1 one 1.0 0.0 1.0 1.0
Pearson (with Rao-Scott adjustment):
Unadjusted - chi2(0): 0.0000 with p-value of nan
Adjusted - F(0.00, 0.00): 0.0000 with p-value of nan
Likelihood ratio (with Rao-Scott adjustment):
Unadjusted - chi2(0): 0.0000 with p-value of nan
Adjusted - F(0.00, 0.00): 0.0000 with p-value of nan
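For the single-category case discussed above, a caller-side check can tell in advance whether there is anything to test. The sketch below is an assumption about how one might guard the call, not how samplics handles it internally; `can_test_independence` is a hypothetical helper. It drops zero-weight rows (the non-respondents in Example 2.2) and requires at least two levels in each variable.

```python
import pandas as pd


def can_test_independence(df, row, col, weight):
    """True if both variables keep at least two levels among positive-weight rows.

    Hypothetical helper for illustration only, not part of samplics.
    """
    positive = df[df[weight] > 0]
    return positive[row].nunique() >= 2 and positive[col].nunique() >= 2


# Data from Example 2.2: every respondent answered 1 and belongs to group "one".
example_2_2 = pd.DataFrame(
    {
        "q1": [1, 2, 2, 1, 2, 1, 2, 1, 2],
        "group": ["one", "one", "two", "one", "two", "one", "two", "one", "one"],
        "nr_weight": [200, 0, 0, 234, 0, 234, 0, 123, 0],
    }
)
testable = can_test_independence(example_2_2, "q1", "group", "nr_weight")
print(testable)
```

When the check fails, the proportions can still be reported, but the chi-square statistics have zero degrees of freedom, which matches the nan p-values in the output above.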
Thank you!
I'm still having issues with "LinAlgError: Singular matrix" in my data though. Below you see the crosstab of a question with three options divided by education level with seven levels.
Running
crosstab_temp = CrossTabulation("proportion")
crosstab_temp.tabulate(
    vars=resultat[['f2', 'utbniv']],
    samp_weight=resultat['nr_weight'],
    remove_nan=True)
with this data returns the Singular Matrix error. Do you have any idea why?
Best
Did you upgrade to the latest version: 0.4.3 ?
Yes I did! So example 2.2 gave the output you showed above, but for my data the error remains.
I do not see why it failed. Does it still fail when you remove the weight or use 1?
Unfortunately it still fails when I remove the weight or use 1.
Can you upgrade to 0.4.4, try it with your data and let me know?
Thank you, it seems to have solved the issue! :)
For some combinations of background variables and questions I now get other errors, regarding the dimensions. For example this combination
gives the following error.
Traceback (most recent call last):
  File "C:\Users\widag\AppData\Local\Temp\ipykernel_67008\2246695084.py", line 2, in <module>
    crosstab_temp.tabulate(
  File "C:\ProgramData\Anaconda3\envs\Migrering\lib\site-packages\samplics\categorical\tabulation.py", line 573, in tabulate
    x2_tilde = x2 - x1 @ np.linalg.inv(x1_t @ cov_prop_srs @ x1) @ (x1_t @ cov_prop_srs @ x2)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 19 is different from 20)
And this combination gives the error below
Traceback (most recent call last):
  File "C:\Users\widag\AppData\Local\Temp\ipykernel_67008\106989882.py", line 2, in <module>
    crosstab_temp.tabulate(
  File "C:\ProgramData\Anaconda3\envs\Migrering\lib\site-packages\samplics\categorical\tabulation.py", line 568, in tabulate
    cov_prop_srs = cov_prop_srs[nonnull_rows][:, nonnull_rows]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 46 but corresponding boolean dimension is 50
Any thoughts on these?
Thank you again for fixing the LinAlg error!
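Since the mismatched sizes in the two tracebacks (19 versus 20, 46 versus 50) suggest that some combinations of levels are missing from the data, a quick diagnostic is to list the empty cells before calling tabulate(). This is only a guess at the cause, and `empty_cells` is a hypothetical helper built on pd.crosstab, not a samplics function. The data is Example 2 from earlier in this thread.

```python
import pandas as pd


def empty_cells(df, row, col, weight=None):
    """List (row, col) level combinations that have no observations.

    Hypothetical helper for illustration only. If `weight` is given,
    rows with a non-positive weight are ignored first.
    """
    if weight is not None:
        df = df[df[weight] > 0]
    counts = pd.crosstab(df[row], df[col])
    return [
        (r, c)
        for r in counts.index
        for c in counts.columns
        if counts.loc[r, c] == 0
    ]


example_2 = pd.DataFrame(
    {
        "q1": [1, 2, 2, 1, 2, 1, 2, 1, 2],
        "group": ["one", "one", "two", "one", "two", "one", "two", "one", "one"],
        "nr_weight": [200, 123, 0, 0, 234, 123, 234, 0, 123],
    }
)
missing = empty_cells(example_2, "q1", "group", weight="nr_weight")
print(missing)
```

A list like this, together with the number of levels of each variable, may also make it easier to build a small dataset that reproduces the dimension errors.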
How about now, using 0.4.5?
Unfortunately 0.4.5 yields the same errors as above.
Ok. I will look into it more to see if I can reproduce the error. Thanks for sharing these bugs.
Hi,
Do you have any estimated time plan for a fix of these issues?
Best
Hi @agnesmw I was not able to reproduce the error. Can you share some data that can show the error?
Hi, and thanks for this very useful library!
I encountered a problem when trying to perform a crosstab analysis with it, and it took me some time to understand that the problem comes from contingency tables containing zeros.
It's simpler to explain with an example, so, case in point, consider the following made-up weighted dataset, with no individuals at the intersection "Man/Other nationality":
If you crosstab the data, you'll notice there are no men of "Other" nationality:
If I try to use samplics with it:
it throws the following error:
The problem disappears if I slightly change the dataset so that the crosstab no longer contains any zeros. I observed this problem with various datasets, so I'm almost certain the problem comes from these zero cells.
Is it supposed to throw an error like that?
I'm not certain if 1) the error simply comes from the fact that it's statistically incorrect to perform analysis with weights on crosstables containing zeros, or 2) if it's a case that the library doesn't take into account for the moment. If it's scenario 1), it might be useful to throw a more specific error message.
Anyway, here is my configuration, if it can help:
NB: for some reason,
pip install samplics --upgrade
won't upgrade samplics from 0.3.12 to 0.3.13, so I had to install the newest version directly from the repo with
pip install git+https://github.com/samplics-org/samplics.git
But anyway, the "cannot reshape array" message I encountered occurs in both versions, 0.3.12 and 0.3.13.
Thanks again for this library!