Closed ljwolf closed 9 months ago
I think we still need to take a look at the way the size replicates are generated to make sure we've got the sample standard deviation of and standard error of the sample mean correct. @nanophyto, where's that in the code?
This is great, thanks. I took most of yesterday off to enjoy the sunshine, so I'll take a closer look today.
@ljwolf
I'm getting the following error when I set n_replicates to a value other than 1 in m.simulate_predictions()
.
The issue only occurs when conditional=False
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[4], [line 6](vscode-notebook-cell:?execution_count=4&line=6)
[4](vscode-notebook-cell:?execution_count=4&line=4) X_predict = [100, 99, 98, 101, 102, 100]*1000
[5](vscode-notebook-cell:?execution_count=4&line=5) m = regression_simulation(X_train, X_predict, Y_train)
----> [6](vscode-notebook-cell:?execution_count=4&line=6) test1 = m.simulate_predictions(boot=False, n_replicates=10)
[7](vscode-notebook-cell:?execution_count=4&line=7) sns.histplot(x=test1.flatten())
[8](vscode-notebook-cell:?execution_count=4&line=8) plt.show()
File [~/CoccoData/coccodata/regression.py:216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216), in regression_simulation.simulate_predictions(self, X_predict, params, conditional, n_replicates, boot)
[209](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:209) warnings.warn(
[210](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:210) "location parameters (`params`) provided, but unconditional distribution was requested. Ignoring the provided location parameters...",
[211](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:211) stacklevel=2,
[212](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:212) )
[213](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:213) params = self.simulate_regression_params(
[214](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:214) n_replicates=n_replicates, boot=boot
[215](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:215) )
--> [216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216) mus = X_predict @ params
[217](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:217) simulated_data = np.column_stack(
[218](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:218) [self.results.model.family.predict(mu_vec) for mu_vec in mus.T]
[219](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:219) )
[220](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:220) return simulated_data
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 10 is different from 1)
I think we still need to take a look at the way the size replicates are generated to make sure we've got the sample standard deviation of and standard error of the sample mean correct. @nanophyto, where's that in the code?
It is currently not part of the code since I still have to go back to the data sources and extract n (which, unfortunately, is unavailable for many studies). However, it would probably be easiest to add in library.py
https://github.com/nanophyto/CoccoData/blob/0931f3b222f785352961089ea1357b04c78d6112/coccodata/library.py#L72-L96
I have a folder of csvs (one for size, pic, and poc) which contains ['species', 'mean', and 'sd'] which library.py imports: https://github.com/nanophyto/CoccoData/blob/0931f3b222f785352961089ea1357b04c78d6112/coccodata/library.py#L51-L61
I can add n values to those csvs in the other scripts (pre_size. pre_pic etc.)
I will implement it today and open a separate issue so you can take a look.
@ljwolf
I'm getting the following error when I set n_replicates to a value other than 1 in
m.simulate_predictions()
.The issue only occurs when
conditional=False
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[4], [line 6](vscode-notebook-cell:?execution_count=4&line=6) [4](vscode-notebook-cell:?execution_count=4&line=4) X_predict = [100, 99, 98, 101, 102, 100]*1000 [5](vscode-notebook-cell:?execution_count=4&line=5) m = regression_simulation(X_train, X_predict, Y_train) ----> [6](vscode-notebook-cell:?execution_count=4&line=6) test1 = m.simulate_predictions(boot=False, n_replicates=10) [7](vscode-notebook-cell:?execution_count=4&line=7) sns.histplot(x=test1.flatten()) [8](vscode-notebook-cell:?execution_count=4&line=8) plt.show() File [~/CoccoData/coccodata/regression.py:216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216), in regression_simulation.simulate_predictions(self, X_predict, params, conditional, n_replicates, boot) [209](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:209) warnings.warn( [210](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:210) "location parameters (`params`) provided, but unconditional distribution was requested. Ignoring the provided location parameters...", [211](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:211) stacklevel=2, [212](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:212) ) [213](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:213) params = self.simulate_regression_params( [214](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:214) n_replicates=n_replicates, boot=boot [215](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:215) ) --> [216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216) mus = X_predict @ params [217](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:217) simulated_data = np.column_stack( [218](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:218) [self.results.model.family.predict(mu_vec) for mu_vec in mus.T] [219](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:219) ) [220](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:220) return simulated_data ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 10 is different from 1)
Sorry, I think that the input X_predict needs to be a column vector (numpy.asarray(X_predict).reshape(-1,1)
).
@ljwolf I'm getting the following error when I set n_replicates to a value other than 1 in
m.simulate_predictions()
. The issue only occurs whenconditional=False
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[4], [line 6](vscode-notebook-cell:?execution_count=4&line=6) [4](vscode-notebook-cell:?execution_count=4&line=4) X_predict = [100, 99, 98, 101, 102, 100]*1000 [5](vscode-notebook-cell:?execution_count=4&line=5) m = regression_simulation(X_train, X_predict, Y_train) ----> [6](vscode-notebook-cell:?execution_count=4&line=6) test1 = m.simulate_predictions(boot=False, n_replicates=10) [7](vscode-notebook-cell:?execution_count=4&line=7) sns.histplot(x=test1.flatten()) [8](vscode-notebook-cell:?execution_count=4&line=8) plt.show() File [~/CoccoData/coccodata/regression.py:216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216), in regression_simulation.simulate_predictions(self, X_predict, params, conditional, n_replicates, boot) [209](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:209) warnings.warn( [210](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:210) "location parameters (`params`) provided, but unconditional distribution was requested. Ignoring the provided location parameters...", [211](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:211) stacklevel=2, [212](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:212) ) [213](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:213) params = self.simulate_regression_params( [214](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:214) n_replicates=n_replicates, boot=boot [215](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:215) ) --> [216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216) mus = X_predict @ params [217](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:217) simulated_data = np.column_stack( [218](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:218) [self.results.model.family.predict(mu_vec) for mu_vec in mus.T] [219](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:219) ) [220](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:220) return simulated_data ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 10 is different from 1)
Sorry, I think that the input X_predict needs to be a column vector (
numpy.asarray(X_predict).reshape(-1,1)
).
No worries, that is a quick fix. Thanks!
@ljwolf Another bug /quick question: With the one-hot-encoding, the log transformation is resulting in some NaNs. A quick fix would be to say x = np.log(x+1). Is there any reason I shouldn't do this?
ValueError Traceback (most recent call last)
Cell In[8], line 12
9 X_train = d[['Volume', 'Phase_HET', 'Phase_HOL']].astype(float)
11 X_predict = [100, 99, 98, 101, 102, 100]*1000
---> 12 m = regression_simulation(X_train, X_predict, Y_train)
13 # test1 = m.simulate_data()
14 # sns.histplot(x=test1)
15 # plt.show()
16 m.return_performance()
File ~/CoccoData/coccodata/regression.py:53, in regression_simulation.__init__(self, X_train, X_predict, Y_train)
49 def __init__(self, X_train, X_predict, Y_train):
50 # won't this break ordering between Y and x?
51 # see @ljwolf's implementation later, in the
52 # regression coefficient example
---> 53 allometric_model = sm.GLM(
54 Y_train,
55 np.log(X_train),
56 family=sm.families.Gamma(link=sm.families.links.Log()),
57 )
58 # self.model = sm.GLM(Y_train, X_train, family=sm.families.Gaussian()) # sm.OLS(Y, X)
59 # print(X_train)
61 self._fit_function = lambda y, x: (
62 sm.GLM(
...
-> 1654 s = gufunc(a, signature=signature, extobj=extobj)
1655 s = s.astype(_realType(result_t), copy=False)
1656 return s
ValueError: On entry to DLASCL parameter number 5 had an illegal value
No, should be ok. There are differing opinions on how to do this, but we’re mainly interested in the predictions, so these debates are not as relevant.
ljw
From: Joost de Vries @.> Date: Monday, 29 January 2024 at 14:55 To: nanophyto/CoccoData @.> Cc: Levi John Wolf @.>, Mention @.> Subject: Re: [nanophyto/CoccoData] swap to conditional simulations (PR #2)
@ljwolfhttps://github.com/ljwolf Another bug /quick question: With the one-hot-encoding, the log transformation is resulting in some NaNs. A quick fix would be to say x = np.log(x+1). Is there any reason I shouldn't do this?
ValueError Traceback (most recent call last)
Cell In[8], line 12
9 X_train = d[['Volume', 'Phase_HET', 'Phase_HOL']].astype(float)
11 X_predict = [100, 99, 98, 101, 102, 100]*1000
---> 12 m = regression_simulation(X_train, X_predict, Y_train)
13 # test1 = m.simulate_data()
14 # sns.histplot(x=test1)
15 # plt.show()
16 m.return_performance()
File ~/CoccoData/coccodata/regression.py:53, in regression_simulation.init(self, X_train, X_predict, Y_train)
49 def __init__(self, X_train, X_predict, Y_train):
50 # won't this break ordering between Y and x?
51 # see @ljwolf's implementation later, in the
52 # regression coefficient example
---> 53 allometric_model = sm.GLM(
54 Y_train,
55 np.log(X_train),
56 family=sm.families.Gamma(link=sm.families.links.Log()),
57 )
58 # self.model = sm.GLM(Y_train, X_train, family=sm.families.Gaussian()) # sm.OLS(Y, X)
59 # print(X_train)
61 self._fit_function = lambda y, x: (
62 sm.GLM(
...
-> 1654 s = gufunc(a, signature=signature, extobj=extobj)
1655 s = s.astype(_realType(result_t), copy=False)
1656 return s
ValueError: On entry to DLASCL parameter number 5 had an illegal value
— Reply to this email directly, view it on GitHubhttps://github.com/nanophyto/CoccoData/pull/2#issuecomment-1914862045, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AARFR44SGGUHDY4KBV43IELYQ6Z6NAVCNFSM6AAAAABCLCG7OCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJUHA3DEMBUGU. You are receiving this because you were mentioned.Message ID: @.***>
@ljwolf Another question... When I run the model with one-hot-encoding it starts predicting negative values (both when bootstrap=False
and when bootstrap=True
).
d = pd.read_csv("/home/phyto/CoccoData/data/unprocessed/rosie_size_pic.csv")
d = d.dropna()
d = d[d['PIC pg C'] >0]
d = d[d['Volume'] >0]
Y_train = d['PIC pg C']
d = pd.get_dummies(d, columns=['Phase'], dtype=float)
X_train = d[['Volume', 'Phase_HET', 'Phase_HOL']].astype(float)
X_predict = pd.DataFrame({'Volume': np.random.gamma(8, 2, 1000),
'Phase_HET':[0]*1000,
'Phase_HOL':[1]*1000})
m = regression_simulation(X_train, X_predict, Y_train)
c_simulation = m.simulate_predictions(boot=True, n_replicates=1)
np.min(c_simulation)
>>> -36.04365338911715
@ljwolf there also seems to be an issue with the model hugely underpredicting Y_pred (i.e. values of 6 when values of 200 are expected) It seems that X_predict is not log transformed which might be the issue?
This swaps the simulator to a conditional simulation. You can see that, generally, there isn't quite a big difference for the poulton data, one of a few different things @nanophyto and I talked about on the call today.
We should chat a bit more about what the next steps here are. Check the end of the file for usage, and hopefully that's clear in terms of what we need to simulate.
The uncertainty in the relationship between coccolith size and POC is modelled by the
simulate_regression_params()
. From this function, you get random replicates of the possible estimated relationship between size and POC.Then, we can use these to get uncertainty about the final POC estimate from a set of input size replicates. Schematically:
the
simulate_predictions()
function can work with only the first source of uncertainty whenconditional=True
, but then propagates the second source of uncertainty whenconditional=False
.