nanophyto / CASCADE

MIT License
1 stars 0 forks source link

swap to conditional simulations #2

Closed ljwolf closed 9 months ago

ljwolf commented 9 months ago

This swaps the simulator to a conditional simulation. You can see that, generally, there isn't quite a big difference for the poulton data, one of a few different things @nanophyto and I talked about on the call today.

We should chat a bit more about what the next steps here are. Check the end of the file for usage, and hopefully that's clear in terms of what we need to simulate.

The uncertainty in the relationship between coccolith size and POC is modelled by the simulate_regression_params(). From this function, you get random replicates of the possible estimated relationship between size and POC.

Then, we can use these to get uncertainty about the final POC estimate from a set of input size replicates. Schematically:

  1. the size replicates we build using the sample reconstruction measure the natural variability in measurements of species size, as well as the measurement error of species size
  2. the slope replicates that this PR models measure the estimation error relating coccosphere size to POC.

the simulate_predictions() function can work with only the first source of uncertainty when conditional=True, but then propagates the second source of uncertainty when conditional=False.

ljwolf commented 9 months ago

I think we still need to take a look at the way the size replicates are generated to make sure we've got the sample standard deviation of and standard error of the sample mean correct. @nanophyto, where's that in the code?

nanophyto commented 9 months ago

This is great, thanks. I took most of yesterday off to enjoy the sunshine, so I'll take a closer look today.

nanophyto commented 9 months ago

@ljwolf

I'm getting the following error when I set n_replicates to a value other than 1 in m.simulate_predictions().

The issue only occurs when conditional=False

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], [line 6](vscode-notebook-cell:?execution_count=4&line=6)
      [4](vscode-notebook-cell:?execution_count=4&line=4) X_predict = [100, 99, 98, 101, 102, 100]*1000
      [5](vscode-notebook-cell:?execution_count=4&line=5) m = regression_simulation(X_train, X_predict, Y_train)
----> [6](vscode-notebook-cell:?execution_count=4&line=6) test1 = m.simulate_predictions(boot=False, n_replicates=10)
      [7](vscode-notebook-cell:?execution_count=4&line=7) sns.histplot(x=test1.flatten())
      [8](vscode-notebook-cell:?execution_count=4&line=8) plt.show()

File [~/CoccoData/coccodata/regression.py:216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216), in regression_simulation.simulate_predictions(self, X_predict, params, conditional, n_replicates, boot)
    [209](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:209)         warnings.warn(
    [210](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:210)             "location parameters (`params`)  provided, but unconditional distribution was requested. Ignoring the provided location parameters...",
    [211](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:211)             stacklevel=2,
    [212](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:212)         )
    [213](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:213)     params = self.simulate_regression_params(
    [214](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:214)         n_replicates=n_replicates, boot=boot
    [215](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:215)     )
--> [216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216)     mus = X_predict @ params
    [217](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:217)     simulated_data = np.column_stack(
    [218](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:218)         [self.results.model.family.predict(mu_vec) for mu_vec in mus.T]
    [219](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:219)     )
    [220](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:220) return simulated_data

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 10 is different from 1)
nanophyto commented 9 months ago

I think we still need to take a look at the way the size replicates are generated to make sure we've got the sample standard deviation of and standard error of the sample mean correct. @nanophyto, where's that in the code?

It is currently not part of the code since I still have to go back to the data sources and extract n (which, unfortunately, is unavailable for many studies). However, it would probably be easiest to add in library.py https://github.com/nanophyto/CoccoData/blob/0931f3b222f785352961089ea1357b04c78d6112/coccodata/library.py#L72-L96

I have a folder of csvs (one for size, pic, and poc) which contains ['species', 'mean', and 'sd'] which library.py imports: https://github.com/nanophyto/CoccoData/blob/0931f3b222f785352961089ea1357b04c78d6112/coccodata/library.py#L51-L61

I can add n values to those csvs in the other scripts (pre_size. pre_pic etc.)

I will implement it today and open a separate issue so you can take a look.

ljwolf commented 9 months ago

@ljwolf

I'm getting the following error when I set n_replicates to a value other than 1 in m.simulate_predictions().

The issue only occurs when conditional=False

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], [line 6](vscode-notebook-cell:?execution_count=4&line=6)
      [4](vscode-notebook-cell:?execution_count=4&line=4) X_predict = [100, 99, 98, 101, 102, 100]*1000
      [5](vscode-notebook-cell:?execution_count=4&line=5) m = regression_simulation(X_train, X_predict, Y_train)
----> [6](vscode-notebook-cell:?execution_count=4&line=6) test1 = m.simulate_predictions(boot=False, n_replicates=10)
      [7](vscode-notebook-cell:?execution_count=4&line=7) sns.histplot(x=test1.flatten())
      [8](vscode-notebook-cell:?execution_count=4&line=8) plt.show()

File [~/CoccoData/coccodata/regression.py:216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216), in regression_simulation.simulate_predictions(self, X_predict, params, conditional, n_replicates, boot)
    [209](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:209)         warnings.warn(
    [210](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:210)             "location parameters (`params`)  provided, but unconditional distribution was requested. Ignoring the provided location parameters...",
    [211](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:211)             stacklevel=2,
    [212](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:212)         )
    [213](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:213)     params = self.simulate_regression_params(
    [214](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:214)         n_replicates=n_replicates, boot=boot
    [215](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:215)     )
--> [216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216)     mus = X_predict @ params
    [217](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:217)     simulated_data = np.column_stack(
    [218](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:218)         [self.results.model.family.predict(mu_vec) for mu_vec in mus.T]
    [219](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:219)     )
    [220](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:220) return simulated_data

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 10 is different from 1)

Sorry, I think that the input X_predict needs to be a column vector (numpy.asarray(X_predict).reshape(-1,1)).

nanophyto commented 9 months ago

@ljwolf I'm getting the following error when I set n_replicates to a value other than 1 in m.simulate_predictions(). The issue only occurs when conditional=False

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], [line 6](vscode-notebook-cell:?execution_count=4&line=6)
      [4](vscode-notebook-cell:?execution_count=4&line=4) X_predict = [100, 99, 98, 101, 102, 100]*1000
      [5](vscode-notebook-cell:?execution_count=4&line=5) m = regression_simulation(X_train, X_predict, Y_train)
----> [6](vscode-notebook-cell:?execution_count=4&line=6) test1 = m.simulate_predictions(boot=False, n_replicates=10)
      [7](vscode-notebook-cell:?execution_count=4&line=7) sns.histplot(x=test1.flatten())
      [8](vscode-notebook-cell:?execution_count=4&line=8) plt.show()

File [~/CoccoData/coccodata/regression.py:216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216), in regression_simulation.simulate_predictions(self, X_predict, params, conditional, n_replicates, boot)
    [209](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:209)         warnings.warn(
    [210](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:210)             "location parameters (`params`)  provided, but unconditional distribution was requested. Ignoring the provided location parameters...",
    [211](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:211)             stacklevel=2,
    [212](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:212)         )
    [213](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:213)     params = self.simulate_regression_params(
    [214](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:214)         n_replicates=n_replicates, boot=boot
    [215](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:215)     )
--> [216](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:216)     mus = X_predict @ params
    [217](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:217)     simulated_data = np.column_stack(
    [218](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:218)         [self.results.model.family.predict(mu_vec) for mu_vec in mus.T]
    [219](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:219)     )
    [220](https://file+.vscode-resource.vscode-cdn.net/home/phyto/CoccoData/coccodata/notebooks/~/CoccoData/coccodata/regression.py:220) return simulated_data

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 10 is different from 1)

Sorry, I think that the input X_predict needs to be a column vector (numpy.asarray(X_predict).reshape(-1,1)).

No worries, that is a quick fix. Thanks!

nanophyto commented 9 months ago

@ljwolf Another bug /quick question: With the one-hot-encoding, the log transformation is resulting in some NaNs. A quick fix would be to say x = np.log(x+1). Is there any reason I shouldn't do this?

ValueError                                Traceback (most recent call last)
Cell In[8], line 12
      9 X_train = d[['Volume', 'Phase_HET', 'Phase_HOL']].astype(float)
     11 X_predict = [100, 99, 98, 101, 102, 100]*1000
---> 12 m = regression_simulation(X_train, X_predict, Y_train)
     13 # test1 = m.simulate_data()
     14 # sns.histplot(x=test1)
     15 # plt.show()
     16 m.return_performance()

File ~/CoccoData/coccodata/regression.py:53, in regression_simulation.__init__(self, X_train, X_predict, Y_train)
     49 def __init__(self, X_train, X_predict, Y_train):
     50     # won't this break ordering between Y and x?
     51     # see @ljwolf's implementation later, in the
     52     # regression coefficient example
---> 53     allometric_model = sm.GLM(
     54         Y_train,
     55         np.log(X_train),
     56         family=sm.families.Gamma(link=sm.families.links.Log()),
     57     )
     58     # self.model = sm.GLM(Y_train, X_train, family=sm.families.Gaussian()) # sm.OLS(Y, X)
     59     # print(X_train)
     61     self._fit_function = lambda y, x: (
     62         sm.GLM(
...
-> 1654 s = gufunc(a, signature=signature, extobj=extobj)
   1655 s = s.astype(_realType(result_t), copy=False)
   1656 return s

ValueError: On entry to DLASCL parameter number 5 had an illegal value
ljwolf commented 9 months ago

No, should be ok. There are differing opinions on how to do this, but we’re mainly interested in the predictions, so these debates are not as relevant.


ljw

From: Joost de Vries @.> Date: Monday, 29 January 2024 at 14:55 To: nanophyto/CoccoData @.> Cc: Levi John Wolf @.>, Mention @.> Subject: Re: [nanophyto/CoccoData] swap to conditional simulations (PR #2)

@ljwolfhttps://github.com/ljwolf Another bug /quick question: With the one-hot-encoding, the log transformation is resulting in some NaNs. A quick fix would be to say x = np.log(x+1). Is there any reason I shouldn't do this?

ValueError Traceback (most recent call last)

Cell In[8], line 12

  9 X_train = d[['Volume', 'Phase_HET', 'Phase_HOL']].astype(float)

 11 X_predict = [100, 99, 98, 101, 102, 100]*1000

---> 12 m = regression_simulation(X_train, X_predict, Y_train)

 13 # test1 = m.simulate_data()

 14 # sns.histplot(x=test1)

 15 # plt.show()

 16 m.return_performance()

File ~/CoccoData/coccodata/regression.py:53, in regression_simulation.init(self, X_train, X_predict, Y_train)

 49 def __init__(self, X_train, X_predict, Y_train):

 50     # won't this break ordering between Y and x?

 51     # see @ljwolf's implementation later, in the

 52     # regression coefficient example

---> 53 allometric_model = sm.GLM(

 54         Y_train,

 55         np.log(X_train),

 56         family=sm.families.Gamma(link=sm.families.links.Log()),

 57     )

 58     # self.model = sm.GLM(Y_train, X_train, family=sm.families.Gaussian()) # sm.OLS(Y, X)

 59     # print(X_train)

 61     self._fit_function = lambda y, x: (

 62         sm.GLM(

...

-> 1654 s = gufunc(a, signature=signature, extobj=extobj)

1655 s = s.astype(_realType(result_t), copy=False)

1656 return s

ValueError: On entry to DLASCL parameter number 5 had an illegal value

— Reply to this email directly, view it on GitHubhttps://github.com/nanophyto/CoccoData/pull/2#issuecomment-1914862045, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AARFR44SGGUHDY4KBV43IELYQ6Z6NAVCNFSM6AAAAABCLCG7OCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMJUHA3DEMBUGU. You are receiving this because you were mentioned.Message ID: @.***>

nanophyto commented 9 months ago

@ljwolf Another question... When I run the model with one-hot-encoding it starts predicting negative values (both when bootstrap=False and when bootstrap=True).

d = pd.read_csv("/home/phyto/CoccoData/data/unprocessed/rosie_size_pic.csv")
d = d.dropna()
d = d[d['PIC pg C'] >0]
d = d[d['Volume'] >0]
Y_train = d['PIC pg C']

d = pd.get_dummies(d, columns=['Phase'], dtype=float)
X_train = d[['Volume', 'Phase_HET', 'Phase_HOL']].astype(float)

X_predict = pd.DataFrame({'Volume': np.random.gamma(8, 2, 1000), 
                    'Phase_HET':[0]*1000, 
                    'Phase_HOL':[1]*1000})

m = regression_simulation(X_train, X_predict, Y_train)

c_simulation = m.simulate_predictions(boot=True, n_replicates=1)
np.min(c_simulation)

>>> -36.04365338911715
nanophyto commented 9 months ago

@ljwolf there also seems to be an issue with the model hugely underpredicting Y_pred (i.e. values of 6 when values of 200 are expected) image It seems that X_predict is not log transformed which might be the issue?