simulate_glm not imported

glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models

http://glm-tools.github.io/pyglmnet/

MIT License

279 stars 83 forks source link

simulate_glm not imported #298

Closed notuntoward closed 4 years ago

notuntoward commented 5 years ago

The example code in the main README fails on the 3rd line:

ImportError: cannot import name 'simulate_glm' from 'pyglmnet'

This is after installing pyglmnet with pip.

notuntoward commented 5 years ago

This looks like a doc bug; the README seems to be a leftover from an old pyglmnet version in which simulate() wasn't a GLM object method. I changed the example code to call the simulate() method and it seems to work.

Is other documentation up to date enough that I could try pyglmnet on some experiments?

jasmainak commented 5 years ago

Actually this is a mismatch between the documentation of the stable version and the latest release. Sorry for this issue. We are going to make a release soon so the README should be accurate for the next release. Note that simulate_glm is the new way of doing things rather than a method.

notuntoward commented 5 years ago

Thanks for the quick reply. Should I wait for your new release or is what's on pip already usable?

jasmainak commented 5 years ago

I think you can already use the development version. It has some bug fixes too.

Just do:

$ pip install https://api.github.com/repos/glm-tools/pyglmnet/zipball/master

notuntoward commented 5 years ago

I used the new pip command to install pyglmnet-1.0.1.dev0 but the script in the README still fails:

File "C:\Users\Scott\Anaconda3\lib\site-packages\pyglmnet\pyglmnet.py", line 718, in fit
raise ValueError(msg)
ValueError: X must be a 2D ndarray, and y must be 1D

This is because the README script created a 1D numpy ndarray like so:

y_train = simulate_glm("poisson", beta0, beta, X_train)

but glm.fit() tests for a one dimensional y like this:

if X.ndim != 2 or y.ndim != 1:

The check doesn't work because:

y_train.shape ==  (1000, 1)
y_train.ndim == 2

To me, a (1000,1) array should be considered 1D. But if the rest of pyglmnet consistently assumes that 1D arrays always have shape == (n_samples, ), then I guess that's what simulate_glm() should return.

The simplest way is to force the shapes of simulate_glm()'s return values. For example:

return np.squeeze(_lmb(distr, beta0, beta, X, eta))

and

return np.squeeze(y)

jasmainak commented 5 years ago

sorry for the slow response.

I would not use squeeze because it's dangerous. You would not be able to distinguish between shape of (1, 1000) and (1000, 1). However, I'm okay with a more careful check like the one in scikit-learn. Would you like to take a stab at this? Perhaps @tommyod can also help since he introduced these checks.

tommyod commented 5 years ago

I'll take a look at it.

tommyod commented 5 years ago

We could just assume 1D vector means (n, ) and never (n, 1). But sklearn typically wants (n, 1) and not (n, ) - do we really want to go against their API here? As a further comment on this, I don't see any disadvantage of getting closer to the sklearn API, adding it as a dependency and using for instance their error checking capabilities "for free". I'm all for minimizing dependencies in general, but a person doing GLM is likely testing other models and has sklearn installed. Having a unified API would also make their life easier.
Why the long release cycle in general? Most people (quite naturally) assume that an example in the README is up to date with respect to pip. I would recommend a shorter release cycle to avoid issues like these.
It's problematic that examples are not run on Travis. I'll add a PR on this.

Thoughts on points 1) and 2) above @jasmainak ?

jasmainak commented 5 years ago

do we really want to go against their API here?

nopes, let's try to be sklearn-like as much as possible. It will help with adoption.

I'm all for minimizing dependencies in general, but a person doing GLM is likely testing other models and has sklearn installed

yes, I guess we could do this. But there isn't a strong coupling between the two projects at the moment to warrant this. Plus sklearn's list of dependencies are growing by the day. To be clear, I am not against this, just not sure if there is enough evidence that we need this at the moment.

Why the long release cycle in general?

No reason and we should make quicker releases. Any help in achieving this would be welcome :) We changed quite a bit the API after 1.0 to make it more sklearn-like which took time. Other than that, I agree shorter release cycles would be great!