stan-dev / cmdstanpy

CmdStanPy is a lightweight interface to Stan for Python users which provides the necessary objects and functions to compile a Stan program and fit the model to data using CmdStan.
BSD 3-Clause "New" or "Revised" License

Spaces in directory path cause errors in sampling #167

Closed wesbarnett closed 4 years ago

wesbarnett commented 4 years ago

Summary:

I was at the tutorial session at PyData NYC on Wednesday afternoon. I'm getting an error running some of the tutorial code on my Mac (but can run it fine on Linux).

Description:

Here is the python code, stan code, and data that are causing the error:

import os
from cmdstanpy import CmdStanModel
import pandas as pd

season_1975 = pd.read_csv('efron-morris-75-data.csv')
data_dict = {'N': season_1975.shape[0], 'y' : season_1975['Hits'].tolist(), 'K' : season_1975['At-Bats'].tolist()}

model_complete_pool = CmdStanModel(stan_file='simple_pool.stan')
model_complete_pool.compile()
complete_pool_fit = model_complete_pool.sample(data=data_dict)
complete_pool_fit.summary().round(decimals=2)

efron-morris-75-data.csv:

FirstName,LastName,At-Bats,Hits,BattingAverage,RemainingAt-Bats,RemainingAverage,SeasonAt-Bats,SeasonHits,SeasonAverage
Roberto,Clemente,45,18,0.4,367,0.346,412,145,0.352
Frank,Robinson,45,17,0.378,426,0.2981,471,144,0.306
Frank,Howard,45,16,0.356,521,0.2764,566,160,0.283
Jay,Johnstone,45,15,0.333,275,0.2218,320,76,0.238
Ken,Berry,45,14,0.311,418,0.2727,463,128,0.276
Jim,Spencer,45,14,0.311,466,0.2704,511,140,0.274
Don,Kessinger,45,13,0.289,586,0.2645,631,168,0.266
Luis,Alvarado,45,12,0.267,138,0.2101,183,41,0.224
Ron,Santo,45,11,0.244,510,0.2686,555,148,0.267
Ron,Swaboda,45,11,0.244,200,0.23,245,57,0.233
Rico,Petrocelli,45,10,0.222,538,0.2639,583,152,0.261
Ellie,Rodriguez,45,10,0.222,186,0.2258,231,52,0.225
George,Scott,45,10,0.222,435,0.3034,480,142,0.296
Del,Unser,45,10,0.222,277,0.2635,322,83,0.258
Billy,Williams,45,10,0.222,591,0.3299,636,205,0.251
Bert,Campaneris,45,9,0.2,558,0.2849,603,168,0.279
Thurman,Munson,45,8,0.178,408,0.3162,453,137,0.302
Max,Alvis,45,7,0.156,70,0.2,115,21,0.183

simple_pool.stan:

data {
  int<lower=0> N;           // items
  int<lower=0> K[N];        // initial trials
  int<lower=0> y[N];        // initial successes
}
parameters {
  real<lower=0, upper=1> phi;  // chance of success (pooled)
}
model {
  y ~ binomial(K, phi);  // likelihood
}

The error I am getting is the following:

INFO:cmdstanpy:stan to c++ (/var/folders/yl/v6d742w56rjdcr406hdxkfpr0000gn/T/tmpuy6wrr9z/tmpwf4d7zmm.hpp)
INFO:cmdstanpy:compiling c++
INFO:cmdstanpy:compiled model file: simple_pool
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    complete_pool_fit = model_complete_pool.sample(data=data_dict)
  File "/Users/<user>/Documents/PyData NYC 2019/bayesian_inference/venv/lib/python3.7/site-packages/cmdstanpy/model.py", line 615, in sample
    raise RuntimeError(msg)
RuntimeError: Error during sampling, chain 0 returned error code -1, chain 1 returned error code -1, chain 2 returned error code -1, chain 3 returned error code -1
deleting tmpfiles dir: /var/folders/yl/v6d742w56rjdcr406hdxkfpr0000gn/T/tmpjwvcen19
done

Additional Information:

This error only occurs on macOS. It does not seem to be a problem on my Linux machine. I've tried reinstalling both cmdstan and cmdstanpy.

Current Version:

cmdstanpy: 0.6.0
cmdstan: 2.21.0

mitzimorris commented 4 years ago

did you upgrade your Mac to Catalina?

wesbarnett commented 4 years ago

No, I'm on Mojave (10.14.5). (Thanks for the quick response!)

mitzimorris commented 4 years ago

complete_pool_fit = model_complete_pool.sample(data=data_dict)

I think what's going on is that this model has a really hard time fitting, given the parameterization. Each sampler chain has its own random seed and initializes the parameter values randomly between -2 and 2 in order to get warmup off the ground. A bad combination of random values will probably kill it. So, two things to try:

a) first, run the sample command with csv_basename= set to a pathname:

complete_pool_fit = model_complete_pool.sample(data=data_dict, csv_basename='./foo')

This will create both csv files and txt files; the latter should contain interesting error messages. Also, any thoughts on https://github.com/stan-dev/cmdstanpy/issues/133 would be appreciated.

b) run the sample command setting adapt_delta at 0.9 or higher:

complete_pool_fit = model_complete_pool.sample(data=data_dict, adapt_delta='0.99', csv_basename='./foo')

CmdStanPy needs to do a better job when the sampler process throws an error - cf https://github.com/stan-dev/cmdstanpy/issues/141
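For context on the "between -2 and 2" remark: Stan draws default initial values on the unconstrained scale, and a parameter declared with both a lower and an upper bound is mapped back through an inverse logit. The sketch below works that out for phi, which is bounded to (0, 1) in simple_pool.stan. The helper name inv_logit is only illustrative; this is plain arithmetic, not a CmdStanPy call.

```python
import math


def inv_logit(u):
    """Map an unconstrained value u to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


# Default initial values are drawn uniformly from (-2, 2) on the
# unconstrained scale, so phi starts somewhere between these two bounds.
phi_low = inv_logit(-2.0)
phi_high = inv_logit(2.0)

print(round(phi_low, 4), round(phi_high, 4))
```

So for this model the starting values of phi fall roughly between 0.12 and 0.88, which is a fairly benign range for data with success rates around 0.15 to 0.40.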

wesbarnett commented 4 years ago

Unfortunately, adding the csv output filename doesn't produce anything. It looks like it may be crashing before anything is written to the files. I also added adapt_delta and still get the same behavior, again with no output file (as an aside, I had to change the parameter from a string to a float).

In contrast, I also ran the command from cmdstan outside of python and it seems to succeed (output not shown here):

./simple_pool data file=efron-morris-75-data.json sample

(I converted the csv to a json file):

{"N": 18, "y": [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7], "K": [45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45]}

mitzimorris commented 4 years ago

Doh! Sorry, so none of the models work (simple_pool, simple_no_pool). OK, and if you run from CmdStanPy and specify the json file as input instead of a dict, does that work?

complete_pool_fit = model_complete_pool.sample(data='efron-morris-75-data.json')

wesbarnett commented 4 years ago

Specifying the json file also does not work. None of the examples from the workshop work (in the Jupyter notebooks both for the women's soccer and baseball examples), but the basic example from the documentation does work:

import os
from cmdstanpy import CmdStanModel, cmdstan_path

bernoulli_path = os.path.join(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')
bernoulli_model = CmdStanModel(stan_file=bernoulli_path)
bernoulli_model.compile()

bernoulli_data = { "N" : 10, "y" : [0,1,0,0,0,0,0,0,0,1] }
bernoulli_fit = bernoulli_model.sample(chains=5, cores=3, data=bernoulli_data)

bernoulli_fit.summary()

mitzimorris commented 4 years ago

what does the model object report about itself? please try this:

model_complete_pool = CmdStanModel(stan_file='simple_pool.stan')
model_complete_pool.compile()
print(model_complete_pool)

mitzimorris commented 4 years ago

BTW, thank you so much for your patience - you are a much valued beta tester!

mitzimorris commented 4 years ago

OK, recreated the bug on my machine. The cause is a space in the directory name:

>>> import os
>>> here = os.path.dirname(os.path.abspath('.'))
>>> here
'/Users/mitzi/Test Space'

wesbarnett commented 4 years ago

Oh wow, good catch! I can confirm, that was the case. After renaming my directory with spaces, it now runs. I would be willing to work on a fix. Would love to try to contribute.

ahartikainen commented 4 years ago

@mitzimorris has something changed in Maybe... class?

Main problem is probably the bug that we create a cmd string and split it with str.split(). (Not a problem in Windows where we use shortpaths).
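A minimal sketch of why str.split() fails here. The two builder functions and the /Users/me paths are hypothetical illustrations, not CmdStanPy's actual code; the argument order follows the cmdstan command shown earlier in the thread.

```python
def build_cmd_naive(exe_path, data_path):
    """Format one command string, then split it on whitespace."""
    cmd = "{} data file={} sample".format(exe_path, data_path)
    return cmd.split()


def build_cmd_list(exe_path, data_path):
    """Build the argument list directly, one list element per argument."""
    return [exe_path, "data", "file={}".format(data_path), "sample"]


# Hypothetical paths containing spaces, like the directory in the traceback.
exe = "/Users/me/PyData NYC 2019/simple_pool"
data = "/Users/me/PyData NYC 2019/efron-morris-75-data.json"

naive_args = build_cmd_naive(exe, data)
list_args = build_cmd_list(exe, data)

print(naive_args[0])  # the executable path is cut at the first space
print(list_args[0])   # the executable path survives intact
```

With the naive version the executable path is broken into three pieces, so the process cannot be started at all, which would be consistent with every chain failing before any output file is written. Passing a list built this way to subprocess keeps each path as a single argument.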

mitzimorris commented 4 years ago

Main problem is probably the bug that we create a cmd string and split it with str.split().

exactly! @ahartikainen spotted this a while back - https://github.com/stan-dev/cmdstanpy/issues/91 @wesbarnett - good first issue?

wesbarnett commented 4 years ago

Sure, I'll tackle #91.

wesbarnett commented 4 years ago

This line is actually also an issue. I believe it needs to include the entire path.