read_models throws error

asifzubair commented 9 years ago

I ran 8 models and tried to load the output with the read_models method, but got an error. The models were random forest, LDA, Gaussian naive Bayes, and logistic regression. Perhaps the error occurs because the models have different parameters.

The truncated stack trace is below.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-a1b8bc97b6b3> in <module>()
----> 1 models = explore.read_models('modelling/output_models_ELiNbLd')

/home/azubair/drain/explore.pyc in read_models(dirname, estimator)
     45 def read_models(dirname, estimator=True):
     46     df = pd.concat((read_model(subdir, estimator) for subdir in get_subdirs(dirname)), ignore_index=True)
---> 47     calculate_metrics(df)
     48 
     49     return df
...
...
...
/opt/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in extract_index(data)
   4844             lengths = list(set(raw_lengths))
   4845             if len(lengths) > 1:
-> 4846                 raise ValueError('arrays must all be same length')
   4847 
   4848             if have_dicts:

ValueError: arrays must all be same length

potash commented 9 years ago

Please include the full trace. I'm guessing that naive Bayes or LDA has a slightly different interface than the other models.

asifzubair commented 9 years ago

Sure, here's the full trace.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-a1b8bc97b6b3> in <module>()
----> 1 models = explore.read_models('modelling/output_models_ELiNbLd')

/home/azubair/drain/explore.pyc in read_models(dirname, estimator)
     45 def read_models(dirname, estimator=True):
     46     df = pd.concat((read_model(subdir, estimator) for subdir in get_subdirs(dirname)), ignore_index=True)
---> 47     calculate_metrics(df)
     48 
     49     return df

/home/azubair/drain/explore.pyc in calculate_metrics(df)
     57     df['baseline']=df.y.apply(lambda y: y.true.sum()*1.0/len(y.true))
     58 
---> 59     df['coef'] = [get_coef(row) for i,row in df.iterrows()]
     60 
     61     return df

/home/azubair/drain/explore.pyc in get_coef(row)
     63 def get_coef(row):
     64     if hasattr(row['estimator'], 'coef_'):
---> 65         return pd.DataFrame({'name':row['columns'], 'c':row['estimator'].coef_[0]}).sort('c')
     66     else:
     67         return pd.DataFrame()

/opt/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    212                                  dtype=dtype, copy=copy)
    213         elif isinstance(data, dict):
--> 214             mgr = self._init_dict(data, index, columns, dtype=dtype)
    215         elif isinstance(data, ma.MaskedArray):
    216             import numpy.ma.mrecords as mrecords

/opt/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _init_dict(self, data, index, columns, dtype)
    339 
    340         return _arrays_to_mgr(arrays, data_names, index, columns,
--> 341                               dtype=dtype)
    342 
    343     def _init_ndarray(self, values, index, columns, dtype=None,

/opt/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   4796     # figure out the index, if necessary
   4797     if index is None:
-> 4798         index = extract_index(arrays)
   4799     else:
   4800         index = _ensure_index(index)

/opt/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in extract_index(data)
   4844             lengths = list(set(raw_lengths))
   4845             if len(lengths) > 1:
-> 4846                 raise ValueError('arrays must all be same length')
   4847 
   4848             if have_dicts:

ValueError: arrays must all be same length
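
For reference, the last frames of the trace correspond to building a DataFrame from a dict whose values have different lengths. The snippet below is a minimal sketch of that failure mode with made-up data. The names `names` and `coefs` are illustrative only and do not come from drain, and the exact error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three column names but only two coefficients.
names = ['feature_a', 'feature_b', 'feature_c']
coefs = np.array([0.5, -1.2])

raised = False
try:
    # Same pattern as line 65 of explore.py in the trace: a dict of
    # array-likes of unequal length passed to the DataFrame constructor.
    pd.DataFrame({'name': names, 'c': coefs})
except ValueError as err:
    raised = True
    print('ValueError:', err)
```

This only shows how the exception arises. It does not establish which of the models in the run produced the mismatched lengths.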

potash commented 9 years ago

Yeah, there must be an inconsistency in the sklearn coef_ attributes between some of the models. I'm at the airport, but as a temporary workaround, comment out line 59 of explore.py (the call to get_coef).
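
As an alternative to commenting the call out, a more defensive get_coef could skip any estimator whose coefficients do not line up with the column names. What follows is only a sketch under assumptions, not the fix that was applied in drain: the function name `get_coef_safe` and its two-argument signature are hypothetical, the root cause of the mismatch has not been confirmed in this thread, and `sort_values` is used in place of the older `sort` call seen in the trace, so it needs a pandas version that provides it.

```python
import numpy as np
import pandas as pd


def get_coef_safe(estimator, columns):
    """Hypothetical defensive variant of get_coef from the trace.

    Returns a DataFrame of (name, c) pairs sorted by coefficient, or an
    empty DataFrame when the estimator has no coef_ attribute or when the
    coefficient count does not match the number of columns.
    """
    coef = getattr(estimator, 'coef_', None)
    if coef is None:
        return pd.DataFrame()

    # coef_ may be 1-D or 2-D depending on the estimator; normalise to
    # 2-D and take the first row, mirroring coef_[0] in the original code.
    first_row = np.atleast_2d(np.asarray(coef))[0]

    # Skip estimators whose coefficients do not match the column names
    # instead of letting the DataFrame constructor raise.
    if len(first_row) != len(columns):
        return pd.DataFrame()

    return pd.DataFrame({'name': list(columns), 'c': first_row}).sort_values('c')
```

In calculate_metrics this could be called as `get_coef_safe(row['estimator'], row['columns'])` for each row. Returning an empty DataFrame silently hides the mismatch, so logging the offending estimator before returning would probably help track down the real cause.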

potash / drain

read_models throws error #4