kr-colab / FILET

Software for detecting introgression using supervised machine learning
GNU General Public License v3.0

Error on training with example dataset #2

Closed stsmall closed 6 years ago

stsmall commented 6 years ago

Hi @andrewkern, @dschride, I followed the example that ships with FILET, but I am running into an error during the training step: trainFiletClassifier.py stops with the traceback shown below.

Any help or suggestions are greatly appreciated! Thanks, @stsmall

python 2.7 (anaconda version), scipy v1.0.1, numpy v1.13.3, sklearn v0.19.2

anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
training set size after balancing: 29940
Checking accuracy when distinguishing among all 3 classes
Using extraTreesClassifier
Traceback (most recent call last):
  File "trainFiletClassifier.py", line 81, in <module>
    grid_search.fit(X, y)
  File "anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py", line 838, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py", line 574, in _fit
    for parameters in parameter_iterable
  File "anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 740, in retrieve
    raise exception
sklearn.externals.joblib.my_exceptions.JoblibValueError: JoblibValueError


Multiprocessing exception:
...........................................................................
FILET/trainFiletClassifier.py in <module>()
     76 clf, mlType, paramGrid = ExtraTreesClassifier(n_estimators=100, random_state=0), "extraTreesClassifier", param_grid_forest
     77
     78 sys.stderr.write("Using %s\n" %(mlType))
     79 grid_search = GridSearchCV(clf,param_grid=param_grid_forest,cv=10,n_jobs=10)
     80 start = time()
---> 81 grid_search.fit(X, y)
     82 sys.stderr.write("GridSearchCV took %.2f seconds for %d candidate parameter settings.\n"
     83       % (time() - start, len(grid_search.grid_scores_)))
     84 print "Results for %s" %(mlType)
     85 report(grid_search.grid_scores_)

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py in fit(self=GridSearchCV(cv=10, error_score='raise', ...='2n_jobs', refit=True, scoring=None, verbose=0), X=array([[ 6.53900000e-03, 1.00000000e-06, 3....543860e+02, 1.00000000e+00, 1.00000000e+00]]), y=['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...]) 833 y : array-like, shape = [n_samples] or [n_samples, n_output], optional 834 Target relative to X for classification or regression; 835 None for unsupervised learning. 836 837 """ --> 838 return self._fit(X, y, ParameterGrid(self.param_grid)) self._fit = <bound method GridSearchCV._fit of GridSearchCV(...'2n_jobs', refit=True, scoring=None, verbose=0)> X = array([[ 6.53900000e-03, 1.00000000e-06, 3....543860e+02, 1.00000000e+00, 1.00000000e+00]]) y = ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...] self.param_grid = {'bootstrap': [True, False], 'criterion': ['gini', 'entropy'], 'max_depth': [3, 10, None], 'max_features': [1, 3, 4, 22], 'min_samples_leaf': [1, 3, 10], 'min_samples_split': [1, 3, 10]} 839 840 841 class RandomizedSearchCV(BaseSearchCV): 842 """Randomized search on hyper parameters.

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py in _fit(self=GridSearchCV(cv=10, error_score='raise', ...='2*n_jobs', refit=True, scoring=None, verbose=0), X=array([[ 6.53900000e-03, 1.00000000e-06, 3....543860e+02, 1.00000000e+00, 1.00000000e+00]]), y=['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], parameter_iterable=) 569 )( 570 delayed(_fit_and_score)(clone(baseestimator), X, y, self.scorer, 571 train, test, self.verbose, parameters, 572 self.fit_params, return_parameters=True, 573 error_score=self.error_score) --> 574 for parameters in parameter_iterable parameters = undefined parameter_iterable = 575 for train, test in cv) 576 577 # Out is a list of triplet: score, estimator, n_test_samples 578 n_fits = len(out)

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in call(self=Parallel(n_jobs=10), iterable=<generator object >) 784 if pre_dispatch == "all" or n_jobs == 1: 785 # The iterable was consumed all at once by the above for loop. 786 # No need to wait for async callbacks to trigger to 787 # consumption. 788 self._iterating = False --> 789 self.retrieve() self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=10)> 790 # Make sure that we get a last message telling us we are done 791 elapsed_time = time.time() - self._start_time 792 self._print('Done %3i out of %3i | elapsed: %s finished', 793 (len(self._output), len(self._output),


Sub-process traceback:

ValueError Thu Aug 23 12:04:10 2018 PID: 51133Python 2.7.14: anaconda2/bin/python ........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in call(self=) 126 def init(self, iterator_slice): 127 self.items = list(iterator_slice) 128 self._size = len(self.items) 129 130 def call(self): --> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items] func = args = (ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), memmap([[ 6.53900000e-03, 1.00000000e-06, 3...543860e+02, 1.00000000e+00, 1.00000000e+00]]), ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], , array([ 998, 999, 1000, ..., 29937, 29938, 29939]), array([ 0, 1, 2, ..., 20955, 20956, 20957]), 0, {'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 1}, {}) kwargs = {'error_score': 'raise', 'return_parameters': True} self.items = [(, (ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), memmap([[ 6.53900000e-03, 1.00000000e-06, 3...543860e+02, 1.00000000e+00, 1.00000000e+00]]), ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], , array([ 998, 999, 1000, ..., 29937, 29938, 29939]), array([ 0, 1, 2, ..., 20955, 20956, 20957]), 0, {'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 1}, {}), {'error_score': 'raise', 'return_parameters': True})] 132 133 def len(self): 134 return self._size 135

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator=ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), X=memmap([[ 6.53900000e-03, 1.00000000e-06, 3...543860e+02, 1.00000000e+00, 1.00000000e+00]]), y=['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], scorer=, train=array([ 998, 999, 1000, ..., 29937, 29938, 29939]), test=array([ 0, 1, 2, ..., 20955, 20956, 20957]), verbose=0, parameters={'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 1}, fit_params={}, return_train_score=False, return_parameters=True, error_score='raise') 1670 1671 try: 1672 if y_train is None: 1673 estimator.fit(X_train, fit_params) 1674 else: -> 1675 estimator.fit(X_train, y_train, fit_params) estimator.fit = <bound method ExtraTreesClassifier.fit of ExtraT...se, random_state=0, verbose=0, warm_start=False)> X_train = memmap([[ 1.05620000e-02, 1.00000000e-06, 5...543860e+02, 1.00000000e+00, 1.00000000e+00]]) y_train = ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...] fit_params = {} 1676 1677 except Exception as e: 1678 if error_score == 'raise': 1679 raise

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 2.], [ 2.], [ 2.]]), sample_weight=None) 323 trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, 324 backend="threading")( 325 delayed(_parallel_build_trees)( 326 t, self, X, y, sample_weight, i, len(trees), 327 verbose=self.verbose, class_weight=self.classweight) --> 328 for i, t in enumerate(trees)) i = 99 329 330 # Collect newly grown trees 331 self.estimators.extend(trees) 332

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in call(self=Parallel(n_jobs=1), iterable=<generator object >) 774 self.n_completed_tasks = 0 775 try: 776 # Only set self._iterating to True if at least a batch 777 # was dispatched. In particular this covers the edge 778 # case of Parallel used with an exhausted iterator. --> 779 while self.dispatch_one_batch(iterator): self.dispatch_one_batch = <bound method Parallel.dispatch_one_batch of Parallel(n_jobs=1)> iterator = <generator object > 780 self._iterating = True 781 else: 782 self._iterating = False 783

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self=Parallel(n_jobs=1), iterator=<generator object >) 620 tasks = BatchedCalls(itertools.islice(iterator, batch_size)) 621 if len(tasks) == 0: 622 # No more tasks available in the iterator: tell caller to stop. 623 return False 624 else: --> 625 self._dispatch(tasks) self._dispatch = <bound method Parallel._dispatch of Parallel(n_jobs=1)> tasks = 626 return True 627 628 def _print(self, msg, msg_args): 629 """Display the message on stout or stderr depending on verbosity"""

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self=Parallel(n_jobs=1), batch=) 583 self.n_dispatched_tasks += len(batch) 584 self.n_dispatched_batches += 1 585 586 dispatch_timestamp = time.time() 587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self) --> 588 job = self._backend.apply_async(batch, callback=cb) job = undefined self._backend.apply_async = <bound method SequentialBackend.apply_async of <...lib._parallel_backends.SequentialBackend object>> batch = cb = 589 self._jobs.append(job) 590 591 def dispatch_next(self): 592 """Dispatch more data for parallel processing

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self=, func=, callback=) 106 raise ValueError('n_jobs == 0 in Parallel has no meaning') 107 return 1 108 109 def apply_async(self, func, callback=None): 110 """Schedule a func to be run""" --> 111 result = ImmediateResult(func) result = undefined func = 112 if callback: 113 callback(result) 114 return result 115

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in init(self=, batch=) 327 328 class ImmediateResult(object): 329 def init(self, batch): 330 # Don't delay the application, to avoid keeping the input 331 # arguments in memory --> 332 self.results = batch() self.results = undefined batch = 333 334 def get(self): 335 return self.results 336

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in call(self=) 126 def init(self, iterator_slice): 127 self.items = list(iterator_slice) 128 self._size = len(self.items) 129 130 def call(self): --> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items] func = args = (ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396, splitter='random'), ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), array([[ 0.], [ 0.], [ 0.], ..., [ 2.], [ 2.], [ 2.]]), None, 0, 100) kwargs = {'class_weight': None, 'verbose': 0} self.items = [(, (ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396, splitter='random'), ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), array([[ 0.], [ 0.], [ 0.], ..., [ 2.], [ 2.], [ 2.]]), None, 0, 100), {'class_weight': None, 'verbose': 0})] 132 133 def len(self): 134 return self._size 135

........................................................................... anaconda2/lib/python2.7/site-packages/sklearn/ensemble/forest.py in _parallel_build_trees(tree=ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396, splitter='random'), forest=ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 2.], [ 2.], [ 2.]]), sample_weight=None, tree_idx=0, n_trees=100, verbose=0, class_weight=None) 116 warnings.simplefilter('ignore', DeprecationWarning) 117 curr_sample_weight = compute_sample_weight('auto', y, indices) 118 elif class_weight == 'balanced_subsample': 119 curr_sample_weight = compute_sample_weight('balanced', y, indices) 120 --> 121 tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False) tree.fit = <bound method ExtraTreeClassifier.fit of ExtraTr...om_state=209652396, splitter='random')> X = array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32) y = array([[ 0.], [ 0.], [ 0.], ..., [ 2.], [ 2.], [ 2.]]) sample_weight = None curr_sample_weight = array([ 0., 0., 1., ..., 0., 1., 0.]) 122 else: 123 tree.fit(X, y, sample_weight=sample_weight, check_input=False) 124 125 return tree

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py in fit(self=ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396, splitter='random'), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 2.], [ 2.], [ 2.]]), sample_weight=array([ 0., 0., 1., ..., 0., 1., 0.]), check_input=False, X_idx_sorted=None)
    785
    786         super(DecisionTreeClassifier, self).fit(
    787             X, y,
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
        X_idx_sorted = None
    791         return self
    792
    793     def predict_proba(self, X, check_input=True):
    794         """Predict class probabilities of the input samples X.

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py in fit(self=ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396, splitter='random'), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 2.], [ 2.], [ 2.]]), sample_weight=array([ 0., 0., 1., ..., 0., 1., 0.]), check_input=False, X_idx_sorted=None)
    189         if isinstance(self.min_samples_split, (numbers.Integral, np.integer)):
    190             if not 2 <= self.min_samples_split:
    191                 raise ValueError("min_samples_split must be an integer "
    192                                  "greater than 1 or a float in (0.0, 1.0]; "
    193                                  "got the integer %s"
--> 194                                  % self.min_samples_split)
        self.min_samples_split = 1
    195             min_samples_split = self.min_samples_split
    196         else:  # float
    197             if not 0. < self.min_samples_split <= 1.:
    198                 raise ValueError("min_samples_split must be an integer "

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1
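
The final ValueError points at one entry of the parameter grid: the traceback prints `'min_samples_split': [1, 3, 10]`, and the value 1 is what the check in tree.py rejects. The following is a minimal sketch of screening a grid against the rule quoted in the error message (an integer greater than 1, or a float in (0.0, 1.0]). The helper names `is_valid_min_samples_split` and `drop_invalid_min_samples_split` are hypothetical and are not part of FILET or scikit-learn; the grid itself is copied from the traceback.

```python
import numbers


def is_valid_min_samples_split(value):
    """Apply the rule from the error message: int >= 2, or float in (0.0, 1.0]."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat it as invalid here.
        return False
    if isinstance(value, numbers.Integral):
        return value >= 2
    if isinstance(value, numbers.Real):
        return 0.0 < value <= 1.0
    return False


def drop_invalid_min_samples_split(param_grid):
    """Return a copy of the grid with rejected min_samples_split values removed."""
    cleaned = dict(param_grid)
    if "min_samples_split" in cleaned:
        cleaned["min_samples_split"] = [
            v for v in cleaned["min_samples_split"] if is_valid_min_samples_split(v)
        ]
    return cleaned


# Grid as printed in the traceback (self.param_grid).
param_grid_forest = {
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 10, None],
    "max_features": [1, 3, 4, 22],
    "min_samples_leaf": [1, 3, 10],
    "min_samples_split": [1, 3, 10],
}

cleaned_grid = drop_invalid_min_samples_split(param_grid_forest)
print(cleaned_grid["min_samples_split"])  # [3, 10]
```

Dropping the value shrinks the grid; replacing 1 with 2 instead would keep the same number of candidates. Which of the two matches the authors' intent is not stated in this thread.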


andrewkern commented 6 years ago

This looks like a version issue to me; it seems sklearn has now changed the grid search function. Can you roll back your version of sklearn to a point where you are no longer getting the deprecation warning and see if it works? I think that might be v0.18.

stsmall commented 6 years ago

Hi @andrewkern, I downgraded to v0.18.0, and while the deprecation warning is still there, the errors are not. It will take a bit until I know for sure, but it seems like it did the trick! Thanks for the assist!

andrewkern commented 6 years ago

now we just have to update the code to deal with how scikit-learn broke it.... ugh
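
One way such an update might look is sketched below. It is not the change the FILET authors made, only an illustration under stated assumptions: it imports `GridSearchCV` from `sklearn.model_selection` (the module the DeprecationWarning recommends), starts `min_samples_split` at 2 so the check in tree.py passes, and reads results from `cv_results_`, which takes the place of `grid_scores_` in that module. The data come from `make_classification` as a stand-in for the FILET feature vectors, the grid is trimmed to keep the run short, and the code targets Python 3 with a recent scikit-learn rather than the Python 2.7 setup reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the real training set (assumption).
X, y = make_classification(
    n_samples=150,
    n_features=8,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

# Trimmed grid for illustration; note min_samples_split starts at 2, not 1.
param_grid_forest = {
    "max_depth": [3, None],
    "max_features": [1, 3],
    "min_samples_split": [2, 3, 10],
    "bootstrap": [True, False],
}

clf = ExtraTreesClassifier(n_estimators=10, random_state=0)
grid_search = GridSearchCV(clf, param_grid=param_grid_forest, cv=3, n_jobs=1)
grid_search.fit(X, y)

# cv_results_ holds one entry per candidate parameter setting.
n_candidates = len(grid_search.cv_results_["params"])
print("GridSearchCV tried %d candidate parameter settings" % n_candidates)
print("Best parameters: %s" % grid_search.best_params_)
```

Any reporting helper written against the old `grid_scores_` list would also need adapting, since `cv_results_` is a dictionary of arrays rather than a list of per-candidate tuples.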

dschride commented 6 years ago

Asdf
