Thanks for this great summary and code example. I don't have a multi-GPU setup to test this with, and with a single GPU I couldn't reproduce the error. But I have a suspicion that it could be due to caching. Could you please check if turning off caching solves the issue for you? To do that, initialize your net like this:
model = AcceleratedNeuralNetClassifier(
MyModule,
accelerator=accelerator,
callbacks__valid_acc__use_caching=False, # <= added line
)
Of course, if you have more scoring callbacks than the default ones, turn off caching for those too.
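For instance, if you had added an extra average-precision scorer (hypothetical, just to illustrate), it could be constructed with caching disabled like this:

```python
from skorch.callbacks import EpochScoring

model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    # hypothetical extra scorer, constructed with caching turned off
    callbacks=[
        ("valid_ap", EpochScoring("average_precision", lower_is_better=False, use_caching=False)),
    ],
    callbacks__valid_acc__use_caching=False,  # default accuracy scorer, caching off as above
)
```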
If that doesn't help, please test disabling callbacks completely, using:
model = AcceleratedNeuralNetClassifier(
MyModule,
accelerator=accelerator,
callbacks="disable",
)
Please report your findings back.
Thanks a lot for the great tool!
Always happy to hear that :)
Thanks for the quick reply.
Unfortunately it did not solve the issue. However, the error trace changed.
Please find below the various traces (I increased the RandomizedSearchCV verbosity with `verbose=3`).
Turning off caching
model = AcceleratedNeuralNetClassifier(
MyModule,
accelerator=accelerator,
callbacks__valid_acc__use_caching=False, # <= added line
)
Error:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
[CV 1/2] END .............batch_size=30, lr=0.001;, score=nan total time= 4.5s
[CV 2/2] END .............batch_size=30, lr=0.001;, score=nan total time= 0.0s
As you can see, the score is still NaN. At this point execution freezes and the RandomizedSearchCV fit does not terminate. Note that the second fold fit time is 0.0s.
Disabling callbacks
model = AcceleratedNeuralNetClassifier(
MyModule,
accelerator=accelerator,
callbacks="disable", # <= added line
)
Error:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:778: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in call
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_averageprecision
precision, recall, = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in
check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
warnings.warn(
[CV 1/2] END ..............batch_size=20, lr=0.01;, score=nan total time= 5.8s
[CV 2/2] END ..............batch_size=20, lr=0.01;, score=nan total time= 0.0s
As you can see, the score is still NaN. At this point execution freezes and the RandomizedSearchCV fit does not terminate. Note that the second fold fit time is 0.0s.
No change (`verbose=3`)
model = AcceleratedNeuralNetClassifier(
MyModule,
accelerator=accelerator,
)
Error:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 -0.4437 0.8000 -1.6170 0.1552
2 -2.1335 0.8180 -2.1231 0.1443
3 -0.2566 0.5800 2.0079 0.1521
4 1.5909 0.5840 1.9191 0.1305
5 1.5185 0.5880 1.7096 0.1742
6 1.3397 0.5960 1.5191 0.1502
7 1.1890 0.6080 1.3841 0.1607
8 1.0627 0.6080 1.3157 0.1336
9 0.9933 0.6080 1.2165 0.1481
10 0.9050 0.6120 1.1093 0.1277
/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:778: UserWarning:
Scoring failed. The score on this train-test partition for these parameters will be set to nan.
Details:
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
warnings.warn(
[CV 1/2] END ...............batch_size=20, lr=0.1;, score=nan total time= 6.0s
As you can see, the score is NaN but only the first fold completed. At this point execution freezes and the RandomizedSearchCV fit does not terminate.
Thanks a lot in advance for your guidance
Hmm, this does not look good.
Whether the search fails early or works for a while and fails later is probably not related to the specific conditions you posted but is caused by some combination of random hyper-parameters; since RandomizedSearchCV is not seeded, sometimes that combination occurs earlier, sometimes later. And the score being nan could be because the output non-linearity is not correct for a classification task (relu should be softmax). Still, we shouldn't see that ValueError.
I'm sorry that I have to ask you to try a few more things, but as mentioned I cannot replicate this locally:
- Train without the internal validation split; pass train_split=False to the net to do so.
- Skip RandomizedSearchCV and just fit the net directly, initializing it with the different batch sizes you tested. Is it possible to trigger the error consistently with a specific batch size?

Thanks again for the reply.
I implemented your suggestions (reproducibility by setting seeds & Softmax). I also changed the default error_score=nan to error_score="raise" in the RandomizedSearchCV, because I suspected the nan to come from the scoring error.
I also fitted the net without RandomizedSearchCV, using the four batch_size values from the example, without any problem with Accelerate.
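Roughly along these lines (a sketch of those direct fits; same module and wrapper as in the full code below):

```python
# fit the net directly (no RandomizedSearchCV), one run per batch size from the search space
for b_size in [10, 20, 30, 40]:
    accelerator = Accelerator()
    model = AcceleratedNeuralNetClassifier(
        MyModule,
        accelerator=accelerator,
        batch_size=b_size,
    )
    model.fit(X, y)  # trains without errors
```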
Full code:
import torch
import numpy as np
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin
from accelerate import Accelerator
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
import torch.nn as nn
import random

# FYI: Accelerate also requires the `transformers` package from HuggingFace

# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

# Generate data
X, y = make_classification(10_000, 100, n_informative=5, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)

# PyTorch module
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.Softmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X

# Skorch wrapper
class AcceleratedNeuralNetClassifier(
    AccelerateMixin,
    NeuralNetClassifier,
):
    """NeuralNetClassifier with HuggingFace Accelerate support"""

accelerator = Accelerator()
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)

# HPO
rs = RandomizedSearchCV(
    estimator=model,
    param_distributions={
        "lr": [0.0001, 0.001, 0.01, 0.1],
        "batch_size": [10, 20, 30, 40],
    },
    n_iter=10,
    scoring="average_precision",
    n_jobs=1,
    refit=False,
    cv=2,
    verbose=3,
    random_state=SEED,
    error_score="raise",
)
rs.fit(X, y)
print(f"{rs.cv_results_}")
Note that this time there is also an error from torch.distributed, because the execution finishes (instead of freezing).
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.8021 0.5000 0.7604 0.2809
2 0.7660 0.5240 0.7290 0.2804
3 0.7338 0.5560 0.7011 0.2634
4 0.7052 0.5820 0.6765 0.2743
5 0.6798 0.6200 0.6546 0.2696
6 0.6571 0.6400 0.6352 0.2517
7 0.6369 0.6520 0.6178 0.2470
8 0.6187 0.6660 0.6024 0.2482
9 0.6024 0.6880 0.5885 0.2399
10 0.5876 0.6920 0.5760 0.2459
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
rs.fit(X, y)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
self._run_search(evaluate_candidates)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
evaluate_candidates(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
out = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
rs.fit(X, y)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
self._run_search(evaluate_candidates)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
evaluate_candidates(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
out = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 45560) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-28_16:23:17
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 45561)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-28_16:23:17
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 45560)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Disabling the train split (train_split=False)
Same trace:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
epoch train_loss dur
------- ------------ ------
1 0.7932 0.3126
2 0.7503 0.2607
3 0.7133 0.2838
4 0.6814 0.2720
5 0.6537 0.2740
6 0.6296 0.2714
7 0.6086 0.2735
8 0.5902 0.2781
9 0.5740 0.2641
10 0.5597 0.2656
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
rs.fit(X, y)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
self._run_search(evaluate_candidates)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
evaluate_candidates(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
out = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
rs.fit(X, y)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
self._run_search(evaluate_candidates)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
evaluate_candidates(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
out = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 47975) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-28_16:25:23
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 47976)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-28_16:25:23
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 47975)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
- Turning off caching: same trace
- Disabling callbacks: same trace
- All at once (callbacks disabled + caching turned off + train split disabled): same trace
Thanks a lot in advance for your time.
Thanks for your detailed experiments. IIUC, all the conditions work, except for using accelerate together with RandomizedSearchCV (I assume it's the same for GridSearchCV etc.). This narrows down the possibilities, but I still don't see why one would affect the other.
Could you please do some more tests:
# check cross_validate
from sklearn.model_selection import cross_validate

model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)
cross_validate(model, X, y)

# check cloning
from sklearn.base import clone

model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    # also test with different hyper-parameter settings, esp. batch size
)
model_cloned = clone(model)
model_cloned.fit(X, y)

# checking joblib
from joblib import parallel_backend

backend = 'loky'  # also test 'threading' and 'multiprocessing'
with parallel_backend(backend, n_jobs=1):
    model = ...  # check different hyper-parameters
    model.fit(X, y)
Can any of those conditions reproduce the error?
I suspect it could be a weird interaction with joblib. I could ask the accelerate devs if they have ever seen anything like this. To do so, could you please give detailed info about your environment (hardware, OS, versions of all packages, Python, etc.)?
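Something along these lines would be enough for the version info (just a convenience sketch; the output of pip list / conda list works too):

```python
# quick sketch for collecting the relevant versions
import platform

import accelerate
import joblib
import sklearn
import skorch
import torch

print("python    :", platform.python_version())
print("torch     :", torch.__version__, "| CUDA:", torch.version.cuda)
print("gpus      :", torch.cuda.device_count())
print("accelerate:", accelerate.__version__)
print("skorch    :", skorch.__version__)
print("sklearn   :", sklearn.__version__)
print("joblib    :", joblib.__version__)
```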
Thanks for your reply.
IIUC, all the conditions work, except for using accelerate together with RandomizedSearchCV
Indeed.
Could you please do some more tests:
from sklearn.model_selection import cross_validate
model = AcceleratedNeuralNetClassifier(
MyModule,
accelerator=accelerator,
)
cross_validate(
model, X, y,
cv=2, scoring="average_precision", error_score="raise"
)
Interestingly, it reproduces (almost) the same error. I am saying almost because the "inconsistent numbers of samples" are slightly different ([5000, 2560] versus [5000, 2500]).
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.6996 0.6738 0.6160 0.0532
2 0.5616 0.7441 0.5305 0.0325
3 0.4985 0.7637 0.4866 0.0325
4 0.4641 0.7891 0.4606 0.0353
5 0.4428 0.8047 0.4437 0.0403
6 0.4286 0.8105 0.4320 0.0339
7 0.4184 0.8105 0.4234 0.0345
8 0.4108 0.8105 0.4169 0.0342
9 0.4050 0.8203 0.4118 0.0344
10 0.4004 0.8262 0.4078 0.0334
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 68, in <module>
cross_validate(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
results = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 68, in <module>
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
cross_validate(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
results = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2560]
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2560]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 46440) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-29_14:11:45
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 46441)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-29_14:11:45
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 46440)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
from sklearn.base import clone

for b_size in [10, 20, 30, 40]:
    accelerator = Accelerator()
    model = AcceleratedNeuralNetClassifier(
        MyModule,
        accelerator=accelerator,
        batch_size=b_size,
    )
    model_cloned = clone(model)
    model_cloned.fit(X, y)
Training OK.
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4272 0.8470 0.3837 0.5196
2 0.3804 0.8550 0.3810 0.5237
3 0.3778 0.8530 0.3815 0.5245
4 0.3772 0.8510 0.3819 0.5087
5 0.3770 0.8500 0.3821 0.4756
6 0.3769 0.8510 0.3822 0.4767
7 0.3769 0.8510 0.3823 0.5115
8 0.3769 0.8510 0.3823 0.4704
9 0.3768 0.8510 0.3823 0.5193
10 0.3768 0.8510 0.3823 0.4821
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4801 0.8390 0.4085 0.2739
2 0.3975 0.8440 0.3992 0.2672
3 0.3896 0.8400 0.3986 0.2590
4 0.3871 0.8370 0.3991 0.2563
5 0.3862 0.8370 0.3998 0.2627
6 0.3858 0.8380 0.4003 0.2649
7 0.3856 0.8370 0.4007 0.2629
8 0.3855 0.8370 0.4009 0.2542
9 0.3855 0.8370 0.4011 0.2465
10 0.3854 0.8370 0.4013 0.2557
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4989 0.8363 0.4242 0.2241
2 0.3972 0.8451 0.4008 0.1792
3 0.3843 0.8500 0.3941 0.1748
4 0.3799 0.8510 0.3916 0.1724
5 0.3779 0.8539 0.3905 0.1749
6 0.3769 0.8529 0.3900 0.1722
7 0.3764 0.8520 0.3898 0.1840
8 0.3761 0.8510 0.3898 0.1778
9 0.3760 0.8500 0.3899 0.1813
10 0.3759 0.8510 0.3899 0.1816
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.5122 0.8130 0.4207 0.1464
2 0.4112 0.8300 0.3918 0.1458
3 0.3964 0.8370 0.3836 0.1447
4 0.3915 0.8410 0.3804 0.1391
5 0.3893 0.8420 0.3791 0.1361
6 0.3883 0.8400 0.3785 0.1410
7 0.3878 0.8430 0.3783 0.1393
8 0.3875 0.8440 0.3783 0.1383
9 0.3874 0.8430 0.3783 0.1373
10 0.3874 0.8460 0.3784 0.1355
from joblib import parallel_backend

for backend in ['loky', 'threading', 'multiprocessing']:
    print(f"\nUsing backend {backend}")
    with parallel_backend(backend, n_jobs=1):
        for b_size in [10, 20, 30, 40]:
            accelerator = Accelerator()
            model = AcceleratedNeuralNetClassifier(
                MyModule,
                accelerator=accelerator,
                batch_size=b_size,
            )
            model.fit(X, y)
Training OK
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Using backend loky
Using backend loky
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4272 0.8470 0.3837 0.5170
2 0.3804 0.8550 0.3810 0.5128
3 0.3778 0.8530 0.3815 0.5058
4 0.3772 0.8510 0.3819 0.5218
5 0.3770 0.8500 0.3821 0.5017
6 0.3769 0.8510 0.3822 0.4915
7 0.3769 0.8510 0.3823 0.5379
8 0.3769 0.8510 0.3823 0.5746
9 0.3768 0.8510 0.3823 0.5320
10 0.3768 0.8510 0.3823 0.5005
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4801 0.8390 0.4085 0.2769
2 0.3975 0.8440 0.3992 0.2822
3 0.3896 0.8400 0.3986 0.2683
4 0.3871 0.8370 0.3991 0.2673
5 0.3862 0.8370 0.3998 0.2607
6 0.3858 0.8380 0.4003 0.3066
7 0.3856 0.8370 0.4007 0.3167
8 0.3855 0.8370 0.4009 0.2673
9 0.3855 0.8370 0.4011 0.2614
10 0.3854 0.8370 0.4013 0.2647
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4989 0.8363 0.4242 0.1936
2 0.3972 0.8451 0.4008 0.1856
3 0.3843 0.8500 0.3941 0.1895
4 0.3799 0.8510 0.3916 0.2117
5 0.3779 0.8539 0.3905 0.2053
6 0.3769 0.8529 0.3900 0.1819
7 0.3764 0.8520 0.3898 0.1928
8 0.3761 0.8510 0.3898 0.1832
9 0.3760 0.8500 0.3899 0.1861
10 0.3759 0.8510 0.3899 0.1942
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.5122 0.8130 0.4207 0.1546
2 0.4112 0.8300 0.3918 0.1588
3 0.3964 0.8370 0.3836 0.1705
4 0.3915 0.8410 0.3804 0.1777
5 0.3893 0.8420 0.3791 0.1431
6 0.3883 0.8400 0.3785 0.1561
7 0.3878 0.8430 0.3783 0.1457
8 0.3875 0.8440 0.3783 0.1614
9 0.3874 0.8430 0.3783 0.1470
10 0.3874 0.8460 0.3784 0.1527
Using backend threading
Using backend threading
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4261 0.8450 0.3846 0.5254
2 0.3805 0.8500 0.3816 0.5058
3 0.3777 0.8480 0.3817 0.5006
4 0.3771 0.8490 0.3820 0.5077
5 0.3769 0.8500 0.3822 0.5350
6 0.3769 0.8510 0.3822 0.5379
7 0.3768 0.8510 0.3822 0.5068
8 0.3768 0.8510 0.3823 0.5018
9 0.3768 0.8510 0.3823 0.5917
10 0.3768 0.8510 0.3823 0.5405
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4615 0.8300 0.4188 0.3085
2 0.3944 0.8380 0.4059 0.2704
3 0.3880 0.8420 0.4028 0.2707
4 0.3863 0.8400 0.4019 0.3165
5 0.3857 0.8380 0.4016 0.2664
6 0.3855 0.8370 0.4015 0.2624
7 0.3854 0.8380 0.4015 0.2673
8 0.3854 0.8390 0.4015 0.2867
9 0.3854 0.8390 0.4015 0.2866
10 0.3854 0.8380 0.4015 0.2755
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4711 0.8284 0.4159 0.1916
2 0.3936 0.8373 0.3964 0.2079
3 0.3829 0.8422 0.3915 0.2168
4 0.3793 0.8451 0.3900 0.1939
5 0.3777 0.8490 0.3897 0.1833
6 0.3769 0.8490 0.3897 0.1863
7 0.3765 0.8500 0.3898 0.2119
8 0.3763 0.8500 0.3899 0.2143
9 0.3762 0.8490 0.3901 0.2452
10 0.3761 0.8500 0.3902 0.1894
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.5508 0.8100 0.4390 0.1565
2 0.4187 0.8380 0.4014 0.1456
3 0.4004 0.8420 0.3900 0.1927
4 0.3939 0.8440 0.3850 0.1590
5 0.3909 0.8480 0.3825 0.1454
6 0.3893 0.8470 0.3811 0.1493
7 0.3885 0.8460 0.3802 0.1676
8 0.3880 0.8440 0.3798 0.1418
9 0.3877 0.8430 0.3795 0.1457
10 0.3875 0.8430 0.3793 0.1608
Using backend multiprocessing
Using backend multiprocessing
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4196 0.8480 0.3868 0.5061
2 0.3805 0.8520 0.3827 0.5080
3 0.3780 0.8510 0.3825 0.5078
4 0.3774 0.8500 0.3825 0.5109
5 0.3771 0.8490 0.3825 0.5234
6 0.3770 0.8490 0.3825 0.5452
7 0.3770 0.8480 0.3825 0.5594
8 0.3769 0.8500 0.3824 0.5044
9 0.3769 0.8500 0.3824 0.5021
10 0.3769 0.8510 0.3824 0.5518
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4627 0.8330 0.4141 0.2845
2 0.3969 0.8390 0.4032 0.2606
3 0.3895 0.8360 0.4011 0.3675
4 0.3871 0.8350 0.4007 0.2591
5 0.3862 0.8370 0.4007 0.2517
6 0.3858 0.8360 0.4009 0.2732
7 0.3856 0.8370 0.4011 0.2692
8 0.3855 0.8360 0.4012 0.2773
9 0.3855 0.8370 0.4013 0.2745
10 0.3855 0.8370 0.4014 0.2972
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.4894 0.8353 0.4170 0.2135
2 0.3962 0.8520 0.3967 0.2107
3 0.3844 0.8529 0.3915 0.1853
4 0.3802 0.8510 0.3899 0.2013
5 0.3783 0.8500 0.3894 0.1808
6 0.3773 0.8500 0.3893 0.1910
7 0.3767 0.8500 0.3894 0.1899
8 0.3764 0.8510 0.3896 0.1952
9 0.3762 0.8500 0.3898 0.2132
10 0.3761 0.8500 0.3899 0.2240
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.5386 0.8200 0.4245 0.1483
2 0.4170 0.8360 0.3927 0.1550
3 0.3999 0.8420 0.3836 0.1409
4 0.3938 0.8430 0.3801 0.1403
5 0.3910 0.8410 0.3785 0.1733
6 0.3895 0.8410 0.3779 0.1465
7 0.3886 0.8470 0.3777 0.1506
8 0.3882 0.8450 0.3777 0.1417
9 0.3879 0.8450 0.3778 0.1496
10 0.3877 0.8460 0.3779 0.1471
could you please give detailed info about your environment (hardware, OS, versions of all packages, Python, etc.)?
- GPUs: Tesla V100-PCIE-16GB
- NVIDIA-SMI: 515.43.04
- Driver Version: 515.43.04
- CUDA Version: 11.7
- OS: Ubuntu 18.04.3 LTS
- Python: 3.9.15
- All packages:
Name | Version |
---|---|
_ipython_minor_entry_point | 8.7.0 |
_libgcc_mutex | 0.1 |
_openmp_mutex | 5.1 |
absl-py | 1.3.0 |
accelerate | 0.17.0 |
alembic | 1.8.1 |
anyio | 3.6.2 |
argon2-cffi | 21.3.0 |
argon2-cffi-bindings | 21.2.0 |
asttokens | 2.2.1 |
astunparse | 1.6.3 |
attrs | 22.1.0 |
autopage | 0.5.1 |
babel | 2.11.0 |
backcall | 0.2.0 |
backports | 1 |
backports.functools_lru_cache | 1.6.4 |
beautifulsoup4 | 4.11.1 |
bleach | 5.0.1 |
brotlipy | 0.7.0 |
ca-certificates | 2023.01.10 |
cachetools | 5.2.0 |
captum | 0.5.0 |
catboost | 1.0.6 |
category-encoders | 2.4.0 |
certifi | 2022.12.7 |
cffi | 1.15.0 |
chardet | 4.0.0 |
charset-normalizer | 2.1.1 |
click | 7 |
cliff | 4.1.0 |
cloudpickle | 2.2.0 |
cmaes | 0.9.0 |
cmd2 | 2.4.2 |
colorama | 0.4.6 |
colorlog | 6.7.0 |
configargparse | 1.5.3 |
cryptography | 3.4.8 |
cycler | 0.11.0 |
decorator | 5.1.1 |
defusedxml | 0.7.1 |
docker-pycreds | 0.4.0 |
einops | 0.6.0 |
entrypoints | 0.4 |
executing | 1.2.0 |
filelock | 3.9.0 |
flatbuffers | 1.12 |
flit-core | 3.8.0 |
gast | 0.4.0 |
gitdb | 4.0.10 |
gitpython | 3.1.29 |
google-auth | 2.15.0 |
google-auth-oauthlib | 0.4.6 |
google-pasta | 0.2.0 |
greenlet | 2.0.1 |
grpcio | 1.51.1 |
h5py | 3.7.0 |
huggingface-hub | 0.13.1 |
icecream | 2.1.1 |
idna | 2.1 |
importlib-metadata | 5.1.0 |
importlib_resources | 5.10.1 |
iniconfig | 1.1.1 |
ipykernel | 5.5.5 |
ipython | 8.7.0 |
ipython_genutils | 0.2.0 |
jedi | 0.18.2 |
jinja2 | 3.1.2 |
joblib | 1.2.0 |
json5 | 0.9.5 |
jsonschema | 4.17.3 |
jupyter_client | 7.0.6 |
jupyter_core | 5.1.0 |
jupyter_server | 1.23.3 |
jupyterlab | 3.5.1 |
jupyterlab_pygments | 0.2.2 |
jupyterlab_server | 2.16.5 |
keras | 2.9.0 |
keras-preprocessing | 1.1.2 |
kiwisolver | 1.4.4 |
ld_impl_linux-64 | 2.38 |
liac-arff | 2.5.0 |
libclang | 14.0.6 |
libffi | 3.4.2 |
libgcc-ng | 11.2.0 |
libgomp | 11.2.0 |
libsodium | 1.0.18 |
libstdcxx-ng | 11.2.0 |
lightgbm | 3.2.1 |
llvmlite | 0.38.1 |
mako | 1.2.4 |
markdown | 3.4.1 |
markupsafe | 2.1.1 |
matplotlib | 3.4.2 |
matplotlib-inline | 0.1.6 |
minio | 7.1.12 |
mistune | 2.0.4 |
nbclassic | 0.4.8 |
nbclient | 0.7.2 |
nbconvert | 7.2.6 |
nbconvert-core | 7.2.6 |
nbconvert-pandoc | 7.2.6 |
nbformat | 5.7.0 |
ncurses | 6.3 |
nest-asyncio | 1.5.6 |
notebook | 6.5.2 |
notebook-shim | 0.2.2 |
numba | 0.55.1 |
numpy | 1.20.3 |
nvidia-ml-py3 | 7.352.0 |
oauthlib | 3.2.2 |
openml | 0.12.2 |
openssl | 1.1.1s |
opt-einsum | 3.3.0 |
optuna | 2.10.0 |
packaging | 22 |
pandas | 1.5.3 |
pandoc | 2.19.2 |
pandocfilters | 1.5.0 |
parso | 0.8.3 |
pathtools | 0.1.2 |
patsy | 0.5.3 |
pbr | 5.11.0 |
pexpect | 4.8.0 |
pickleshare | 0.7.5 |
pillow | 9.3.0 |
pip | 22.3.1 |
pkgutil-resolve-name | 1.3.10 |
platformdirs | 2.6.0 |
plotly | 5.10.0 |
pluggy | 1.0.0 |
ply | 3.11 |
prettytable | 3.5.0 |
progressbar2 | 4.2.0 |
prometheus_client | 0.15.0 |
promise | 2.3 |
prompt-toolkit | 3.0.36 |
protobuf | 3.19.6 |
psutil | 5.9.4 |
ptyprocess | 0.7.0 |
pure_eval | 0.2.2 |
py | 1.11.0 |
pyarrow | 10.0.1 |
pyasn1 | 0.4.8 |
pyasn1-modules | 0.2.8 |
pycparser | 2.21 |
pygments | 2.13.0 |
pynvml | 11.4.1 |
pyopenssl | 20.0.1 |
pyparsing | 3.0.9 |
pyperclip | 1.8.2 |
pyrsistent | 0.18.0 |
pysocks | 1.7.1 |
pytest | 7.1.2 |
python | 3.9.15 |
python-dateutil | 2.8.2 |
python-fastjsonschema | 2.16.2 |
python-graphviz | 0.20.1 |
python-utils | 3.4.5 |
python_abi | 3.9 |
pytomlpp | 1.0.10 |
pytz | 2022.6 |
pyyaml | 6 |
pyzmq | 19.0.2 |
quantiphy | 2.18.0 |
readline | 8.2 |
regex | 2022.10.31 |
requests | 2.25.1 |
requests-oauthlib | 1.3.1 |
rotation-forest | 1 |
rsa | 4.9 |
scikit-learn | 1.2.2 |
scipy | 1.6.2 |
send2trash | 1.8.0 |
sentry-sdk | 1.11.1 |
setproctitle | 1.3.2 |
setuptools | 65.5.0 |
setuptools-scm | 7.0.5 |
shap | 0.39.0 |
shortuuid | 1.0.11 |
six | 1.16.0 |
skorch | 0.12.1 |
slicer | 0.0.7 |
smmap | 5.0.0 |
sniffio | 1.3.0 |
soupsieve | 2.3.2.post1 |
sqlalchemy | 1.4.45 |
sqlite | 3.40.0 |
stack_data | 0.6.2 |
statsmodels | 0.13.5 |
stevedore | 4.1.1 |
tabulardl | 0.1.0 |
tabulate | 0.9.0 |
tenacity | 8.1.0 |
tensorboard | 2.9.1 |
tensorboard-data-server | 0.6.1 |
tensorboard-plugin-wit | 1.8.1 |
tensorboardx | 2.6 |
tensorflow | 2.9.1 |
tensorflow-estimator | 2.9.0 |
tensorflow-io-gcs-filesystem | 0.28.0 |
termcolor | 2.1.1 |
termcolor-whl | 1.1.2 |
terminado | 0.17.1 |
threadpoolctl | 3.1.0 |
tinycss2 | 1.2.1 |
tk | 8.6.12 |
tokenizers | 0.13.2 |
tomli | 2.0.1 |
torch | 1.10.1 |
torch-summary | 1.4.5 |
tornado | 6.1 |
tqdm | 4.62.3 |
traitlets | 5.7.1 |
transformers | 4.26.1 |
typing_extensions | 4.4.0 |
tzdata | 2022g |
urllib3 | 1.26.13 |
wandb | 0.12.11 |
wcwidth | 0.2.5 |
webencodings | 0.5.1 |
websocket-client | 1.4.2 |
werkzeug | 2.2.2 |
wheel | 0.37.1 |
wrapt | 1.14.1 |
xgboost | 1.7.4 |
xmltodict | 0.13.0 |
xz | 5.2.8 |
yaspin | 2.2.0 |
zero | 0.9.1 |
zeromq | 4.3.4 |
zipp | 3.11.0 |
zlib | 1.2.13 |
Many thanks in advance.
Great, I'm asking colleagues, let's see if anything comes up.
Meanwhile, two more things to test:
Probably you already tested that, but using, say, scoring="accuracy" still gives the same error, right?
The issue is almost certainly related to the two GPUs, since the same code runs fine with 1 GPU. Also, we have 10000 samples; with cv=2 we expect 5000 test samples per fold, which matches the first number in the error, but for some reason the predictions only contain 2500 or 2560 samples, i.e. (roughly) half of 5000, as if the predictions were split between the GPUs. To validate this, could you please check that indicating a single GPU makes the error disappear?
accelerator = Accelerator(device_placement=False)
model = AcceleratedNeuralNetClassifier(
MyModule,
accelerator=accelerator,
device='cuda:0', # or 'cuda:1'
)
cross_validate(
model,
X,
y,
cv=2,
error_score="raise"
)
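As a rough back-of-the-envelope check of the "split between the GPUs" idea (assuming, not verified, that the cross_validate run used skorch's default batch_size of 128 and the failing search candidate used batch_size=20):

```python
import math

n_test = 10_000 // 2  # 5000 test samples per fold with cv=2 -- the first number in the errors

# if each of the 2 processes only keeps the predictions for its own shard of the test
# DataLoader, it ends up with roughly half of the batches (the last batch may be padded):
for batch_size in (20, 128):  # 20: failing search candidate; 128: assumed skorch default
    n_batches = math.ceil(n_test / batch_size)
    per_process = math.ceil(n_batches / 2) * batch_size
    print(batch_size, per_process)  # -> 2500 and 2560, the second numbers in the errors
```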
No progress yet but something more to test:
import copy

import torch
import torch.nn as nn
from accelerate import Accelerator
from sklearn.model_selection import KFold


class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.LogSoftmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X


X = torch.rand((10000, 100))
y = torch.randint(0, 2, size=(10000,))
model = MyModule()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accelerator = Accelerator()


def accuracy(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    return (y_true.cpu() == y_pred.cpu()).float().mean().item()


def _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test, max_epochs=10):
    model = copy.deepcopy(model)
    accelerator = copy.deepcopy(accelerator)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    dataset_train = torch.utils.data.TensorDataset(X_train, y_train)
    dataloader_train = torch.utils.data.DataLoader(dataset_train, batch_size=10)
    dataset_test = torch.utils.data.TensorDataset(X_test, y_test)
    dataloader_test = torch.utils.data.DataLoader(dataset_test, batch_size=10)

    model, optimizer = accelerator.prepare(model, optimizer)
    dataloader_train, dataloader_test = accelerator.prepare(dataloader_train, dataloader_test)

    # training
    model.train()
    for epoch in range(max_epochs):
        for source, targets in dataloader_train:
            optimizer.zero_grad()
            output = model(source)
            loss = nn.functional.nll_loss(output, targets)
            accelerator.backward(loss)
            optimizer.step()

    # validation
    model.eval()
    y_proba = []
    losses = []
    for source, targets in dataloader_test:
        output = model(source)
        loss = nn.functional.nll_loss(output, targets)
        y_proba.append(output)
        losses.append(loss)

    print(len(y_proba), {len(batch) for batch in y_proba})
    y_proba = torch.vstack(y_proba)
    y_pred = y_proba.argmax(1)
    print("test loss", (sum(losses) / len(losses)).item())
    print("accuracy:", accuracy(y_test, y_pred))


# training without joblib
for idx_train, idx_test in KFold(2).split(X, y):
    X_train, y_train = X[idx_train], y[idx_train]
    X_test, y_test = X[idx_test], y[idx_test]
    _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)

# training with joblib
from joblib import Parallel, delayed

parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)

# training with sklearn joblib
from sklearn.utils.parallel import Parallel, delayed

parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)
The idea here is to try to remove as much "fluff" as possible in order to isolate the problem. So skorch is completely removed, and from cross_validate
, I tried to only take the essential parts.
Thanks for your reply.
using, say, scoring="accuracy", still gives the same error, right?
Yes.
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
could you please check that indicating a single GPU makes the error disappear?
accelerator = Accelerator(device_placement=False)
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    device='cuda:0',
)
cross_validate(
    model,
    X,
    y,
    cv=2,
    error_score="raise"
)
I get a new error:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 70, in <module>
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 70, in <module>
cross_validate(
cross_validate(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
results = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
results = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return [func(*args, **kwargs)return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/classifier.py", line 141, in fit
estimator.fit(X_train, y_train, **fit_params)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/classifier.py", line 141, in fit
return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/net.py", line 1228, in fit
return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/net.py", line 1228, in fit
self.initialize()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/net.py", line 815, in initialize
self.initialize()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/net.py", line 815, in initialize
self._initialize_module()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/hf.py", line 948, in _initialize_module
self._initialize_module()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/hf.py", line 948, in _initialize_module
setattr(self, name + '_', self.accelerator.prepare(module))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1094, in prepare
setattr(self, name + '_', self.accelerator.prepare(module))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1094, in prepare
result = tuple(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1095, in <genexpr>
result = tuple(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1095, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 949, in _prepare_one
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 949, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1166, in prepare_model
return self.prepare_model(obj, device_placement=device_placement)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1166, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
model = torch.nn.parallel.DistributedDataParallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24510) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-29_16:42:35
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 24511)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-29_16:42:35
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 24510)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Also, this launches two processes on the same GPU, as if constraining device='cuda:0' messes with what Accelerate was configured to do (i.e., train on two GPUs).
accelerator = Accelerator(device_placement=False)
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    device='cuda:0',
)
cross_validate(
    model,
    X,
    y,
    cv=2,
    error_score="raise"
)
Training is OK. Note that this launches a single process on the GPU. However, I still get:
ValueError: Found input variables with inconsistent numbers of samples: [5000, 1670]
something more to test
I reply in the next comment.
something more to test
# training without joblib
for idx_train, idx_test in KFold(2).split(X, y):
    X_train, y_train = X[idx_train], y[idx_train]
    X_test, y_test = X[idx_test], y[idx_test]
    _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)
Error is:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {10}
250 {10}
test loss 0.6986234784126282
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 80, in <module>
_fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
print("accuracy:", accuracy(y_test, y_pred))
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
assert len(y_true) == len(y_pred)
AssertionError
test loss 0.6999748349189758
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 80, in <module>
_fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
print("accuracy:", accuracy(y_test, y_pred))
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
assert len(y_true) == len(y_pred)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 43418) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue_wofluff.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-29_17:03:54
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 43419)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-29_17:03:54
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 43418)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
FYI len(y_true)=5000
and len(y_pred)=2500
# training with joblib
from joblib import Parallel, delayed
parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)
Error is the same:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {10}
test loss 0.6977217793464661
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 85, in <module>
parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
print("accuracy:", accuracy(y_test, y_pred))
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
assert len(y_true) == len(y_pred)
AssertionError
250 {10}
test loss 0.6986113786697388
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 85, in <module>
parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
print("accuracy:", accuracy(y_test, y_pred))
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
assert len(y_true) == len(y_pred)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 45374) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue_wofluff.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-29_17:06:27
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 45375)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-29_17:06:27
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 45374)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
FYI len(y_true)=5000
and len(y_pred)=2500
# training with sklearn joblib
from sklearn.utils.parallel import Parallel, delayed
parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)
Error is the same:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {10}
test loss 0.6967058181762695
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 98, in <module>
parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
250 {10}
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
print("accuracy:", accuracy(y_test, y_pred))
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
assert len(y_true) == len(y_pred)
AssertionError
test loss 0.697378396987915
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 98, in <module>
parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
print("accuracy:", accuracy(y_test, y_pred))
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
assert len(y_true) == len(y_pred)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 46860) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue_wofluff.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-29_17:08:10
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 46861)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-29_17:08:10
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 46860)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
FYI len(y_true)=5000
and len(y_pred)=2500
Finally, when I change your assert to assert len(y_true.cpu()) == len(y_pred.cpu()), I still get the same AssertionError in all three cases.
Many thanks in advance for your feedback.
Thanks again, this is really helpful. In particular, since the first example without joblib already fails, joblib can't be the reason. This prompted me to look a bit more into the accelerate docs and I would like to test one more thing (sorry for the back and forth), namely calling gather explicitly, as described here:
https://huggingface.co/docs/accelerate/quicktour#distributed-evaluation
So IIUC, that means that in the evaluation part of _fit_and_score, you need to add output = accelerator.gather_for_metrics(output) after output = model(source).
In case this solves the issue, I would consider it a skorch bug. To quickly try a fix, you would need to subclass AccelerateMixin
and add the following method:
class MyAccelerateMixin(AccelerateMixin):
    def evaluation_step(self, batch, training=False):
        output = super().evaluation_step(batch, training=training)
        return self.accelerator.gather_for_metrics(output)
(or add this method to AcceleratedNeuralNetClassifier)
This would be more of a quick-and-dirty hack; I would need to investigate further how to do this most efficiently. So if it works, do check that your code actually runs faster with accelerate than without.
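For reference, a quick way to compare wall-clock times could look like the rough sketch below (hypothetical snippet; it assumes X, y and the two net variants, with and without accelerate, are already defined as in the earlier examples):

import time
from sklearn.model_selection import cross_validate

def timed_cv(estimator, X, y, cv=2):
    # Run cross_validate once and report the elapsed wall-clock time.
    start = time.perf_counter()
    cross_validate(estimator, X, y, cv=cv, error_score="raise")
    elapsed = time.perf_counter() - start
    print(f"{type(estimator).__name__}: {elapsed:.1f}s")
    return elapsed

# Hypothetical usage: compare the accelerated and the plain skorch net.
# timed_cv(accelerated_model, X, y)
# timed_cv(plain_model, X, y)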
Good news!
With a slight adaptation, gather_for_metrics can indeed solve the issue in the "no fluff" _fit_and_score version.
for source, targets in dataloader_test:
    outputs = model(source)
    # outputs = accelerator.gather_for_metrics(outputs)  # <= initial suggestion
    all_outputs, all_targets = accelerator.gather_for_metrics((outputs, targets))  # <= corrected
    loss = nn.functional.nll_loss(all_outputs, all_targets)
    y_proba.append(all_outputs)
    losses.append(loss)
Output:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {20}
250 {20}
test loss 0.6963825225830078
test loss 0.6963825225830078
accuracy: 0.5130000114440918
accuracy: 0.5016000270843506
250 {20}
250 {20}
test loss 0.6963870525360107
accuracy: 0.4991999864578247
test loss 0.6963870525360107
accuracy: 0.5005999803543091
250 {20}
250 {20}
test loss 0.6963825225830078
accuracy: 0.5130000114440918
test loss 0.6963825225830078
accuracy: 0.5016000270843506
250 {20}250
{20}
test loss 0.6963870525360107
test loss 0.6963870525360107
accuracy: 0.4991999864578247
accuracy: 0.5005999803543091
250 {20}
250 {20}
test loss 0.6963825225830078
accuracy: 0.5130000114440918
test loss 0.6963825225830078
accuracy: 0.5016000270843506
250 {20}
250 {20}
test loss 0.6963870525360107
test loss 0.6963870525360107
accuracy: 0.4991999864578247
accuracy: 0.5005999803543091
Note that gather_for_metrics
is only needed in the eval phase and not in the training phase.
I also tried the Skorch adaptation you mentioned, but I think I am incorrectly implementing it. Full code:
import torch
import numpy as np
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin
from accelerate import Accelerator
from sklearn.datasets import make_classification
import torch.nn as nn
import random
from sklearn.model_selection import cross_validate
from skorch.dataset import unpack_data

# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

# Generate data
X, y = make_classification(10_000, 100, n_informative=5, random_state=SEED)
X = X.astype(np.float32)
y = y.astype(np.int64)

# PyTorch module
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.Softmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X

# Skorch wrapper
class AcceleratedNeuralNetClassifier(
    AccelerateMixin,
    NeuralNetClassifier
):
    """NeuralNetClassifier with HuggingFace Accelerate support"""

    # First attempt
    # def evaluation_step(self, batch, training=False):
    #     output = super().evaluation_step(batch, training=training)
    #     return self.accelerator.gather_for_metrics(output)

    # Second attempt
    def evaluation_step(self, batch, training=False):
        """Perform a forward step to produce the output used for
        prediction and scoring.
        Preds and targets are gathered by the accelerator before return
        """
        self.check_is_fitted()
        Xi, targets = unpack_data(batch)
        with torch.set_grad_enabled(training):
            self._set_training(training)
            y_infer = self.infer(Xi)
            all_y_infer, all_targets = self.accelerator.gather_for_metrics((
                y_infer,
                targets
            ))
            return all_y_infer

accelerator = Accelerator()
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)
cross_validate(
    model, X, y,
    cv=2, scoring="average_precision", error_score="raise"
)
Both attempts produce the same error:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
epoch train_loss valid_acc valid_loss dur
------- ------------ ----------- ------------ ------
1 0.6753 0.6113 0.6674 0.0511
2 0.6300 0.6426 0.6356 0.0413
3 0.6005 0.6523 0.6136 0.0367
4 0.5796 0.6816 0.5972 0.0362
5 0.5640 0.6953 0.5845 0.0417
6 0.5520 0.7129 0.5742 0.0406
7 0.5423 0.7266 0.5658 0.0368
8 0.5344 0.7422 0.5587 0.0367
9 0.5278 0.7422 0.5527 0.0418
10 0.5222 0.7441 0.5476 0.0373
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 104, in <module>
cross_validate(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
Traceback (most recent call last):
File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 104, in <module>
results = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
cross_validate(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
if self.dispatch_one_batch(iterator):results = parallel(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
return super().__call__(iterable_with_config)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
if self.dispatch_one_batch(iterator):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
self._dispatch(tasks)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
job = self._backend.apply_async(batch, callback=cb)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
result = ImmediateResult(func)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
self.results = batch()
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
return [func(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
return self.function(*args, **kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
return self._sign * self._score_func(y, y_pred, **self._kwargs) File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
scores = scorer(estimator, X_test, y_test) File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 5120]
fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 5120]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 46386) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
multi_gpu_launcher(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
distrib_run.run(args)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-30_10:47:36
host : ML
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 46387)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-30_10:47:36
host : ML
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 46386)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Thanks a lot in advance for your ideas.
Good progress, I think we're getting close. Maybe I'll be able to get a multi-GPU setup to test soon.
The error now seems to be:
Found input variables with inconsistent numbers of samples: [5000, 5120]
I believe the reason is that accelerate tries to equalize the batch sizes for each GPU. Since skorch uses a batch size of 128 by default, accelerate pads the data with an additional 120 dummy samples so that each GPU gets 20*128 = 2560 samples, resulting in a total of 5120 samples. Now gather_for_metrics should in theory remove those dummy samples again; I'm not sure what is going wrong here.
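To make the arithmetic concrete, here is a small illustrative sketch (my own toy calculation, not accelerate's actual code) of how padding to a common number of full batches turns 5000 test samples into 5120 gathered predictions:

import math

n_samples = 5000   # test samples in one CV split
batch_size = 128   # skorch default
n_processes = 2    # two GPUs

# Assumption: the data is padded so that every process sees the same number of full batches.
batches_per_process = math.ceil(n_samples / (batch_size * n_processes))  # 20
total_gathered = batches_per_process * batch_size * n_processes          # 5120
n_dummy = total_gathered - n_samples                                     # 120
print(batches_per_process, total_gathered, n_dummy)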
By the way, the reason why, in my code snippet, I only gathered the predictions and not the targets is that the targets should not come from skorch: sklearn splits the data, and the y_test it uses never goes through torch, so it already has the correct size. We only need to gather the predictions. In the "no fluff" example, this was not the case, which is why gathering the targets there, as you did, was correct.
I'll think more about it or hopefully get to test it, but meanwhile, here are some suggested solutions:
Instead of overriding evaluation_step, we could try calling gather_for_metrics even earlier, in infer:
def infer(self, x, **fit_params):
    y_infer = super().infer(x, **fit_params)
    return self.accelerator.gather_for_metrics(y_infer)
So try adding this method instead of overriding evaluation_step in the custom net class.
Another option is to choose a batch size that divides the data without a remainder. That way, accelerate should not need to create dummy samples; e.g. for 10000 samples, a batch size of 100 should work. However, this is quite annoying, especially if the data is split into train/valid etc. (by default, skorch uses an 80/20 split). Depending on the size of the dataset, batching without remainder might not be possible (except for a batch size of 1). This might also require passing split_batches=True to Accelerator, I'm not completely sure.
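A minimal sketch of that option, assuming the MyModule and AcceleratedNeuralNetClassifier definitions from your script above (whether split_batches is really needed is an open question):

from accelerate import Accelerator

# With 10000 samples and cv=2, each fit sees 5000 samples; skorch's default
# 80/20 train/valid split gives 4000/1000, and a batch size of 100 divides
# both evenly, so accelerate should not have to pad with dummy samples.
accelerator = Accelerator(split_batches=True)  # possibly not required
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    batch_size=100,
)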
A third option is to simply truncate the excess outputs. This could be unsafe, i.e. it could mean that the wrong samples are truncated, but maybe it works. Add this method to the custom neural net class:
def forward(self, X, *args, **kwargs):
    y_infer = super().forward(X, *args, **kwargs)
    n = len(X)
    is_multioutput = len(y_infer) > 0 and isinstance(y_infer[0], tuple)
    if is_multioutput:
        return tuple(yi[:n] for yi in y_infer)
    return y_infer[:n]
The gather_for_metrics call might not be necessary with this fix. An issue here is that if the method is incorrect, it probably only affects the last batch, so results might look correct because only a few samples are wrong.
Finally, you could use accelerate only for training and run the validation without it. This is of course not nice because you want to make use of those GPUs, but at least training still seems to work fine. For this, it should be sufficient to not prepare the validation data loader:
def get_iterator(self, dataset, training=False):
    iterator = super().get_iterator(dataset, training=training)
    if not training:
        return iterator
    iterator = self.accelerator.prepare(iterator)
    return iterator
I have a multi-GPU instance now and can reproduce the error. Unfortunately, the suggested solution does not work; it appears that, for some reason, accelerate does not detect that it should truncate the excess samples. I'm investigating.
Great to hear that you can try it for yourself. Thanks a lot for your time. Please let me know if I can be of any help.
Okay, so I managed to kinda track down the problem. To keep it short: the gradient_state of the accelerator somehow diverges from the gradient_state of the data loader, which should not happen. The latter correctly detects that the batch is finished, so the hacky solution is to override the gradient state of the accelerator with the one from the data loader. Of course, it is still necessary to add the gather_for_metrics call. In sum, these two methods should be added:
def evaluation_step(self, batch, training=False):
    output = super().evaluation_step(batch, training=training)
    return self.accelerator.gather_for_metrics(output)

def get_iterator(self, dataset, training=False):
    iterator = super().get_iterator(dataset, training=training)
    self.accelerator.gradient_state = iterator.gradient_state
    return iterator
Could you please check that this solves your problem?
Update: I spoke to an accelerate dev and the issue is most likely that sklearn sometimes creates a copy.deepcopy of the estimator. In particular, this happens when calling any kind of hyper-parameter search and also cross_validate. However, accelerate relies on some references that may be broken when deepcopied, so there is no guarantee that anything will still work after deepcopying the accelerator instance. This would explain why you don't see any issues when using skorch without hyper-parameter search.
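To illustrate the kind of problem this can cause, here is a toy example (not skorch or accelerate code): objects that shared a reference before the copy end up with independent copies afterwards, so state updates no longer propagate.

import copy

class State:
    def __init__(self):
        self.step = 0

class Worker:
    def __init__(self, state):
        self.state = state  # shared reference

shared = State()
w1 = Worker(shared)
w2 = copy.deepcopy(w1)  # w2.state is now a separate State object

shared.step = 5
print(w1.state.step)  # 5: w1 still sees the shared state
print(w2.state.step)  # 0: the deepcopy lost the shared reference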
In the "no fluff" example I posted, I did add a deepcopy call, but there it doesn't seem to cause any issue. However, this stuff is tricky to replicate, so it is probably not exactly the same as what happens when using RandomizedSearchCV.
So what does it mean for this specific issue? Unfortunately, there is no guarantee that you will get correct results, even if the hack I posted above removes the error. I would recommend not using accelerate in this context.
Still, if you have 2 GPUs and the model is small enough that it can fit on each of them, it is possible to use grid search with skorch while leveraging both GPUs. This is documented here. Maybe that's a solution that can work for you.
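To sketch the general idea (a rough outline under assumptions, not necessarily the exact recipe from the skorch docs): run the candidates in separate worker processes and pin each one to its own GPU. Here net is assumed to be a plain, non-accelerate skorch classifier, and X, y the data from the earlier examples:

from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid, cross_val_score

def fit_on_device(base_net, params, device, X, y):
    # Clone the unfitted net, assign it a dedicated GPU, and evaluate one candidate.
    candidate = clone(base_net).set_params(device=device, **params)
    scores = cross_val_score(candidate, X, y, cv=2)
    return params, scores.mean()

param_grid = list(ParameterGrid({"lr": [0.1, 0.001]}))
results = Parallel(n_jobs=2)(
    delayed(fit_on_device)(net, params, f"cuda:{i % 2}", X, y)
    for i, params in enumerate(param_grid)
)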
Many thanks for the analysis and suggested alternative. It is a pity.
Do you think it is also unsafe to use Skorch + Accelerate + RandomizedSearchCV on a single GPU (e.g., to benefit from DeepSpeed)?
Finally, do you think this deserves opening an issue on Accelerate? I am not sure whether it can be considered a bug.
Anyway, thanks a lot for your help getting to the bottom of this and keep up the great work with this tool :)
Do you think it is also unsafe to use Skorch + Accelerate + RandomizedSearchCV on a single GPU (e.g., to benefit from DeepSpeed)?
Potentially it's the same issue because of the copy being created. Whether this can still cause reference issues when only one GPU is involved, I don't know. The answer is probably "it depends".
Interestingly, I did manage to find a potential solution by simply adding a __deepcopy__ method to Accelerator. First the code:
class MyAccelerator(Accelerator):
    def __deepcopy__(self, memo):
        cls = type(self)
        instance = cls()  # <= add more arguments here if needed
        return instance

# calling gather_for_metrics is still required
class MyNet(NeuralNetClassifier):
    def evaluation_step(self, batch, training=False):
        output = super().evaluation_step(batch, training=training)
        return self.accelerator.gather_for_metrics(output)

accelerator = MyAccelerator()
net = MyNet(..., accelerator=accelerator)
cross_validate(net, ...)
For my example, it worked. Maybe you can give it a spin for your real use case and report whether the results look correct. I'll consult with the accelerate devs on whether this could be a viable solution.
EDIT
Creating multiple instances of Accelerator per script is a bad idea according to the accelerate devs. A different solution could be:
class MyAccelerator(Accelerator):
    def __deepcopy__(self, memo):
        return self
Not sure if this can lead to trouble elsewhere down the line, but it works in my tests.
EDIT: changed nn.LogSoftmax to nn.Softmax
Thanks for your reply.
I conducted some tests that seem conclusive.
I compared running the following script on:
1 GPU (Skorch without Accelerate)
1 GPU (Skorch with Accelerate)
3 GPUs (Skorch with Accelerate)
import torch
import torch.nn as nn
import numpy as np
import random
from skorch import NeuralNetClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from accelerate import Accelerator
from skorch.hf import AccelerateMixin
from sklearn.metrics import average_precision_score

# Reproducibility
SEED = 42

def seed_everything(seed=42):
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.Softmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X

class AcceleratedNeuralNetClassifier(AccelerateMixin, NeuralNetClassifier):
    def evaluation_step(self, batch, training=False):
        output = super().evaluation_step(batch, training=training)
        return self.accelerator.gather_for_metrics(output)

class SkorchAccelerator(Accelerator):
    def __deepcopy__(self, memo):
        return self

seed_everything()
X, y = make_classification(
    1_000, 100,
    n_informative=5, random_state=SEED, flip_y=0.1
)
X = X.astype(np.float32)
y = y.astype(np.int64)
accelerator = SkorchAccelerator()

for i in range(3):
    seed_everything()
    model_skorch = AcceleratedNeuralNetClassifier(
        accelerator=accelerator, module=MyModule,
        max_epochs=1, verbose=False, batch_size=10, callbacks="disable"
    )
    gs = GridSearchCV(
        estimator=model_skorch,
        param_grid={
            "lr": [0.1, 0.001],
        },
        scoring="average_precision",
        n_jobs=1,
        cv=2,
        verbose=0,
        refit=False,
    )
    gs.fit(X, y)
    if accelerator.is_local_main_process:
        print(f"{gs.cv_results_['params']=}")
        print(f"{gs.cv_results_['mean_test_score']=}")

    # Manual refit
    best_model_skorch = AcceleratedNeuralNetClassifier(
        accelerator=accelerator, module=MyModule,
        max_epochs=1, verbose=False, batch_size=10, callbacks="disable",
        **gs.best_params_
    )
    best_model_skorch.fit(X, y)
    preds = best_model_skorch.predict_proba(X)[:, 1]
    score = average_precision_score(y, preds)
    if accelerator.is_local_main_process:
        print(f"{score=}")
        print("-" * 10)
1 GPU (Skorch without Accelerate)
Running CUBLAS_WORKSPACE_CONFIG=':4096:8' python script.py
NB: the script requires some edits to remove all Accelerate
business, namely:
...
# accelerator = SkorchAccelerator()
...
# model_skorch = AcceleratedNeuralNetClassifier(
model_skorch = NeuralNetClassifier(
    # accelerator=accelerator,
    module=MyModule,
    max_epochs=1, verbose=False, batch_size=10, callbacks="disable"
)
...
# if accelerator.is_local_main_process:
print(f"{gs.cv_results_['params']=}")
print(f"{gs.cv_results_['mean_test_score']=}")
...
# best_model_skorch = AcceleratedNeuralNetClassifier(
best_model_skorch = NeuralNetClassifier(
    # accelerator=accelerator,
    module=MyModule,
    max_epochs=1, verbose=False, batch_size=10, callbacks="disable",
    **gs.best_params_
)
...
# if accelerator.is_local_main_process:
print(f"{score=}")
print("-" * 10)
Output:
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
1 GPU (Skorch with Accelerate)
Running CUBLAS_WORKSPACE_CONFIG=':4096:8' accelerate launch script.py
Output:
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
3 GPUs (Skorch with Accelerate)
Running CUBLAS_WORKSPACE_CONFIG=':4096:8' accelerate launch script.py
Output:
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.7486652, 0.5338035])
score=0.8295460156186707
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.7486652, 0.5338035])
score=0.8295460156186707
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.7486652, 0.5338035])
score=0.8295460156186707
----------
Seems reasonable to me. Thanks a lot.
Thanks a lot for testing, the results look very reasonable. They're not 100% the same for 3 GPUs, but I think that's to be expected.
I will update this thread if I get more feedback from accelerate devs. For now, I think we can close the issue but if you encounter new problems, feel free to re-open.
Hi,
Thanks a lot for the great tool!
I tried the recently added HuggingFace Accelerate integration. I want to perform hyper-parameter optimization using Skorch with Accelerate + scikit-learn's RandomizedSearchCV.
However, it seems that they do not play nicely together at scoring time in RandomizedSearchCV.
Reproducible example named skorch_accelerate_issue.py:
Accelerate config to run this script on 2 GPUs on the same machine:
I ran the code using:
accelerate launch skorch_accelerate_issue.py
And here is the error:
FYI, when training starts, I can see that the two GPUs are indeed occupied. Also, when I get rid of the RandomizedSearchCV and just perform model.fit(X, y), training occurs as expected on 2 GPUs. Many thanks in advance for your help.