Hi @Jeffwan, I'm not able to reproduce the issue you are seeing. Here's my output from your repro commands: rotten_tomatoes_output.txt
One hypothesis is that the config generated by init_config sets trainer.batch_size to auto, which chooses the largest batch size that can fit in memory. Perhaps something odd happens when this is too large or too small, and the subsequent number-of-training-steps calculation becomes a ridiculously high number, i.e. 18446744073709551614 (notably 2**64 - 2, which suggests a negative value being reinterpreted as an unsigned 64-bit integer).
In my run, batch_size=auto selects batch_size=32768, which works out to 200 training steps (2 steps per epoch, for 100 epochs).
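For intuition, here is a rough sketch of the step arithmetic described above (this is not Ludwig's actual implementation; n_rows is an illustrative stand-in for the training split size):

import math

def total_training_steps(n_rows: int, batch_size: int, epochs: int) -> int:
    # One step per batch; steps per epoch is the number of batches needed
    # to cover the training split once.
    steps_per_epoch = math.ceil(n_rows / batch_size)
    return steps_per_epoch * epochs

# With batch_size=32768 the split fits in 2 batches, so 2 * 100 = 200 steps.
print(total_training_steps(n_rows=40_000, batch_size=32768, epochs=100))  # 200

# If an intermediate value ever goes negative and is reinterpreted as an
# unsigned 64-bit integer, you land near 2**64, e.g. 2**64 - 2:
print(2**64 - 2)  # 18446744073709551614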
Remarks:
1. What batch size is auto selecting for you? This should be available in the stdout under the MODEL box:
╒═══════╕
│ MODEL │
╘═══════╛
Tuning batch size...
Exploring batch_size=2
Exploring batch_size=4
Exploring batch_size=8
Exploring batch_size=16
Exploring batch_size=32
Exploring batch_size=64
Exploring batch_size=128
Exploring batch_size=256
Exploring batch_size=512
Exploring batch_size=1024
Exploring batch_size=2048
Exploring batch_size=4096
Exploring batch_size=8192
Exploring batch_size=16384
Exploring batch_size=32768
Selected batch_size=32768
2. Can you try setting batch_size to 128 explicitly? That should make the error go away.
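If it helps, pinning the batch size in the generated config could look something like the sketch below (assumes PyYAML is available; the config path is a placeholder for wherever your init_config output lives, and 128 is just a reasonable starting value):

import yaml

# Load the config produced by `ludwig init_config`, replace the "auto" batch size
# with an explicit value, and write it back; then re-run `ludwig train` as before.
with open("rotten_tomatoes.yaml") as f:  # placeholder path
    config = yaml.safe_load(f)
config.setdefault("trainer", {})["batch_size"] = 128
with open("rotten_tomatoes.yaml", "w") as f:
    yaml.safe_dump(config, f)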
Let me follow your steps and see how it goes. I will bring more details back later.
╒═══════╕
│ MODEL │
╘═══════╛
Warnings and other logs:
embedding_size (50) is greater than vocab_size (7). Setting embedding size to be equal to vocab_size.
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 49.73it/s]
Stage 1: 100%|██████████| 1/1 [00:00<00:00, 70.02it/s]
Stage 0: 100%|██████████| 1/1 [00:00<00:00, 68.45it/s]
(tune_batch_size_fn pid=2409) Tuning batch size...
(tune_batch_size_fn pid=2409) Exploring batch_size=2
Stage 0: : 3it [00:00, 20.79it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=4
(tune_batch_size_fn pid=2409) Exploring batch_size=8
(tune_batch_size_fn pid=2409) Exploring batch_size=16
Stage 0: : 6it [00:01, 4.53it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=32
(tune_batch_size_fn pid=2409) Exploring batch_size=64
Stage 0: : 8it [00:01, 3.71it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=128
Stage 0: : 9it [00:02, 3.47it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=256
Stage 0: : 10it [00:02, 3.21it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=512
Stage 0: : 11it [00:03, 2.93it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=1024
Stage 0: : 12it [00:03, 2.60it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=2048
Stage 0: : 13it [00:04, 2.20it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=4096
Stage 0: : 14it [00:05, 1.65it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=8192
Stage 0: : 15it [00:07, 1.05it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=16384
Stage 0: : 16it [00:10, 1.66s/it]
(tune_batch_size_fn pid=2409) Exploring batch_size=32768
Stage 0: : 17it [00:15, 2.57s/it]
(tune_batch_size_fn pid=2409) Selected batch_size=32768 -----> Same as your result; it chose 32768.
Read->Map_Batches: 0%| | 0/1 [00:00<?, ?it/s]
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 63.94it/s]
If I change batch_size to 128, the result is the same. I think the problem is that the number of training steps is computed as an absurdly large number, so even when training is making progress, the bar still shows 0%.
@justinxzhao Did you get a chance to look at this issue?
@Jeffwan Not yet, thanks for the ping. I'll plan to look at this tomorrow.
I wonder if this was addressed by #2455, where we disable auto batch size computation on CPU. @arnavgarg1 can you verify the behavior with batch size 128 and batch size 32768?
@tgaddair there's a chance that #2455 may help, but it looks like @Jeffwan is getting the same super large number of training steps even when he tried setting batch_size=128 manually.
That said, I'm still not able to reproduce. @Jeffwan would you be able to share the backend configuration / ray cluster that you are using?
For the record, @arnavgarg1 and I have tried a few different setups, all seeing a finite/reasonable number of training steps. Here's an example of what our logs look like:
╒══════════╕
│ TRAINING │
╘══════════╛
Force reads: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00, 842.57it/s]
Force reads: 100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 4514.86it/s]
Force reads: 100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 3682.44it/s]
2022-09-21 19:52:33,437 INFO trainer.py:223 -- Trainer logs will be logged in: /home/vscode/ray_results/train_2022-09-21_19-52-33
(pid=82020) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82020) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82020) warnings.warn(
(pid=82129) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82131) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82132) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82130) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82132) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82132) warnings.warn(
(pid=82129) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82129) warnings.warn(
(pid=82131) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82131) warnings.warn(
(pid=82130) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82130) warnings.warn(
2022-09-21 19:52:39,751 INFO trainer.py:229 -- Run results will be logged in: /home/vscode/ray_results/train_2022-09-21_19-52-33/run_001
(pid=82333) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82332) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82334) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82332) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82332) warnings.warn(
(pid=82333) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82333) warnings.warn(
(pid=82334) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82334) warnings.warn(
Stage 0: 0%| | 0/1 [00:00<?, ?it/s]
Stage 0: 0%| | 0/1 [00:00<?, ?it/s]
Stage 0: 0%| | 0/1 [00:00<?, ?it/s]
Stage 0: : 3it [00:00, 14.82it/s]
Stage 0: 100%|██████████| 1/1 [00:00<00:00, 3.99it/s]
Stage 0: : 2it [00:00, 5.21it/s]
Stage 0: 100%|██████████| 1/1 [00:00<00:00, 3.21it/s]
Stage 0: : 3it [00:00, 4.94it/s] pid=82333)
Stage 0: : 2it [00:00, 4.92it/s]
Stage 0: : 3it [00:00, 6.26it/s] pid=82334)
(PipelineSplitExecutorCoordinator pid=82332)
Stage 0: : 5it [00:00, 4.47it/s] pid=82332)
Training: 0%| | 0/100 [00:00<?, ?it/s](BaseWorkerMixin pid=82129) Training for 100 step(s), approximately 100 epoch(s).
(BaseWorkerMixin pid=82129) Early stopping policy: 5 round(s) of evaluation, or 5 step(s), approximately 5 epoch(s).
(BaseWorkerMixin pid=82129)
(BaseWorkerMixin pid=82129) Starting with step 0, epoch: 0
Training: 1%|▍ | 1/100 [00:02<04:56, 2.99s/it](BaseWorkerMixin pid=82129)
(BaseWorkerMixin pid=82129) Running evaluation for step: 1, epoch: 0
Stage 0: : 6it [00:04, 1.03it/s] pid=82332)
Evaluation train: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 1.65it/s]
Stage 0: : 4it [00:05, 1.87s/it] pid=82333)
Evaluation valid: 0%| | 0/1 [00:00<?, ?it/s]Stage 0: : 5it [00:05, 1.24s/it] pid=82333)
Evaluation valid: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 13.68it/s]
Stage 0: : 4it [00:05, 1.95s/it] pid=82334)
Evaluation test : 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 6.60it/s]
(BaseWorkerMixin pid=82129) ╒═══════════════╤════════╤═══════════╤════════════╕
(BaseWorkerMixin pid=82129) │ recommended │ loss │ roc_auc │ accuracy │
(BaseWorkerMixin pid=82129) ╞═══════════════╪════════╪═══════════╪════════════╡
(BaseWorkerMixin pid=82129) │ train │ 0.7070 │ 0.5000 │ 0.3636 │
(BaseWorkerMixin pid=82129) ├───────────────┼────────┼───────────┼────────────┤
(BaseWorkerMixin pid=82129) │ validation │ 0.7075 │ 0.4982 │ 0.3651 │
(BaseWorkerMixin pid=82129) ├───────────────┼────────┼───────────┼────────────┤
(BaseWorkerMixin pid=82129) │ test │ 0.7071 │ 0.4977 │ 0.3567 │
(BaseWorkerMixin pid=82129) ╘═══════════════╧════════╧═══════════╧════════════╛
(BaseWorkerMixin pid=82129) ╒════════════╤════════╕
(BaseWorkerMixin pid=82129) │ combined │ loss │
(BaseWorkerMixin pid=82129) ╞════════════╪════════╡
(BaseWorkerMixin pid=82129) │ train │ 0.7074 │
(BaseWorkerMixin pid=82129) ├────────────┼────────┤
(BaseWorkerMixin pid=82129) │ validation │ 0.7078 │
(BaseWorkerMixin pid=82129) ├────────────┼────────┤
(BaseWorkerMixin pid=82129) │ test │ 0.7074 │
(BaseWorkerMixin pid=82129) ╘════════════╧════════╛
(BaseWorkerMixin pid=82129) Validation roc_auc on recommended improved, model saved.
(BaseWorkerMixin pid=82129)
Stage 0: : 7it [00:05, 1.20s/it] pid=82332)
Stage 0: : 6it [00:05, 1.06s/it] pid=82333)
Training: 2%|▉ | 2/100 [00:06<05:27, 3.34s/it]
Let me give it another try, both before and after #2455, and I will come back with the results soon.
I think I figured out the issue. I was using the following command to generate the config:
ludwig init_config --dataset /data/rotten_tomatoes.csv --target=recommended --hyperopt=true --time_limit_s=300 --output /data/rotten_tomatoes.yaml
However, train won't use some of the fields from the hyperopt section, and even if we ignore those fields, the resulting values are not correct:
You are running the ludwig train command but there’s a hyperopt section present in your config. It will be ignored. If you want to run hyperopt you should use the following command: ludwig hyperopt
Do you want to continue? [Y/n]
If I use the following command instead, everything works fine:
ludwig init_config --dataset /data/rotten_tomatoes.csv --target=recommended --output /data/rotten_tomatoes.yaml
The following output is from using the config file with the hyperopt-specific params:
Batch size tuning is not supported on CPU, setting batch size from "auto" to default value 128
Selected batch_size=128
Tuning learning rate...
Explored learning_rate=1e-08 loss=0.7992839813232422
Explored learning_rate=1.202264434617413e-08 loss=0.7849849462509155
Explored learning_rate=1.731410312890155e-08 loss=0.7618222832679749
Explored learning_rate=2.9597020956312317e-08 loss=0.7420496344566345
Explored learning_rate=5.9210411588826243e-08 loss=0.7452374696731567
Explored learning_rate=1.3607494767826928e-07 loss=0.7498859763145447
Explored learning_rate=3.513593260782811e-07 loss=0.7462422847747803
Explored learning_rate=9.943719879202584e-07 loss=0.7415341734886169
Explored learning_rate=3.004311574344183e-06 loss=0.7285543084144592
Explored learning_rate=9.435130764252818e-06 loss=0.7398301959037781
Explored learning_rate=3.0010493090372423e-05 loss=0.7328073978424072
Explored learning_rate=9.43567745568922e-05 loss=0.7291122078895569
Explored learning_rate=0.0002869460933303299 loss=0.7344697117805481
Explored learning_rate=0.0008284881008114342 loss=0.7295047640800476
Explored learning_rate=0.0022373101695598936 loss=0.7274287343025208
Explored learning_rate=0.0055881398611649 loss=0.7365333437919617
Explored learning_rate=0.012814583769872613 loss=0.7518605589866638
Explored learning_rate=0.026877863169316788 loss=0.7698249220848083
Explored learning_rate=0.051535166763349925 loss=0.8014105558395386
Explored learning_rate=0.09053238995922058 loss=0.8797544240951538
Explored learning_rate=0.14636700569884875 loss=0.9657760262489319
Explored learning_rate=0.21912913218141009 loss=1.171215534210205
Explored learning_rate=0.3060173800622756 loss=1.1691129207611084
Explored learning_rate=0.4018134698435471 loss=1.3888840675354004
Explored learning_rate=0.5001020253914709 loss=1.4000484943389893
Explored learning_rate=0.594694552171643 loss=1.443132996559143
Explored learning_rate=0.6807341379265404 loss=1.7466425895690918
Explored learning_rate=0.7552201993691219 loss=2.5223426818847656
Selected learning_rate=3.0010493090372423e-05
╒══════════╕
│ TRAINING │
╘══════════╛
Training for 2425746845692806037241 step(s), approximately 9223372036854775808 epoch(s).
Early stopping policy: -1 round(s) of evaluation, or -263 step(s), approximately -1 epoch(s).
Starting with step 0, epoch: 0
Training: 0%| | 262/2425746845692806037241 [00:11<28559111390885009:38:08, 23.59it/s]
Running evaluation for step: 263, epoch: 0
Evaluation train: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 263/263 [00:03<00:00, 83.33it/s]
Evaluation valid: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:00<00:00, 62.63it/s]
Evaluation test : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:01<00:00, 73.64it/s]
In conclusion, I would say train cannot correctly calculate the number of training steps if the config has a hyperopt: section specified (see the workaround sketch after the config below).
hyperopt:
  search_alg:
    type: hyperopt
    random_state_seed: null
  executor:
    type: ray
    num_samples: 10
    time_budget_s: 300
    scheduler:
      type: async_hyperband
      time_attr: time_total_s
      max_t: 300
      grace_period: 72
      reduction_factor: 5
    cpu_resources_per_trial: 1
  parameters:
    trainer.learning_rate:
      space: choice
      categories:
        - 0.005
        - 0.01
        - 0.02
        - 0.025
    trainer.decay_rate:
      space: choice
      categories:
        - 0.8
        - 0.9
        - 0.95
    trainer.decay_steps:
      space: choice
      categories:
        - 500
        - 2000
        - 8000
        - 10000
        - 20000
    combiner.size:
      space: choice
      categories:
        - 8
        - 16
        - 24
        - 32
        - 64
    combiner.output_size:
      space: choice
      categories:
        - 8
        - 16
        - 24
        - 32
        - 64
        - 128
    combiner.num_steps:
      space: choice
      categories:
        - 3
        - 4
        - 5
        - 6
        - 7
        - 8
        - 9
        - 10
    combiner.relaxation_factor:
      space: choice
      categories:
        - 1.0
        - 1.2
        - 1.5
        - 2.0
    combiner.sparsity:
      space: choice
      categories:
        - 0.0
        - 1.0e-06
        - 0.0001
        - 0.001
        - 0.01
        - 0.1
    combiner.bn_virtual_bs:
      space: choice
      categories:
        - 256
        - 512
        - 1024
        - 2048
        - 4096
    combiner.bn_momentum:
      space: choice
      categories:
        - 0.4
        - 0.3
        - 0.2
        - 0.1
        - 0.05
        - 0.02
  output_feature: recommended
  metric: roc_auc
  goal: maximize
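In the meantime, a possible workaround is to strip the hyperopt section from the generated config before running ludwig train, for example with a small PyYAML script like the sketch below (the trimmed output filename is just an example):

import yaml

# Drop the hyperopt section added by `ludwig init_config --hyperopt=true` so that
# `ludwig train` only sees the fields it actually consumes.
with open("/data/rotten_tomatoes.yaml") as f:
    config = yaml.safe_load(f)
config.pop("hyperopt", None)
with open("/data/rotten_tomatoes_train_only.yaml", "w") as f:  # example output path
    yaml.safe_dump(config, f)

Then point ludwig train at the trimmed config, keeping the original file around for ludwig hyperopt runs.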
Hi @Jeffwan - great catch, and thanks for sharing. I was able to reproduce this locally using the steps you mentioned. I will investigate and have a fix for you next week.
Hi @Jeffwan, I've temporarily merged in a PR to prevent this from causing problems when running training through the CLI in the same way you're doing in this example. You can find a fix in Ludwig 0.6.2, which was released today.
There is still a problem when running this programmatically, i.e. if you clone our GitHub repo, install manually, and then perform model training via the LudwigModel object. A follow-up PR will be merged to address the deeper underlying issue in the next few weeks, but hopefully this unblocks you for now without having to remove the hyperopt part of your config from init_config.
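For context, the programmatic path that is still affected looks roughly like the sketch below (the config and dataset paths are placeholders taken from the commands earlier in this thread; LudwigModel accepts either a config file path or a dict):

from ludwig.api import LudwigModel

# Training via the Python API with a config that still contains a hyperopt
# section is the case the follow-up PR is meant to address.
model = LudwigModel(config="/data/rotten_tomatoes.yaml")  # placeholder path
results = model.train(dataset="/data/rotten_tomatoes.csv")  # placeholder path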
Describe the bug
Training: 0%|
I am curious whether this progress indicator should be updated during the training process. What does it mean, and why is it always 0%?
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The progress percentage should vary as training proceeds.