ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Training percentage is not changed #2402

Open Jeffwan opened 2 years ago

Jeffwan commented 2 years ago

Describe the bug

The progress bar stays at `Training: 0%|`. Should this update during the training process? What does it mean, and why is it always 0?

Training:   0%|                                                                                                                                                                                                                                   | 19/18446744073709551614 [00:30<8397595460610142:26:08,  1.64s/it]
Running evaluation for step: 20, epoch: 9
Evaluation train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.48it/s]
Evaluation valid: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.17it/s]
Evaluation test : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.10it/s]
╒═══════════════╤═════════╤═══════════╤════════════╕
│ recommended   │    loss │   roc_auc │   accuracy │
╞═══════════════╪═════════╪═══════════╪════════════╡
│ train         │  9.4732 │    0.4988 │     0.6348 │
├───────────────┼─────────┼───────────┼────────────┤
│ validation    │ 10.2174 │    0.4984 │     0.6346 │
├───────────────┼─────────┼───────────┼────────────┤
│ test          │ 10.0118 │    0.4991 │     0.6395 │
╘═══════════════╧═════════╧═══════════╧════════════╛
╒════════════╤═════════╕
│ combined   │    loss │
╞════════════╪═════════╡
│ train      │  9.4732 │
├────────────┼─────────┤
│ validation │ 10.2174 │
├────────────┼─────────┤
│ test       │ 10.0118 │
╘════════════╧═════════╛

To Reproduce

Steps to reproduce the behavior:

ludwig init_config --dataset /data/rotten_tomatoes.csv --target=recommended --output /data/rotten_tomatoes.yaml
ludwig train --config rotten_tomatoes.yaml --dataset /data/rotten_tomatoes.csv

Expected behavior

The progress number should change as training proceeds.

Screenshots

[screenshot]

Environment (please complete the following information):

Additional context

justinxzhao commented 2 years ago

Hi @Jeffwan, I'm not able to reproduce the issue you are seeing. Here's my output from your repro commands: rotten_tomatoes_output.txt

One hypothesis is that the config that gets generated from init_config sets trainer.batch_size to auto, which chooses the largest batch size that can fit in memory.

Perhaps something weird happens when this is too large or too small, and the subsequent number-of-training-steps calculation becomes a ridiculously high number, i.e. 18446744073709551614.
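A plausible reading of that specific value: 18446744073709551614 is exactly 2**64 - 2, which is what -2 becomes when reinterpreted as an unsigned 64-bit integer, so a small negative step count upstream would produce exactly this kind of number. A minimal sketch of that arithmetic (not Ludwig code):

suspected_value = -2
wrapped = suspected_value % 2**64  # unsigned 64-bit wraparound
assert wrapped == 18446744073709551614 == 2**64 - 2
print(wrapped)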

In my run, batch_size=auto selects batch_size=32768, which is 200 training steps (2 steps per epoch, for 100 epochs).

Remarks:

  1. What batch size is auto selecting for you? This should be available in the stdout under the MODEL box.
╒═══════╕
│ MODEL │
╘═══════╛
Tuning batch size...
Exploring batch_size=2
Exploring batch_size=4
Exploring batch_size=8
Exploring batch_size=16
Exploring batch_size=32
Exploring batch_size=64
Exploring batch_size=128
Exploring batch_size=256
Exploring batch_size=512
Exploring batch_size=1024
Exploring batch_size=2048
Exploring batch_size=4096
Exploring batch_size=8192
Exploring batch_size=16384
Exploring batch_size=32768
Selected batch_size=32768
  2. Can you try setting the batch size in the config manually to a fixed number, e.g. 128? That should make the error go away (see the snippet below).
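For reference, a fixed batch size goes under the trainer section of the generated rotten_tomatoes.yaml (a minimal snippet; the rest of the generated config stays as-is):

trainer:
  batch_size: 128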
Jeffwan commented 2 years ago

Let me follow your steps and see how it goes. I will bring more details back later.

Jeffwan commented 2 years ago
╒═══════╕
│ MODEL │
╘═══════╛

Warnings and other logs:
  embedding_size (50) is greater than vocab_size (7). Setting embedding size to be equal to vocab_size.
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 49.73it/s]
Stage 1: 100%|██████████| 1/1 [00:00<00:00, 70.02it/s]
Stage 0: 100%|██████████| 1/1 [00:00<00:00, 68.45it/s]
(tune_batch_size_fn pid=2409) Tuning batch size...
(tune_batch_size_fn pid=2409) Exploring batch_size=2
Stage 0: : 3it [00:00, 20.79it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=4
(tune_batch_size_fn pid=2409) Exploring batch_size=8
(tune_batch_size_fn pid=2409) Exploring batch_size=16
Stage 0: : 6it [00:01,  4.53it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=32
(tune_batch_size_fn pid=2409) Exploring batch_size=64
Stage 0: : 8it [00:01,  3.71it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=128
Stage 0: : 9it [00:02,  3.47it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=256
Stage 0: : 10it [00:02,  3.21it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=512
Stage 0: : 11it [00:03,  2.93it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=1024
Stage 0: : 12it [00:03,  2.60it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=2048
Stage 0: : 13it [00:04,  2.20it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=4096
Stage 0: : 14it [00:05,  1.65it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=8192
Stage 0: : 15it [00:07,  1.05it/s]
(tune_batch_size_fn pid=2409) Exploring batch_size=16384
Stage 0: : 16it [00:10,  1.66s/it]
(tune_batch_size_fn pid=2409) Exploring batch_size=32768
Stage 0: : 17it [00:15,  2.57s/it]
(tune_batch_size_fn pid=2409) Selected batch_size=32768   -----> Same as your result: it chose 32768.
Read->Map_Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Read->Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 63.94it/s]

[screenshot]


If I change batch_size to 128, the result is the same.

[screenshot]

I feel the problem is that the training-step total is a huge number, so even when there is progress the bar still shows 0% (19 out of 18446744073709551614 steps rounds to 0%).

Jeffwan commented 2 years ago

@justinxzhao Did you get a chance to look at this issue?

justinxzhao commented 2 years ago

@Jeffwan Not yet, thanks for the ping. I'll plan to look at this tomorrow.

tgaddair commented 2 years ago

I wonder if this was addressed by #2455, where we disable auto batch size computation on CPU. @arnavgarg1 can you verify the behavior with batch size 128 and batch size 32768?
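For context, the idea there is to skip batch-size tuning entirely when no GPU is available and fall back to a fixed default. A rough sketch of that behavior (not the actual #2455 diff; the helper names here are hypothetical):

import torch

DEFAULT_BATCH_SIZE = 128  # the default that appears in the CPU logs below

def resolve_batch_size(requested, tune_fn):
    # Return a concrete batch size for `requested` ("auto" or an int).
    if requested != "auto":
        return requested
    if not torch.cuda.is_available():
        # Batch-size tuning is not supported on CPU; use the default instead.
        return DEFAULT_BATCH_SIZE
    return tune_fn()  # hypothetical tuner that produces the MODEL-box output above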

justinxzhao commented 1 year ago

@tgaddair there's a chance that #2455 may help, but it looks like @Jeffwan is getting the same super large number of training steps even when he tried setting batch_size=128 manually.

That said, I'm still not able to reproduce. @Jeffwan would you be able to share the backend configuration / ray cluster that you are using?

For the record, @arnavgarg1 and I have tried:

All of these runs show a finite/reasonable number of training steps. Here's an example of what our logs look like:

╒══════════╕
│ TRAINING │
╘══════════╛

Force reads: 100%|██████████████████████████████████████████████| 1/1 [00:00<00:00, 842.57it/s]
Force reads: 100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 4514.86it/s]
Force reads: 100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 3682.44it/s]
2022-09-21 19:52:33,437 INFO trainer.py:223 -- Trainer logs will be logged in: /home/vscode/ray_results/train_2022-09-21_19-52-33
(pid=82020) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82020) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82020)   warnings.warn(
(pid=82129) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82131) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82132) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82130) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82132) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82132)   warnings.warn(
(pid=82129) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82129)   warnings.warn(
(pid=82131) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82131)   warnings.warn(
(pid=82130) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82130)   warnings.warn(
2022-09-21 19:52:39,751 INFO trainer.py:229 -- Run results will be logged in: /home/vscode/ray_results/train_2022-09-21_19-52-33/run_001
(pid=82333) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82332) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82334) torchtext>=0.13.0 is not installed, so the following tokenizers are not available: {'bert'}
(pid=82332) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82332)   warnings.warn(
(pid=82333) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82333)   warnings.warn(
(pid=82334) /usr/local/lib/python3.8/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:16: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
(pid=82334)   warnings.warn(
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
Stage 0: : 3it [00:00, 14.82it/s]             
Stage 0: 100%|██████████| 1/1 [00:00<00:00,  3.99it/s]
Stage 0: : 2it [00:00,  5.21it/s]                     
Stage 0: 100%|██████████| 1/1 [00:00<00:00,  3.21it/s]
Stage 0: : 3it [00:00,  4.94it/s] pid=82333) 
Stage 0: : 2it [00:00,  4.92it/s]                     
Stage 0: : 3it [00:00,  6.26it/s] pid=82334) 
(PipelineSplitExecutorCoordinator pid=82332) 
Stage 0: : 5it [00:00,  4.47it/s] pid=82332) 
Training:   0%|                                                        | 0/100 [00:00<?, ?it/s](BaseWorkerMixin pid=82129) Training for 100 step(s), approximately 100 epoch(s).
(BaseWorkerMixin pid=82129) Early stopping policy: 5 round(s) of evaluation, or 5 step(s), approximately 5 epoch(s).
(BaseWorkerMixin pid=82129) 
(BaseWorkerMixin pid=82129) Starting with step 0, epoch: 0
Training:   1%|▍                                               | 1/100 [00:02<04:56,  2.99s/it](BaseWorkerMixin pid=82129) 
(BaseWorkerMixin pid=82129) Running evaluation for step: 1, epoch: 0
Stage 0: : 6it [00:04,  1.03it/s] pid=82332) 
Evaluation train: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00,  1.65it/s]
Stage 0: : 4it [00:05,  1.87s/it] pid=82333) 
Evaluation valid:   0%|                                                  | 0/1 [00:00<?, ?it/s]Stage 0: : 5it [00:05,  1.24s/it] pid=82333) 
Evaluation valid: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 13.68it/s]
Stage 0: : 4it [00:05,  1.95s/it] pid=82334) 
Evaluation test : 100%|██████████████████████████████████████████| 1/1 [00:00<00:00,  6.60it/s]
(BaseWorkerMixin pid=82129) ╒═══════════════╤════════╤═══════════╤════════════╕
(BaseWorkerMixin pid=82129) │ recommended   │   loss │   roc_auc │   accuracy │
(BaseWorkerMixin pid=82129) ╞═══════════════╪════════╪═══════════╪════════════╡
(BaseWorkerMixin pid=82129) │ train         │ 0.7070 │    0.5000 │     0.3636 │
(BaseWorkerMixin pid=82129) ├───────────────┼────────┼───────────┼────────────┤
(BaseWorkerMixin pid=82129) │ validation    │ 0.7075 │    0.4982 │     0.3651 │
(BaseWorkerMixin pid=82129) ├───────────────┼────────┼───────────┼────────────┤
(BaseWorkerMixin pid=82129) │ test          │ 0.7071 │    0.4977 │     0.3567 │
(BaseWorkerMixin pid=82129) ╘═══════════════╧════════╧═══════════╧════════════╛
(BaseWorkerMixin pid=82129) ╒════════════╤════════╕
(BaseWorkerMixin pid=82129) │ combined   │   loss │
(BaseWorkerMixin pid=82129) ╞════════════╪════════╡
(BaseWorkerMixin pid=82129) │ train      │ 0.7074 │
(BaseWorkerMixin pid=82129) ├────────────┼────────┤
(BaseWorkerMixin pid=82129) │ validation │ 0.7078 │
(BaseWorkerMixin pid=82129) ├────────────┼────────┤
(BaseWorkerMixin pid=82129) │ test       │ 0.7074 │
(BaseWorkerMixin pid=82129) ╘════════════╧════════╛
(BaseWorkerMixin pid=82129) Validation roc_auc on recommended improved, model saved.
(BaseWorkerMixin pid=82129) 
Stage 0: : 7it [00:05,  1.20s/it] pid=82332) 
Stage 0: : 6it [00:05,  1.06s/it] pid=82333) 
Training:   2%|▉                                               | 2/100 [00:06<05:27,  3.34s/it]
Jeffwan commented 1 year ago

Let me give it another try, both before and after #2455, and I will come back with the results soon.

Jeffwan commented 1 year ago

I think I figured out the issue. I was using the following command to generate the config:

ludwig init_config --dataset /data/rotten_tomatoes.csv --target=recommended --hyperopt=true --time_limit_s=300 --output /data/rotten_tomatoes.yaml

However, train won't use the hyperopt section, and even though that section is supposed to be ignored, the computed training-step values come out wrong:

You are running the ludwig train command but there’s a hyperopt section present in your config. It will be ignored. If you want to run hyperopt you should use the following command: ludwig hyperopt

Do you want to continue?  [Y/n]

If I use the following command (without the hyperopt flags), everything works fine:

ludwig init_config --dataset /data/rotten_tomatoes.csv --target=recommended  --output /data/rotten_tomatoes.yaml

The following output is from training with the config file that contains the hyperopt-specific params:

Batch size tuning is not supported on CPU, setting batch size from "auto" to default value 128
Selected batch_size=128
Tuning learning rate...
Explored learning_rate=1e-08 loss=0.7992839813232422
Explored learning_rate=1.202264434617413e-08 loss=0.7849849462509155
Explored learning_rate=1.731410312890155e-08 loss=0.7618222832679749
Explored learning_rate=2.9597020956312317e-08 loss=0.7420496344566345
Explored learning_rate=5.9210411588826243e-08 loss=0.7452374696731567
Explored learning_rate=1.3607494767826928e-07 loss=0.7498859763145447
Explored learning_rate=3.513593260782811e-07 loss=0.7462422847747803
Explored learning_rate=9.943719879202584e-07 loss=0.7415341734886169
Explored learning_rate=3.004311574344183e-06 loss=0.7285543084144592
Explored learning_rate=9.435130764252818e-06 loss=0.7398301959037781
Explored learning_rate=3.0010493090372423e-05 loss=0.7328073978424072
Explored learning_rate=9.43567745568922e-05 loss=0.7291122078895569
Explored learning_rate=0.0002869460933303299 loss=0.7344697117805481
Explored learning_rate=0.0008284881008114342 loss=0.7295047640800476
Explored learning_rate=0.0022373101695598936 loss=0.7274287343025208
Explored learning_rate=0.0055881398611649 loss=0.7365333437919617
Explored learning_rate=0.012814583769872613 loss=0.7518605589866638
Explored learning_rate=0.026877863169316788 loss=0.7698249220848083
Explored learning_rate=0.051535166763349925 loss=0.8014105558395386
Explored learning_rate=0.09053238995922058 loss=0.8797544240951538
Explored learning_rate=0.14636700569884875 loss=0.9657760262489319
Explored learning_rate=0.21912913218141009 loss=1.171215534210205
Explored learning_rate=0.3060173800622756 loss=1.1691129207611084
Explored learning_rate=0.4018134698435471 loss=1.3888840675354004
Explored learning_rate=0.5001020253914709 loss=1.4000484943389893
Explored learning_rate=0.594694552171643 loss=1.443132996559143
Explored learning_rate=0.6807341379265404 loss=1.7466425895690918
Explored learning_rate=0.7552201993691219 loss=2.5223426818847656
Selected learning_rate=3.0010493090372423e-05

╒══════════╕
│ TRAINING │
╘══════════╛

Training for 2425746845692806037241 step(s), approximately 9223372036854775808 epoch(s).
Early stopping policy: -1 round(s) of evaluation, or -263 step(s), approximately -1 epoch(s).

Starting with step 0, epoch: 0
Training:   0%|                                                                                | 262/2425746845692806037241 [00:11<28559111390885009:38:08, 23.59it/s]
Running evaluation for step: 263, epoch: 0
Evaluation train: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 263/263 [00:03<00:00, 83.33it/s]
Evaluation valid: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:00<00:00, 62.63it/s]
Evaluation test : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:01<00:00, 73.64it/s]
Jeffwan commented 1 year ago

In conclusion, I would say train cannot correctly calculate the number of training steps if the config has a hyperopt: section specified. Here is the hyperopt section generated by init_config (a workaround sketch follows it):

hyperopt:
  search_alg:
    type: hyperopt
    random_state_seed: null
  executor:
    type: ray
    num_samples: 10
    time_budget_s: 300
    scheduler:
      type: async_hyperband
      time_attr: time_total_s
      max_t: 300
      grace_period: 72
      reduction_factor: 5
    cpu_resources_per_trial: 1
  parameters:
    trainer.learning_rate:
      space: choice
      categories:
      - 0.005
      - 0.01
      - 0.02
      - 0.025
    trainer.decay_rate:
      space: choice
      categories:
      - 0.8
      - 0.9
      - 0.95
    trainer.decay_steps:
      space: choice
      categories:
      - 500
      - 2000
      - 8000
      - 10000
      - 20000
    combiner.size:
      space: choice
      categories:
      - 8
      - 16
      - 24
      - 32
      - 64
    combiner.output_size:
      space: choice
      categories:
      - 8
      - 16
      - 24
      - 32
      - 64
      - 128
    combiner.num_steps:
      space: choice
      categories:
      - 3
      - 4
      - 5
      - 6
      - 7
      - 8
      - 9
      - 10
    combiner.relaxation_factor:
      space: choice
      categories:
      - 1.0
      - 1.2
      - 1.5
      - 2.0
    combiner.sparsity:
      space: choice
      categories:
      - 0.0
      - 1.0e-06
      - 0.0001
      - 0.001
      - 0.01
      - 0.1
    combiner.bn_virtual_bs:
      space: choice
      categories:
      - 256
      - 512
      - 1024
      - 2048
      - 4096
    combiner.bn_momentum:
      space: choice
      categories:
      - 0.4
      - 0.3
      - 0.2
      - 0.1
      - 0.05
      - 0.02
  output_feature: recommended
  metric: roc_auc
  goal: maximize
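A workaround until the underlying fix lands (my own sketch, not an official recommendation) is to strip the hyperopt section from the generated config before running ludwig train:

import yaml

# Load the config produced by init_config, drop the hyperopt section,
# and write a train-only config. The output filename is hypothetical.
with open("/data/rotten_tomatoes.yaml") as f:
    config = yaml.safe_load(f)

config.pop("hyperopt", None)

with open("/data/rotten_tomatoes_train_only.yaml", "w") as f:
    yaml.safe_dump(config, f)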
arnavgarg1 commented 1 year ago

Hi @Jeffwan - great catch, and thanks for sharing. I was able to reproduce this locally using the steps you mentioned. I will investigate and have a fix for you next week.

arnavgarg1 commented 1 year ago

Hi @Jeffwan, I've temporarily merged in a PR to prevent this from causing problems when running training through the CLI in the same way you're doing in this example. You can find a fix in Ludwig 0.6.2, which was released today.

There is still a problem when running this programmatically: if you clone our GitHub repository, install manually, and then perform model training via the LudwigModel object, the issue can still occur. A follow-up PR will be merged to address the deeper underlying issue in the next few weeks, but hopefully the 0.6.2 release unblocks you for now without having to remove the hyperopt part of the config generated by init_config.
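For reference, the programmatic path described above looks roughly like this (a sketch based on the documented LudwigModel API, reusing the paths from the CLI repro):

from ludwig.api import LudwigModel

# Train with the same generated config and dataset as the CLI commands above.
model = LudwigModel(config="/data/rotten_tomatoes.yaml")
train_stats, preprocessed_data, output_dir = model.train(dataset="/data/rotten_tomatoes.csv")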