ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

RuntimeError if just one class in multi-class output feature #2994

Open Peetee06 opened 1 year ago

Peetee06 commented 1 year ago

Describe the bug If the training data contains just one class in a multi-class output feature, PyTorch raises a RuntimeError. From the error message I would guess that torch expects the class probability as a float (1.0) rather than an integer (1). I'm not 100% sure whether this issue stems from Ludwig or from PyTorch.

Traceback (most recent call last):
  File "src/train.py", line 81, in <module>
    train()
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "src/train.py", line 64, in train
    model.train(
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/ludwig/api.py", line 557, in train
    train_stats = trainer.train(
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 817, in train
    should_break = self._train_loop(
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 954, in _train_loop
    loss, all_losses = self.train_step(
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/ludwig/trainers/trainer.py", line 220, in train_step
    loss, all_losses = self.model.train_loss(
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/ludwig/models/base.py", line 204, in train_loss
    of_train_loss = of_obj.train_loss(targets[of_name], predictions, of_name)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/ludwig/features/base_feature.py", line 268, in train_loss
    return self.train_loss_function(predictions[prediction_key], targets)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/ludwig/modules/loss_modules.py", line 205, in forward
    return self.loss_fn(preds, target)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected floating point type for target with class probabilities, got Long
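The PyTorch error at the bottom of the traceback can be reproduced in isolation: `F.cross_entropy` interprets a target with the same shape as the logits as per-class probabilities and then requires a floating dtype (a minimal standalone sketch; the shapes are illustrative, not what Ludwig actually passes internally):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2)

# A target with the same shape as the logits is treated by F.cross_entropy
# as class probabilities, which must be a floating point tensor.
target = torch.tensor([[1, 0], [1, 0], [1, 0], [1, 0]])  # dtype is int64 (Long)

try:
    F.cross_entropy(logits, target)
except RuntimeError as e:
    print(e)  # Expected floating point type for target with class probabilities, got Long

# Casting the target to float makes the call succeed.
loss = F.cross_entropy(logits, target.float())
```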

I used the following config:

input_features:
  -
    name: audio_path
    type: audio
    preprocessing:
      audio_file_length_limit_in_s: 1.0
      type: fbank
      window_length_in_s: 0.025
      window_shift_in_s: 0.01
      num_filter_bands: 80
      norm: per_file
output_features:
    -
      name: label
      type: category
      num_classes: 2
trainer:
  epochs: 1  

To Reproduce Steps to reproduce the behavior:

  1. Train a model with a category output feature where the training data contains only one class.

Expected behavior Training finishes without an exception, or exits with a message saying that only one class is present.

arnavgarg1 commented 1 year ago

@Peetee06 Thanks for flagging this.

If I understand correctly, your output feature label only has 1 class? If yes, then this is actually an expected error (that in this case is raised by PyTorch).

At a very high level, it doesn't really make sense to train a machine learning model where the inputs vary but the output feature value is the same. Intuitively, one could just guess the same output value every single time and be right 100% of the time. If that is the case, one wouldn't really need a machine learning model since the "learning" being done is rather trivial. At a slightly more technical level, the reason this doesn't make sense is that your model is always going to learn to predict the same output value with high confidence, leading to the right prediction every time. This will result in virtually no back-prop happening and the weights won't get updated.
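To make the point concrete: with a single class in the data, a constant guess is already perfect, so there is essentially no gradient signal to learn from (a toy illustration, not Ludwig code):

```python
# Toy illustration: all training labels belong to the same class.
labels = ["spam"] * 100

# A constant predictor that always guesses that class is right every time.
predictions = ["spam"] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 1.0
```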

A couple of different things I wanted to note:

  1. Ludwig does allow input categorical features with a single value, but not output features with a single value. So if you had instead specified either:
output_features:
    -
      name: label
      type: category

or

output_features:
    -
      name: label
      type: category
      num_classes: 1

Ludwig would have raised a validation error upfront letting you know about the issue and what to do to resolve it. https://github.com/ludwig-ai/ludwig/blob/master/ludwig/features/category_feature.py#L124
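In other words, the upfront check behaves roughly like this (a hedged sketch with illustrative names; the real implementation lives in category_feature.py at the link above):

```python
def validate_output_num_classes(feature_name: str, num_classes: int) -> None:
    """Reject category output features that cannot be meaningfully trained."""
    if num_classes < 2:
        raise ValueError(
            f"Output feature '{feature_name}' has num_classes={num_classes}, "
            "but a category output feature needs at least 2 classes."
        )

validate_output_num_classes("label", 2)   # passes
# validate_output_num_classes("label", 1)  would raise a ValueError
```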

  2. As you've correctly pointed out, the error raised by PyTorch does seem to come from the weights assignment, where I believe the value is set to the integer 1 instead of the float 1.0, as you suggested. On my end, I can work on making sure that we always cast these values to float so this doesn't happen in the future.

Let me know what you think!

Peetee06 commented 1 year ago

Thanks for your answer, @arnavgarg1

I mean to train on 2 classes but accidentally generated training data that contained only 1 class.

I think casting to float will resolve the error.

Additionally, it might be useful to check whether the training data contains the same number of classes as specified in the model configuration's output_features. Ludwig could notify the user when the numbers don't match. What do you think?
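Such a check could look roughly like this (an illustrative sketch with hypothetical names, not Ludwig's actual API):

```python
def check_class_coverage(labels, expected_num_classes: int) -> None:
    """Raise if the training data covers fewer classes than the config declares."""
    observed = len(set(labels))
    if observed < expected_num_classes:
        raise ValueError(
            f"The config declares {expected_num_classes} classes, but the "
            f"training data contains only {observed} distinct value(s)."
        )

check_class_coverage(["yes", "no", "yes"], expected_num_classes=2)  # passes
```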

arnavgarg1 commented 1 year ago

@Peetee06 Definitely agreed with both of these things!

Are you potentially interested in contributing to Ludwig and making these fixes? Happy to work with you through them!

Peetee06 commented 1 year ago

@arnavgarg1 definitely!

That will be my first contribution to any open source project. How do I go about this? I saw there are guidelines in the docs. Do you have any additional info/tips?

justinxzhao commented 1 year ago

Hi @Peetee06, it's great to hear about your interest in contributing to Ludwig!

We have a contributing guide here: https://github.com/ludwig-ai/ludwig/blob/master/CONTRIBUTING.md

Also, I recommend joining our slack!

Peetee06 commented 1 year ago

Hi @justinxzhao, I read the guide and joined the Slack. Will get familiar with the repo next.

arnavgarg1 commented 1 year ago

@dalianaliu Re-assigning this to you since you expressed interest! Could be awesome for @Peetee06 and you to work together on fixing this issue in Ludwig :)

@justinxzhao and I are happy to help where needed!

Peetee06 commented 1 year ago

I've been trying to find the source in the code. Currently looking at line 204 in ludwig/modules/loss_modules.py:

target = target.long()

If I understand that correctly, the target tensor gets converted to Long here and is then passed up the call stack to here:

File "/home/trost/miniconda3/envs/mlflow-95b8fe275e9987dafa102b0ee13278631dbef1f5/lib/python3.8/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)

The error says it expects a floating point type. Why are we converting to long here?
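For context, `F.cross_entropy` accepts two kinds of targets with different dtype requirements, so converting to long is correct for the common class-index case; the error only appears when the target has the same shape as the logits, which switches PyTorch into class-probability mode and requires a floating dtype (a standalone sketch of PyTorch's documented behavior, not Ludwig's internals):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)

# Mode 1: class indices, shape (N,), integer dtype required.
idx_target = torch.tensor([0, 2, 1, 0])
loss_from_indices = F.cross_entropy(logits, idx_target)

# Mode 2: class probabilities, shape (N, C), floating dtype required.
prob_target = F.one_hot(idx_target, num_classes=3).float()
loss_from_probs = F.cross_entropy(logits, prob_target)

# For one-hot probabilities the two modes compute the same loss.
```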

arnavgarg1 commented 1 year ago

Hey @Peetee06, sorry last week got away from me and I haven't had a chance to look. Work has been pretty busy. I do plan to fix all of these issues possibly over this week!

dennisrall commented 1 year ago

Are there any updates on this issue?