microsoft / nlp-recipes

Natural Language Processing Best Practices & Examples

[ASK] Ability to specify "weight" to each training sample. #531

Open dunalduck0 opened 4 years ago

dunalduck0 commented 4 years ago

Description

I am following "Text Classification of MultiNLI Sentences using BERT" to train a binary classifier on my data. One thing special about my data is that the training labels are noisy. Following the paper on weak supervision training, one solution is to use a "noise-aware" loss function. I think it could be realized by attaching a "weight" to each training sample so that each sample is weighted differently when computing the loss, as in the sketch below. But I didn't find any way in the code to add this ability. Note that this is different from allowing weights on each label class as in CrossEntropyLoss.
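
A minimal sketch (not part of nlp-recipes) of what such a per-sample weighted loss could look like, assuming the model's `logits` and a `sample_weights` tensor aligned with the batch; the function and tensor names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def weighted_loss(logits, labels, sample_weights):
    # reduction="none" keeps one loss value per example instead of averaging
    per_example_loss = F.cross_entropy(logits, labels, reduction="none")
    # scale each example's loss by its (noise-aware) weight, then average
    return (per_example_loss * sample_weights).mean()

logits = torch.randn(4, 2)                            # batch of 4, binary classifier
labels = torch.tensor([0, 1, 1, 0])
sample_weights = torch.tensor([1.0, 0.3, 0.8, 1.0])   # lower weight = noisier label
print(weighted_loss(logits, labels, sample_weights))
```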

Other Comments

saidbleik commented 4 years ago

You can do so by sampling your examples according to some distribution, for example by using a torch.utils.data.WeightedRandomSampler instead of the existing RandomSampler here: https://github.com/microsoft/nlp-recipes/blob/staging/utils_nlp/models/transformers/sequence_classification.py#L257
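
A hedged sketch of that sampler swap; `train_dataset` and the per-example `weights` tensor here are stand-ins, not part of the repo's code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
train_dataset = TensorDataset(features, labels)

# one weight per training example; higher weight means the example is drawn more often
weights = torch.rand(len(train_dataset))

sampler = WeightedRandomSampler(weights, num_samples=len(train_dataset), replacement=True)
train_dataloader = DataLoader(train_dataset, sampler=sampler, batch_size=16)
```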

daden-ms commented 4 years ago

@dunalduck0 My immediate thought on this issue is that you need to pass the weight in through the batch function and multiply the weight with the input_ids in https://github.com/microsoft/nlp-recipes/blob/staging/utils_nlp/models/transformers/sequence_classification.py#L76. To keep the weighting explainable, you probably need to avoid the case where only max_steps is specified, because in your last epoch you may not iterate over the whole dataset.
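
A rough sketch of threading a per-example weight through the batch; note that instead of multiplying it into input_ids, this variant applies the weight to the per-example loss (the more common formulation of a noise-aware loss). All tensor names here are illustrative assumptions, not the repo's API:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

input_ids = torch.randint(0, 1000, (100, 32))   # stand-in for tokenized inputs
labels = torch.randint(0, 2, (100,))
weights = torch.rand(100)                        # one weight per training example

dataset = TensorDataset(input_ids, labels, weights)
loader = DataLoader(dataset, batch_size=16)

for batch_input_ids, batch_labels, batch_weights in loader:
    logits = torch.randn(batch_input_ids.size(0), 2)  # placeholder for model(batch_input_ids)
    per_example = F.cross_entropy(logits, batch_labels, reduction="none")
    loss = (per_example * batch_weights).mean()
    # loss.backward(); optimizer.step(); ...
    break
```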