stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

GenderPerturbation does not perturb words at the beginning of the text #1732

Open shlomihod opened 1 year ago

shlomihod commented 1 year ago

Is there a reason why GenderPerturbation applies the perturbation only to words surrounded by non-alphanumeric characters, and not, for example, to a word at the beginning of the sentence?

>>> from random import Random
>>> from helm.benchmark.augmentations.gender_perturbation import GenderPerturbation
>>> RNG = Random(0)
>>> gender_perturbation = GenderPerturbation(mode="pronouns", prob=1.0, source_class="male", target_class="female", bidirectional=False)
>>> print(gender_perturbation.perturb("he went to the market, and there he had a soup.", RNG))
he went to the market, and there she had a soup.

This happens because of the regex: https://github.com/stanford-crfm/helm/blob/daa165aae07e575bdcb3b5cca699403c82c759dc/src/helm/benchmark/augmentations/gender_perturbation.py#L195-L198

Perhaps a better option would be to use the regex word boundary \b? Something like this: pattern = fr"\b({re.escape(word)})\b"
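
To illustrate (a toy demonstration, not the exact pattern HELM uses): a pattern that requires a non-alphanumeric character on each side of the word can never match at the start or end of the string, whereas \b is a zero-width boundary that can:

>>> import re
>>> text = "he went to the market, and there he had a soup."
>>> re.sub(r"([^a-zA-Z0-9])(he)([^a-zA-Z0-9])", r"\1she\3", text)  # misses the first "he"
'he went to the market, and there she had a soup.'
>>> re.sub(r"\b(he)\b", "she", text)  # \b also matches at string boundaries
'she went to the market, and there she had a soup.'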

yifanmai commented 1 year ago

@shlomihod You're right; this looks like a bug. We should fix this for future releases. But I also need to think about backward compatibility (i.e., unfortunately we can't evaluate the fixed perturbation on old models that have been deprecated).

yifanmai commented 1 year ago

@shlomihod Thanks again for catching this subtle bug!

yifanmai commented 1 year ago

cc @dilarasoylu regarding the GenderPerturbation bug that we should probably fix

shlomihod commented 1 year ago

On a related topic: I think there is a principled issue with this kind of perturbation: depending on the text, it might have no effect at all.

The dialect and robustness perturbations would probably transform many words, but a gender perturbation might leave a text entirely unchanged (e.g., reviews written in the first person). I think there should be an additional metric that reports the proportion of datapoints that were actually changed, something like a manipulation check in experimental design. Only if the manipulation is substantial (I'm not sure what the right criterion is, but here is a simple heuristic to start: > 50% of the examples changed) should we interpret the effect of the perturbation on the metric.
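
For concreteness, here is a minimal sketch of such a check (manipulation_rate is a hypothetical helper, not part of HELM; it reuses the perturb(text, rng) call from the example above):

from random import Random

from helm.benchmark.augmentations.gender_perturbation import GenderPerturbation

def manipulation_rate(perturbation, texts, rng):
    # Fraction of texts that the perturbation actually changed.
    changed = sum(perturbation.perturb(text, rng) != text for text in texts)
    return changed / len(texts)

perturbation = GenderPerturbation(mode="pronouns", prob=1.0, source_class="male", target_class="female", bidirectional=False)
texts = ["he went to the market.", "I loved this restaurant."]  # toy data
rate = manipulation_rate(perturbation, texts, Random(0))
# Only interpret the perturbed metric if the manipulation was substantial,
# e.g. using the > 50% heuristic above.
print(f"{rate:.0%} of examples changed; interpretable: {rate > 0.5}")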