pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

QST: "Dummy" is rooted in ableist language #35724

Open RollingStar opened 4 years ago

RollingStar commented 4 years ago

Question about pandas

Although extremely common in the industry, "dummy" has some unfortunate history. One current use is for substitutes - mannequins, stand-ins, etc. This use grew from its original definition, "mute person". Mute people are not substitutes or stand-ins and I would prefer Pandas to not contribute to this view. There are other words, like "indicator", for statistics.

Pandas currently uses "get_dummies" as a function name, with the documentation referencing "indicator" as a synonym.

Citations:

  1. https://www.etymonline.com/word/dummy
  2. https://www.etymonline.com/word/dumb
  3. https://www.etymonline.com/word/indicator

TomAugspurger commented 4 years ago

Thanks for opening an issue.

How should we balance this against the cost of changing it to something like get_indicators or onehot_encode (the deprecation warnings users would see, and need to update for)? I'm having trouble weighing the two in my head.

MarcoGorelli commented 4 years ago

Granted I'm punching above my weight by commenting on API design, but would it be possible to make get_indicators an alias of get_dummies, so it can be used whilst not breaking other people's code?

Given how painfully common it is to see warnings.filterwarnings("ignore"), I fear deprecation warnings would be ignored by many
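To make the concern concrete, here is a minimal sketch using only the standard library. Both functions are hypothetical stand-ins that share their names with the proposals above; they are not pandas code. It shows a deprecation-style `FutureWarning` that is recorded when warnings are enabled and never seen once a blanket `warnings.filterwarnings("ignore")` is in effect.

```python
import warnings


def get_indicators(values):
    # Hypothetical stand-in for the real encoding logic:
    # it only returns the distinct categories, sorted.
    return sorted(set(values))


def get_dummies(values):
    # Hypothetical old name, kept as a wrapper that warns before delegating.
    warnings.warn(
        "get_dummies is deprecated; use get_indicators instead",
        FutureWarning,
        stacklevel=2,
    )
    return get_indicators(values)


# With warnings enabled, the deprecation is recorded.
with warnings.catch_warnings(record=True) as shown:
    warnings.simplefilter("always")
    get_dummies(["a", "b", "a"])

# With a blanket ignore filter, the same call reaches nobody.
with warnings.catch_warnings(record=True) as hidden:
    warnings.filterwarnings("ignore")
    get_dummies(["a", "b", "a"])

print(len(shown), len(hidden))
```

A silent alias avoids this problem only because it does not rely on the warning being read in the first place.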

galipremsagar commented 3 years ago

+1 for get_indicators

MarcoGorelli commented 2 years ago

@pandas-dev/pandas-core anyone have any thoughts/objections on going through a deprecation cycle to rename get_dummies to get_indicators in version 2.0?

simonjayhawkins commented 2 years ago

we should be consistent with the rest of the ecosystem. What are other projects doing?

MarcoGorelli commented 2 years ago

sklearn calls it OneHotEncoder

get_onehotencoding?

datapythonista commented 2 years ago

My understanding is that one hot encode and dummies are almost the same but not completely. OHE has a column per category in the results, while dummies has one less to avoid redundancy.

I'm +1 on using the actual names and avoiding the confusion of finding equivalents, unless there are reports of people or communities being personally offended by our wording, which I don't think is the case.

But no big deal if the rest of the devs reach consensus on renaming. I just think we'd be wasting our users' time by renaming for a reason that I personally don't see as insulting or a problem to anyone (I may be wrong).

toobaz commented 2 years ago

Can't judge on whether to drop "dummies" (English is not my mother tongue and I had never associated a pejorative effect with the "mannequin" meaning of "dummy"), but on what to replace it with, I'm pretty sure "OneHotEncoding" would sound weird to most people I know (social scientists). We'd rather go for get_booleans, which at least is a term pandas users are likely to already be accustomed to (although I do see the downside that the returned dtype is int, not bool). Or even just "categories" (despite the returned dtype not being categorical :-D ). I would still consider "indicators" better than "OneHotEncoding".

By the way: I don't see references to ableism, or even just to controversy in naming, in Wikipedia ... any link to better understand the issue?

MarcoGorelli commented 2 years ago

OHE has a column per category in the results, while dummies has one less to avoid redundancy.

By default, get_dummies also has one column per category. There is a drop_first argument, but the default value is False. Likewise, OHE has one column per category by default, but has an argument drop with which you can drop the first value.
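A minimal sketch of the pandas half of this point, using a made-up three-category Series: `pd.get_dummies` returns one column per category by default and one fewer with `drop_first=True`. The sklearn `OneHotEncoder` side is not shown here.

```python
import pandas as pd

# Made-up example data with three categories.
s = pd.Series(["a", "b", "c", "a"])

# Default (drop_first=False): one column per category.
full = pd.get_dummies(s)

# drop_first=True: the first category is dropped to avoid redundancy.
reduced = pd.get_dummies(s, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Only column names are printed because the dtype of the returned columns differs between pandas versions.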

By the way: I don't see references to ableism, or even just to controversy in naming, in Wikipedia ... any link to better understand the issue?

For a start it's discouraged in the Google Developer documentation style guide

toobaz commented 2 years ago

For a start it's discouraged in the Google Developer documentation style guide

(True, but without any references and any mention of ableism... and surrounded by dozens of discouraged and definitely not offensive terms, and ironically linking to the Wikipedia page with the discouraged name)

datapythonista commented 2 years ago

And I wouldn't take as a reference of morality a company in the business of mass surveillance, censorship, monopoly abuses, political interference and brainwashing. ;)

Dr-Irv commented 2 years ago

The term "dummy variable" is in wide use in how people learn about encoding categorical data and is not specific to software. The references on the Wikipedia article refer to a paper from 1957 that uses the term. It probably appears in statistics textbooks. SPSS uses the term in their documentation. IMHO, until the statistics/data science community at large decides to deprecate the language, it's not our responsibility to take the lead in doing so.

Having said that, having an alternate name such as get_indicators() is appropriate, but I think we should not deprecate get_dummies() and just leave it there, but no longer document it.

I also have to wonder how the publishers of the "XYZ for Dummies" series of books have handled this issue.

Finally I found it interesting to contrast the order of the definitions of the word "dummy" shown in these three references: https://www.merriam-webster.com/dictionary/dummy https://www.dictionary.com/browse/dummy https://dictionary.cambridge.org/us/dictionary/english/dummy

For Merriam-Webster, the first category is related to not speaking or being stupid. For dictionary.com, the first category is "a representation or copy of something, as for displaying to indicate appearance:" For the Cambridge dictionary, the first category is "a large model of a human, especially one used to show clothes in a store"

One of the definitions reminded me that the term is also used as a word for a baby's pacifier.

MarcoGorelli commented 2 years ago

Hey @toobaz - what kind of reference are you looking for? You can find articles like this one which refers to it as an ableist term if you look. Likewise this one from HBR. I won't overload you with links though, as I'm not sure what you're looking for or what kind of source you'd accept

I think we should not deprecate get_dummies() and just leave it there, but no longer document it.

Agreed, I'd suggested this at the top - rename it, but continue to silently support the current name and not break people's code

bashtage commented 2 years ago

I also feel that dummy variable is so widespread that renaming would create a lot of confusion. It is much more commonly taught than one hot encoding. The wiki article for one-hot mentions dummy in the first line - the reverse is not true and one-hot only makes it into a footnote.

I think there isn't a settled alternative to dummy, and until there is, no change should be made. Once the world converges on something and mostly stops using dummy variable, then that should be adopted. More or less how the master->main change worked in pandas.

toobaz commented 2 years ago

You can find articles like this one which refers to it as an ableist term if you look. Likewise this one from HBR.

Are you sure? I can't find the word "dummy" in either of the two, let alone a reference to the "mannequin" meaning. But yes, these would otherwise have been somewhat better references than the Google documentation style guide.

MarcoGorelli commented 2 years ago

OK, they use "dumb", from which "dummy" comes (https://www.etymonline.com/word/dummy)

I'm OK with not doing this if others would prefer not to anyway, it just seemed like the moment to bring this up, as otherwise it'll be a while until 3.0

MarcoGorelli commented 2 years ago

Doesn't look like there's much support for renaming, so let's close for now to keep the queue down - the discussion can always be reopened in the future if necessary

Thanks anyway @RollingStar for the suggestion!

davidcavazos commented 2 years ago

From #48250 to keep the discussion in one place

The word "dummy" from the pd.get_dummies function can be offensive to some people and should be renamed.

It's marked as a word that should not be used by Google's inclusive language word list.

@TheNeuralBit commented:

A non-Google reference for "dummy" being non-inclusive: https://itconnect.uw.edu/guides-by-topic/identity-diversity-inclusion//inclusive-language-guide/

Why it’s problematic: The origin of the word, “dummy,” is a person who cannot speak. Because the use of this word is often negatively associated with a disability, implying a person is worthless, ineffective or incapable, an alternative word should be used.

Some other sources which flag the use of the word "dummy":

Another document mentioning how it causes harm:

“Dummy” and similar terms stigmatize mental disabilities. The alternatives are clearer.

MarcoGorelli commented 2 years ago

Thanks @davidcavazos

Reopening then - perhaps we can discuss this in the next dev meeting (which btw anyone is welcome to attend)

davidcavazos commented 2 years ago

So far, some alternatives are:

Maybe we could open a voting to finalize the name.

davidcavazos commented 2 years ago

Thanks @davidcavazos

Reopening then - perhaps we can discuss this in the next dev meeting (which btw anyone is welcome to attend)

Thanks, I've added it to my calendar

TheNeuralBit commented 2 years ago

Another point I'd like to bring over from https://github.com/pandas-dev/pandas/issues/48250#issuecomment-1227642106:

My takeaway from [the discussion in #35724] is that adding a separate get_indicators (or some other agreed upon alternative) would be amenable. From there we could either:

  • Deprecate and ultimately remove get_dummies, or
  • Prefer get_indicators in documentation to nudge users there

It seems the former was rejected, but the latter could be acceptable. Could we pursue that approach?

jbrockmendel commented 2 years ago

can be offensive to some people

Is there evidence on this?

Some other sources which flag the use of the word "dummy":

I do not find these compelling. They also suggest replacing "normal" with "typical". Should scipy/statsmodels deprecate references to the Normal Distribution?

https://twitter.com/jbarro/status/1467250971361386505

“inclusive language” — that is, the creation of a long list of weird required adjustments to language, separating those who know and subscribe to all the latest rules from those who don’t — is not actually inclusive.

davidcavazos commented 2 years ago

Nobody expects anyone to know all the words, but there's also a long historical background of poor choices of words which convey a negative context or are sensitive to groups of people (like master, slave, kill, etc). Fortunately there are people who have invested the time of compiling these words into lists to make them more searchable. Many fall into gray areas, but there are some which make sense to change. That's why GitHub renamed the main branch name from master, even if that was pretty disruptive at the time.

attack68 commented 2 years ago

"Master and slave" is such an unequivocal and obvious corporate reputational risk that it had to be changed.

In my opinion, "dummy" in the context of dummy variables offers no offensive connotation. Dictionary definitions of dummy variables make no reference to it, the wikipedia article on dummy variables makes no reference and the widespread use of it in scientific papers suggests to me anyone finding that particular use offensive in that context is overly sensitive. I consider the language to have evolved.

I am -1 on changing solely for the sake of sensitivity. I am +0.5 on the other suggestions to add function names, provided they are synchronised with other libraries.

toobaz commented 2 years ago

I agree with @attack68, and let me add that "master and slave" is computer science jargon that programmers, including pandas programmers, are agreeing to transition away from.

"Dummy" is established, technical jargon from statistics that pandas has adopted from, not imposed on, its users, who are mostly not involved in its development. We do not decide how statisticians (or whoever follows a statistics class) talk, and as of now, I have no doubt that our users will find "get_dummies" more understandable than the alternatives.

Now, I would never say "let's stick to what this mass of people do, whatever the harm we cause to users". But as mentioned by others, there is no indication that get_dummies is causing harm to any group/community. The technical use is derived from a meaning unrelated to disabilities ("mannequin") that itself has been well established in common parlance for almost two centuries.

Dr-Irv commented 2 years ago

We do not decide how statisticians (or whoever follows a statistics class) talk, and as of now, I have no doubt that our users will find "get_dummies" more understandable than the alternatives.

I said something similar above: https://github.com/pandas-dev/pandas/issues/35724#issuecomment-1033944660 The usage is well-established in the statistics literature and in packages like SAS and SPSS.

As a compromise, I'd like to suggest that we should create a get_indicators() method that is the same as get_dummies(), document get_indicators(), but leave get_dummies() in the API and just remove any documentation of it.

Since in Britain, a baby's pacifier is called a dummy, maybe this suggestion will pacify those who object to the current method name.

toobaz commented 2 years ago

The usage is well-established in the statistics literature and in packages like SAS and SPSS.

... and Stata, and R... the latter goes as far as providing a dummify function.

As a compromise, I'd like to suggest that we should create a get_indicators() method that is the same as get_dummies(), document get_indicators(), but leave get_dummies() in the API and just remove any documentation of it.

If we decided to go this route, the get_dummies() docstring should at least be "See get_indicators()"
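A rough sketch of what this compromise could look like, written as a user-side module rather than as pandas internals. `get_indicators` is a hypothetical name (pandas has no such function), and the local `get_dummies` below is a stand-in for the retained old name with the pointer docstring.

```python
import pandas as pd


def get_indicators(data, **kwargs):
    """Convert categorical values into indicator columns.

    Hypothetical name for illustration only: this wrapper forwards
    every argument to the existing pd.get_dummies.
    """
    return pd.get_dummies(data, **kwargs)


def get_dummies(data, **kwargs):
    """See get_indicators()."""
    # Stand-in for the retained old name: same behaviour, no warning.
    return get_indicators(data, **kwargs)


colours = pd.Series(["red", "green", "red"])
print(get_indicators(colours).equals(get_dummies(colours)))
```

Because the old name emits no warning, existing code keeps working unchanged, and only the documentation steers new users toward the new name.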

mroeschke commented 2 years ago

Noting that 1.5 just added from_dummies, so that method would need the same treatment as well: https://pandas.pydata.org/docs/dev/whatsnew/v1.5.0.html#from-dummies
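For context, a minimal round trip between the two functions, assuming pandas 1.5 or later (where `pd.from_dummies` is available) and made-up data. It shows why a rename of one would naturally pull the other along.

```python
import pandas as pd

# Made-up example data.
original = pd.Series(["a", "b", "c", "a"])

# Encode into one indicator column per category.
encoded = pd.get_dummies(original)

# Decode back; from_dummies returns a single-column DataFrame.
restored = pd.from_dummies(encoded)

print(restored.iloc[:, 0].tolist() == original.tolist())
```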

jreback commented 2 years ago

-0.5 on any change; as indicated this is a very common term

not completely averse though, as this is a one hot encoding operation; we could name it similarly to sklearn

kennknowles commented 2 years ago

I'm someone who isn't deeply involved with statistics or whatever realm this odd use of the term comes from. So at first glance it doesn't even make sense. By far the more common usage is "placeholder". So for people like me, which I think is most people in this case, the term is also esoteric or misleading, even aside from insensitivity. That's probably why the official docs immediately clarify with an alternate term "indicator" that is more common and sensible. Adding get_indicators and leaving get_dummies undocumented just for backwards-compatibility will improve the library for everyone.

kennknowles commented 2 years ago

When this was brought up to me, I had to look up what it did, and was surprised at what a terrible name this is for the function. But from this thread I do understand it is stats jargon. So my take is just an external view, that this particular piece of jargon is exceptionally badly chosen and there are multiple better choices in even broader use.

toobaz commented 2 years ago

By far the more common usage is "placeholder".

What are you basing your statements on? 4 different software packages were named from which people often move to pandas, and they all use "dummy". Sklearn has "one hot encoding". Then for sure pandas isn't perfectly equivalent to any of these, but I don't know anyone or anything that uses "placeholder". Wikipedia, in the page "dummy variable" (yes) does provide 6 alternatives: "indicator variable" is the first, "placeholder" is not one of them.

bashtage commented 2 years ago

In stats, I would say in order of commonality (with 3 and 4 being much rarer than 1 and 2):

  1. dummy
  2. indicator
  3. binary
  4. dichotomous

In ML, one hot encoding is common, although this is a description of the method used to create the dichotomous values rather than of the variable.

bashtage commented 2 years ago

I feel like some of the confusion is based on the usage of dummy in comp sci, which is often a simple version of something complex. The usage of dummy in statistics is not the same, and IME the intent of the word dummy in the context of statistics is not the same as it is in comp sci.

kennknowles commented 2 years ago

Yes, the wikipedia page on the stats use of the term lists the stats use of the term first :-)

I'm referring to the use of the term beyond stats, just to offer an outside perspective FWIW. The "comp sci" use is much more widespread than computing, in my experience. But I'm certainly not advocating for that use, either. It is also insensitive and not descriptive.