spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Software shouldn't guess gender #117

Closed ghost closed 8 years ago

ghost commented 8 years ago

People shouldn't guess gender and gendered pronouns. Neither should software. Would you write code to guess someone's race using heuristics?

PaulBGD commented 8 years ago

There's a lot of uses for guessing gender, even if it's not for using to talk to the user. Using guessed gender could be useful for tracking, better search results, and better analytics (Google Analytics tracks gender).

Drachenkaetzchen commented 8 years ago

@PaulBGD I feel offended by the algorithm.

komali2 commented 8 years ago

@scanlime Surely we are not letting this movement come this far. English has gendered pronouns, would it not make sense for a natural language processor to recognize that and utilize it?

Identifying a gender (or race, or religion) shouldn't be considered offensive, and that's all this program does. This program doesn't create glass wage ceilings, or cause a disparity in the number of female engineers in the tech industry, or promote violence against other races. It processes language. A language with gendered pronouns.

Maybe a better question - how would you write a Spanish or French language processor?

ghost commented 8 years ago

Okay: Consider how your algorithm would work when it gets the guess wrong. Consider whether it would be better to use an ungendered option like "they". I understand that human language includes a whole host of assumptions that encode cultural biases; this just seems like one area where it's easy to remove gender, and the benefits of leaving gender in are mostly to serve those for whom gender is purely an advertising marker and not a life or death matter.

PaulBGD commented 8 years ago

@scanlime You're suggesting that a single use case means that the entire feature should be removed. There are many, many reasons to try to guess gender besides displaying it to a user.

ghost commented 8 years ago

@PaulBGD You misunderstand the scope of my complaint. I'm not saying don't display the guess to the user, I'm saying don't guess at all. I don't see how one can make non-harmful marketing or search decisions based on guessed gender. Can you elaborate?

nbarbettini commented 8 years ago

> Would you write code to guess someone's race using heuristics?

In the context of an NLP library, I don't see how this would make sense. English doesn't have different pronouns for addressing races, but it does for addressing gender. Whether it should be that way is a different argument, I think.

> Consider whether it would be better to use an ungendered option like "they".

I agree that this is better in some cases as it's more general, but it adds some complexity because it's also plural.

vanita5 commented 8 years ago

@PaulBGD but why do you need to guess gender if you can use ungendered language? Besides being more inclusive you don't even need an algorithm for guessing.

PaulBGD commented 8 years ago

> Can you elaborate?

I listed a few examples in my first comment. A lot of tracking reasons come immediately to mind.

nonchip commented 8 years ago

> There's a lot of uses for guessing gender, even if it's not for using to talk to the user.

which? that don't break as soon as you let them out in the wild.

> Using guessed gender could be useful for tracking,

nobody likes skyblue/pink marketing, so nope.

> better search results,

how's that? only show trucks to guys? yeah no.

> and better analytics (Google Analytics tracks gender).

and fails to do so. as does twitter analytics (they actually flipped what they think would be my gender multiple times in a week).

> A lot of tracking reasons come immediately to mind.

which ones?

ghost commented 8 years ago

I'm trying to imagine a use for gender in search... the only thing that maybe comes to mind is searching contacts, but then it'd just make it harder to find all my genderqueer friends (or just people who aren't on the name whitelist heh).

nonchip commented 8 years ago

> I'm trying to imagine a use for gender in search

except for dating websites, which are broken by design anyway (see "meat market"), there's no (non harmful) use case i could imagine.

ghost commented 8 years ago

@nonchip In cases like that (dating, specifically gender-related sites) it would be wise to just let people specify a gender in some way that's appropriate for the platform

nonchip commented 8 years ago

but actually, the "laugh emoji" reactions to @felicitus' post already show us the error cause here: layer 8.

nonchip commented 8 years ago

@scanlime that's exactly my point. (also what i'm doing: using "varchar(256)" string inputs and autocompletion for suggestions, but even the dropdowns/radiobuttons you usually see are broken by design, and it can't get any better by letting some algo try and guess)

if interested, the relevant tags for HTML based UIs would be: <input type="text"/> and <datalist/> (http://www.w3schools.com/tags/tag_datalist.asp; see it in action: https://github.com/kinky-eureka/community/blob/master/views/profile_edit.moon#L36)
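A minimal sketch of the free-text approach described above, assuming a JavaScript helper that emits the markup as a string. The names `renderGenderField` and `escapeAttr` are hypothetical and not taken from the linked project; the `maxlength` of 256 simply mirrors the varchar(256) column mentioned above.

```javascript
// Escape text so it is safe inside a double-quoted HTML attribute value.
function escapeAttr(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: builds a free-text input whose suggestions come from
// a <datalist>. The suggestions are only hints; any typed value is accepted.
function renderGenderField(suggestions, fieldName = "gender") {
  const listId = `${fieldName}-suggestions`;
  const options = suggestions
    .map((s) => `  <option value="${escapeAttr(s)}"></option>`)
    .join("\n");
  return [
    `<input type="text" name="${escapeAttr(fieldName)}" list="${escapeAttr(listId)}" maxlength="256">`,
    `<datalist id="${escapeAttr(listId)}">`,
    options,
    `</datalist>`,
  ].join("\n");
}
```

Calling `renderGenderField(["female", "male", "non-binary"])` returns an `<input>` wired to a `<datalist>` through the `list` attribute, so the browser offers autocompletion without restricting what can be entered.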

nbarbettini commented 8 years ago

Maybe I'm missing something here: AFAICT, this is an optional minor feature of a broad library that does natural language processing. I would expect it to be wrong sometimes, in the way that machine "understanding" of text and society often is. Why is it a problem if this library offers pronoun guesses as an optional feature?

nonchip commented 8 years ago

@nbarbettini the problem starts with having this folder altogether: https://github.com/nlp-compromise/nlp_compromise/tree/master/src/data/names

guessing gender by name is so very broken by design, even if someone really thinks "male" and "female" should be the options to consider (in which case I'd advise them to take a look at what's called "reality").

also: https://www.youtube.com/watch?v=46ehrFk-gLk

jbilcke commented 8 years ago

I tried to write code to detect the gender using a self-compiled database of census / statistics from various countries

My feedback, and why it was futile to do so for my project:

The whole approach feels like a random guess with little practical value, especially if you just return a boolean value (and it carries the risk of offending someone if you do something like automatically assigning a color or an avatar).

At best, it could return something like a percentage of each gender, but then you would need real data.

The results would still be biased, but if you are studying a dataset, understand there is an unknown percentage of error, and are honest with the way you communicate about the guessed values, it could be useful for research.
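A rough sketch of the "percentage instead of a boolean" idea, under loud assumptions: the counts below are invented for illustration and are not real census data, and `genderShares` is a hypothetical helper, not something the library provides. It returns the observed share per recorded category together with the sample size, and returns null when there is too little data to say anything.

```javascript
// Illustrative only: these counts are made up for the example and are
// NOT real census or survey data.
const exampleCounts = new Map([
  ["jan", { female: 480, male: 520 }],
  ["pat", { female: 610, male: 390 }],
]);

// Hypothetical helper: reports the share of each recorded category for a
// name plus the sample size, or null when the name is unknown or the
// sample is smaller than minSamples.
function genderShares(name, counts, minSamples = 100) {
  const entry = counts.get(name.trim().toLowerCase());
  if (!entry) return null;
  const total = Object.values(entry).reduce((sum, n) => sum + n, 0);
  if (total < minSamples) return null;
  const shares = {};
  for (const [category, n] of Object.entries(entry)) {
    shares[category] = n / total;
  }
  return { shares, sampleSize: total };
}
```

Returning the sample size alongside the shares keeps the uncertainty visible to whoever consumes the value, which is the honesty about guessed values asked for above. The categories are whatever the underlying data recorded, so the output inherits that data's biases.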

nonchip commented 8 years ago

@nbarbettini also, about the "optional minor feature": that's true, but also what this ticket is about. as far as i can tell nobody wants to burn this project to the ground because of it. it's just that this one feature is a) broken by design (by the way language and people work) and b) especially broken in the way it's implemented (using 2 hardcoded lists to decide), as @jbilcke points out pretty well.

nbarbettini commented 8 years ago

@nonchip Gotcha, thanks for elaborating.

nonchip commented 8 years ago

(and, sadly, seems broken by opinion too: any benefits of gendered marketing/statistics tracking/etc are purely made up in the mind of people trying to do marketing/statistics without knowing how people and the real world work :P)

see also: https://www.youtube.com/watch?v=3JDmb_f3E2c

nonchip commented 8 years ago

imho, any use of "gender categorizing" except the statistics @jbilcke mentioned above (and even those only if you take the result with a few grains of salt) is just making up random values with no real meaning whatsoever.

btw I'm binary gendered myself, I just seem to know how people work better than those marketing people who think there'd be any benefit in using it :P

sotojuan commented 8 years ago

How about an option flag that turns gender guessing on or off?

rpearl commented 8 years ago

An option flag isn't relevant--you can already choose to use this feature or not. The issue is that having this feature at all is just fundamentally broken. Choosing to use it gets meaningless, noisy results.

nonchip commented 8 years ago

also, even offering the feature puts you into a hell of problems, because you simply can't get it right. so anyone who is "fooled" into using it by its existence is being put into a lot of trouble ranging from annoyed users who got guessed wrong, to annoyed bank accounts because you based your marketing on wrong data, to confused marketing folks themselves because they're getting utterly broken statistics.

it's like offering people little sugar balls you dropped a single molecule of some random plant extract into and telling them they'll feel better if they take it. oh wait. :P

rblalock commented 8 years ago

There's a really easy solution for this: For those not interested in a scientific approach to writing software....fork it and write your own....I'm sure there's some use somewhere out there for a computer program that can't identify things based on data.

nbarbettini commented 8 years ago

@nonchip I agree, although I can understand why an NLP library would try to offer it even if it's hard/impossible to do.

nonchip commented 8 years ago

@rblalock are you suggesting having 2 lists of names and trying to use them for anything useful is a "scientific approach"?

you know, science is used for understanding the world around us? so how can ignoring the world around us and inventing a pretty bad RNG based on usernames be "scientific"?

nonchip commented 8 years ago

@nbarbettini i can also see why it could/would try it (from a language analyzing POV), but i doubt it can be done at all, and i know it can't be done the way it's done here: by claiming there'd be exactly two genders and there'd be no people with names from the "wrong" list. also, listing names is a pretty bad idea, cause a) there are REALLY MANY of them, b) their use changes frequently, c) there are nicknames, pseudonyms and abbreviations, and that's where the real fun begins.

jbilcke commented 8 years ago

I can hardly think of "// if it ends in 'oh or uh', male" as a scientific approach based on data, in contrast to, perhaps, a large-scale survey where one would ask people their names and the genders they identify with (even if such surveys are biased, e.g. only proposing 2 categories):

http://deron.meranda.us/data/census-naming-method.txt

EDIT: to be fair, it is true that using rules is an elegant way of compressing information, and nlp-compromise has to stay lightweight. It all depends on the kind of data behind the rules.

rblalock commented 8 years ago

Watson does it (and failed a few times on Jeopardy lol. Like anything else...it's a software program to make things easier for a developer). Facebook does it. Lots of computer programs do it. Watson even has gender detection based on facial recognition: http://www.alchemyapi.com/products/alchemyvision/face-detection

rpearl commented 8 years ago

Facebook lets you choose one of very many gender options--I think the list has more than 50 options. If I remember correctly, you can also choose pronouns separately. There's no inference going on--they're tracking people's choices, not making decisions for them.

evilscientress commented 8 years ago

Just because other programs do it doesn't mean you have to replicate it. I'm offended by such programs because they get it wrong, and I guess that they get it wrong for a lot of female users in STEM. And sorry, purely guessing gender from names doesn't work. The same name may be used for females and males in different cultures. And then we haven't even scratched the surface of the fact that gender is not binary.

ghost commented 8 years ago

I hate it when a computer tries to guess my gender. I'm binary-gendered, but they're often confused by me and it's an echo of the same confusion present in mainstream society. This is an aspect of human behavior that we don't need to lazily clone into our automated systems.

nonchip commented 8 years ago

also, facebook DOES NOT do it: anyone who didn't set their gender themselves is always referred to as "they". also mentioned in the video by tom scott linked above.

apart from that: if "facebook does it" were an argument, we'd all be shouting racist crap at one another without anyone intervening, but getting banned from everything for even mentioning the existence of nipples. while we'd be selling every bit of private information for free.

also, nice @masterbase mentioned STEM: another "great example" of how gendered stuff goes wrong: I was actually rejected from a "women in STEM" informational event at my university, because their stupid list got my gender wrong -_-

they seriously took a look at me (relatively long hair, not-too-invisible boobs, definitely looking "stereotypically more female than male") and told me i'd have to go home because their excel sheet says i'm male. no more questions, your honor.

vielmetti commented 8 years ago

Having a list of names is only one way to guess. Here's another:

Scott Pakin. Regular Expressions and Gender Guessing. In Computer Language Magazine, 8(12):pp. 59-68, December 1991.

pakin1991.pdf

Code below is in awk. It uses a master pattern to identify male/female and then a series of guesses to refine things. It does fail on my name (it thinks that "Eddie" is female, but "Ed" is male) and generally shows how fragile it gets as the rule set gets longer and longer.

gender.awk.txt

More generally speaking, the whole question of assigning a gender to a name is fraught with peril and subject to error. Some common names (Chris, Pat) are not gender marked as strongly as others. Nicknames and diminutives create uncertainty.

If you look at http://www.genderchecker.com/ you'll see a service that returns three values for names - male, female, and unisex. It claims to know 100,000 international names.

nonchip commented 8 years ago

@rubenwardy that's exactly what I do: gender as a free-text input field, and "they" as every pronoun. works pretty well.

komali2 commented 8 years ago

A big reason that attempting to take an "agendered" approach to processing the English language fails is that English is not an agendered language. I applaud efforts to change how a language works (it'll take a titanic effort), but this is not an API for furthering the cause of LGBT language-changers. This is an API for processing the English language based on how it is currently used.

English use case (98% of English used in speech and writing) doesn't have people dancing on their toes to say "if he/she would like to go to the theater, then he/she will need to ensure to bring his/her," etc. "They" is the most common way to cast the net over gender possibilities, but it's still more common to simply say "he" or "she." The brute momentum of a thousand year old language cares not for LGBT sensitivity.

Results will return wrong for names like "Jordan" or "Sam," yes, absolutely. Further context processing will help improve that. Results aren't 100% for any aspect of this API. Should we lance the API then, until it's provably 100% accurate?

Blue/pink marketing is real, and effective. Activists attack it daily, but it exists, and it makes companies money. Companies will continue to do things that make them money, and so having a method to identify gender is useful to them. This is one such tool.

rpearl commented 8 years ago

> This is an API for processing the English language based on how it is currently used.

Inferring gender of a name in the way this codebase does is not "processing the English language". This codebase has insufficient context to make these inferences with any accuracy. Nor is making these predictions necessary to understand the contents of a sentence.

> Results will return wrong for names like "Jordan" or "Sam," yes, absolutely. Further context processing will help improve that. Results aren't 100% for any aspect of this API. Should we lance the API then, until it's provably 100% accurate?

Yes, this is what several people have been asking to happen (for this API call). This part of the API is small and optional... and performs poorly. Why have it?

> Blue/pink marketing is real, and effective. Activists attack it daily, but it exists, and it makes companies money. Companies will continue to do things that make them money, and so having a method to identify gender is useful to them. This is one such tool.

Money doesn't align with social good, so always pick the money?

vielmetti commented 8 years ago

I'm looking at the unit tests, and see

https://github.com/nlp-compromise/nlp_compromise/blob/master/test/unit_tests/nouns/person.js#L61

which has a very small test corpus (about 20 names) and which surprisingly to me has more than "Male" and "Female". There is a third gender, marked as "null", which returns both for nonsense names ("asdfefs") and for names with gender ambiguity ("Jan" is the test case).

If you were to want to fork this code and change the behavior, I think you'd start at

https://github.com/nlp-compromise/nlp_compromise/blob/master/src/term/noun/person/gender.js#L5

and just do

   return null;

You should know that whatever code that calls this will have to deal with the null case, but it's also worth noting that it's not as well documented as it might be that a null might be returned by the .gender() call.
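A minimal sketch of dealing with the null case on the caller's side, assuming the guess arrives as "Male", "Female", or null as in the unit tests linked above. `pronounFor` is a hypothetical helper written for this example, not part of the library, and it deliberately does not call the library at all.

```javascript
// Hypothetical helper, not part of the library. It maps a guessed value to a
// subject pronoun and falls back to singular "they" for null, undefined, or
// any value it does not recognize.
function pronounFor(guess) {
  switch (guess) {
    case "Male":
      return "he";
    case "Female":
      return "she";
    default:
      return "they";
  }
}

console.log(pronounFor(null)); // prints "they"
```

Putting the fallback in the `default` branch means a caller never has to special-case null, and a fork that always returns null would make every caller use "they" without further changes.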

TrayvonMalik commented 8 years ago

The mistake that a number of commenters here are making is assuming that the activists lobbying to remove this feature from the project are arguing in good faith, rather than simply attempting to exercise power by imposing their worldview on the maintainers of this project. The only rational response is to first google the term "kafkatrap" to familiarize yourself with their tactics, and then close the issue and ignore further attempts to reopen it.

Ironholds commented 8 years ago

"The mistake...is assuming the [bug-openers] are arguing in good faith"

looks at GitHub profile

Hmmnmnmmnmn.

nonchip commented 8 years ago

@rubenwardy the primary problem is not that it would/could be abused; the problem is that it doesn't and can't ever work without asking the user directly by definition (insert some smart remark about human free will here) ;)

h4rm0n1c commented 8 years ago

As a contributor to a number of projects, I've seen this disingenuous line of reasoning used by ideologues and "offendatrons" to bully other software projects into making pointless and silly changes.

The payoff for them is "we harassed this person into complying with our Point of View and requests, WE HAVE POWER!" (essentially, they're bullies who get off on using "mob justice" to enforce their views)

Once you give them what they want, they don't go away, they issue more demands, each new one more insane and illogical than the last.

Ignore this person, secure your accounts online in case they try to retaliate, and weather the storm, they'll go away once they realise you're not going to indulge them in their childish demands.

Also beware of new faces and people in your online-sphere once you've made your decision on whether to accept this request or not, I KNOW I sound paranoid, my only goal is to not see yet another github project "taken over" by dangerous ideologues in sheep's clothing.

"This offends me, CHANGE IT!" has never been a valid reason for changing code, or any other creative work for that matter.

If you don't believe me, check out the Master/Slave incident on Django, or the "issues" that these people opened on the DICSS library (it's a css library that is made of dick jokes).

rpearl commented 8 years ago

"This offends me, CHANGE IT!" has never been a valid reason for changing code

Please actually address the points made by the people requesting this change, because that's not one of them. The points include but are not limited to: 1) It is impossible to infer gender with any degree of accuracy, and the attempts to do so in the code in this repo are very poor, yielding noisy and misleading results. The code is broken on several levels. 2) The API hands back a binary gender option with no confidence estimation or anything to imply that it is a fuzzy, biased assertion. 3) there aren't really valid use-cases that aren't better fulfilled by other analysis

This isn't a "childish demand"--but attempting to dismiss valid concerns by characterizing it as one instead of considering the substance of the argument is a pretty unhelpful way to contribute to the discussion.

h4rm0n1c commented 8 years ago

There's no need to address the points, if you don't like the implementation, make a better version and then make a PR.

But I know that the Social Justice version of this "disputed" functionality will either be: 1) Infinitely Worse, 2) A joke, or 3) Non-existent.

Or, you know... rejected outright as a pointless waste of time and effort.

Don't sit here and make demands of open source projects: do it yourself.

Also, I'd like to give a shoutout to Franz Kafka and the contributions his adherents have made to this conversation, namely: the traps.

Actually, while I'm here, do you really think the project admins are so stupid as to not notice a mob of nutters making issues like this?

What is this? https://github.com/nlp-compromise/nlp_compromise/issues/122

Personally, I believe in personal liberty, admins, do as YOU please, ignore everything here, even me if you wish, but don't bow to bullying or allow yourselves to be misled by people with agendas that detract from actually writing code.

Your move, SJWs.

dpyro commented 8 years ago

"The mistake...is assuming the [bug-openers] are arguing in good faith”

> looks at GitHub profile
>
> 0 contributions
> Fake name
> Fake photo
> Registered today just to participate in this conversation
>
> Hmmnmnmmnmn.

Because people are getting tired of being no-platformed, censored, harassed, doxxed, and blacklisted from their jobs because they dared have a different opinion than some SJW crybully. Some of us are absolutely tired of Code-of-Conduct offendatrons who contribute nothing but discord and disruption to OSS projects.

Submit a pull request, or fork the project. Don’t sit crying that the English language operates in a different reality than the one you want.

ghost commented 8 years ago

I'm not going to bother responding to the ad-hominem ranters. Clearly this is just a humble issue request, and nobody has to do anything. If I were going to personally use this library, I'd probably submit a PR for this. As it is, I just came across this on HN and one of the examples caught my eye in that "uh, how could that possibly work?" sort of way.

Anyway, my concern has nothing to do with "social justice", whatever you think that means. It's a piece of code that seems unable to work, to support an API that seems to have only misguided uses. I'd prefer to push that toward a better future if I can. But hey, I'm just an internet rando, this is really up to the project maintainers of course.

h4rm0n1c commented 8 years ago

A humble issue request on a project that made it to the #1 rank on hacker news yesterday and has attracted similarly themed but poorly targeted versions of this request, all with the kinds of buzz-words that "that group" always uses when they're pushing the same old, tired, agenda.

Yeah, nothing untoward or disingenuous going on here at all... I (LISTEN AND) BELIEVE YOU.

Your group has pushed this nonsense too far, people are noticing, that's why I'm here, I noticed the crazy and came to call it for what it is rather than let ignorance of these charlatans allow them to cause any more damage.

dpyro commented 8 years ago

> I'm not going to bother responding to the ad-hominem ranters. Clearly this is just a humble issue request, and nobody has to do anything. If I were going to personally use this library, I'd probably submit a PR for this. As it is, I just came across this on HN and one of the examples caught my eye in that "uh, how could that possibly work?" sort of way.

“I don’t actually have a valid argument, so I’ll ignore inconvenient statements. Even though not everyone lives in my bubble where I can pick and choose my gender depending on my mood, I should be allowed to push my beliefs and ideas onto other people.”

> Anyway, my concern has nothing to do with "social justice", whatever you think that means. It's a piece of code that seems unable to work, to support an API that seems to have only misguided uses. I'd prefer to push that toward a better future if I can. But hey, I'm just an internet rando, this is really up to the project maintainers of course.

Look, if it doesn’t work for you, improve it, maybe just make a spec, do something. Don’t just drive-by concern-troll some random project that offended you and ask it to neuter itself only because it got popular on some hip site. You are destroying the good faith and good will extended to members of our communities by using a technical forum to push social beliefs.