koaning / scikit-fairness

this repo might get accepted
https://scikit-fairness.netlify.app/

Collaboration on Boston Housing case study? #31

Open kevinrobinson opened 4 years ago

kevinrobinson commented 4 years ago

hello! I found out about this work from your conversation in https://github.com/fairlearn/fairlearn/issues/406, so thanks for bringing it up over there đź‘Ť

I see some work on https://scikit-fairness.netlify.app/fairness_boston_housing.html, which seems great in that it makes an attempt to include some wider sociotechnical scope. It just so happens I started working on something similar this week! :)

One of my own assumptions is that no existing tools embrace the sociotechnical aspect of fairness work, and a key challenge is showing what this kind of interdisciplinary work actually looks like. I'm hoping that working through a concrete scenario could help clarify where existing fairness tools encourage practitioners to avoid the sociotechnical nature of the work, and instead find some promising new directions to explore for design and research.

In research-y terms, given existing research-practitioner gaps in fairness tools (eg, Holstein et al. 2019; Madaio et al. 2020), what kinds of tools can we explore that would move us closer to embracing the sociotechnical nature of ML fairness work (eg, Selbst et al. 2019; Jo and Gebru 2020)?

I've started working on this just on my own, and I picked the "Boston housing" dataset because it's widely used for ML education, deals with an extraordinarily contested sociotechnical context, and I couldn't find any educational material online that even acknowledged that. This kind of blew my mind :)

If you're interested, I'd love to chat and see if there are ways to collaborate on this!

koaning commented 4 years ago

You may enjoy this: https://youtu.be/Z8MEFI7ZJlA?t=766. It starts with boston and sort of ends with the idea we had for this project.

koaning commented 4 years ago

Also relevant to mention: if I recall correctly the boston dataset is now being removed from scikit-learn due to the controversy.

koaning commented 4 years ago

Also, we figured it relevant to not just have boston as an example. There are some other relevant datasets too.

https://scikit-fairness.netlify.app/api/datasets.html

The coolest feature of these functions? When you load them, they raise a FairnessWarning.
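
Roughly, I'd expect usage to look something like the sketch below. (The import paths and the loader's `return_X_y` signature here are assumptions, not the package's confirmed API; see the datasets page above for the real names.)

```python
import warnings

# Hypothetical import paths -- the actual module layout of scikit-fairness
# and the loader signature may differ from what is shown here.
from skfair.datasets import load_boston
from skfair.warning import FairnessWarning

# Loading the dataset is expected to emit a FairnessWarning that flags
# the ethical context of the data instead of failing silently.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X, y = load_boston(return_X_y=True)

for w in caught:
    if issubclass(w.category, FairnessWarning):
        print(f"FairnessWarning raised on load: {w.message}")
```

The warning isn't there to block anyone; it's there to make the caveats hard to miss at the moment the data enters the pipeline.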

kevinrobinson commented 4 years ago

@koaning ah thanks! I didn't know about this talk, and I don't know about any controversy either, so I would appreciate other links :) If I'm reading GitHub properly, this is your talk, so I would definitely love to chat more!

I think there's a lot more to the sociotechnical context in Boston around when the Harrison and Rubinfeld (1978) paper was published :) It would be interesting to explore what tools or talks would look like if they embraced that sociotechnical complexity, even if that made the work significantly more interdisciplinary and challenging. While it's easy to remove a column from a dataset, it seems to me that exploring race is a critical part of working in that particular sociotechnical context (eg, https://duckduckgo.com/?q=housing+boston+1970s to start). Rather than being problematic, it seems full of educational potential.

I can share what I'm working on and see what you think, but I'm also trying to be a respectful contributor and happy to help iterate forward from https://scikit-fairness.netlify.app/fairness_boston_housing.html as well. No worries if not, and thanks for sharing your awesome work either way :)

EDIT: To use the language from the end of your talk, what would it look like if we built tools for those tailors to make suits in different sociotechnical contexts? I'm optimistic that intellectual traditions and approaches from other disciplines can inoculate against some of these kinds of over-generalizations or "abstraction traps."

koaning commented 4 years ago

Here's the discussion on boston housing in sklearn.

My main gripe with the boston housing dataset is that it is used so often (here's an example) as a dataset to explain how to optimise for mean squared error without spending any effort on just looking at the variables. As I've mentioned in my talk, this is artificial stupidity bound to happen. We can't have people design systems this way. But I digress ...
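
To make that concrete, the pattern being criticized looks roughly like this (a sketch only; `load_boston` has since been deprecated and removed from newer scikit-learn releases): load, fit, report a score, and never once ask what the columns encode.

```python
# The "grab data, run three lines of code" pattern: optimise mean squared
# error without ever inspecting the variables (including the racial
# composition column `B`) or the context they came from.
from sklearn.datasets import load_boston  # deprecated and later removed from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))
```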

When we started with scikit-lego it was the obvious candidate to tackle in the documentation because it is such a well-known dataset. It got us some "shock value", which made people listen and may have even gotten some of them to think again. In my mind, this method of education is part of what industry practitioners need: a story they will remember that may make them think again. (I'm a huge nerd on this topic by the way; exhibit A, exhibit B, exhibit C.) But I digress ...

That said, I think that in the current docs we're overusing it. All the fairness methods from the lego project are used on the dataset, which puts the focus on the methods. I think the dataset deserves some historical context, and I'd argue it'd be super beneficial to explain how that dataset got created and also how it got so popular. If this is what you mean by sociotechnical context, then I highly agree!

koaning commented 4 years ago

In terms of collaborating on boston, I'm certainly open to the idea but I prefer to wait until we've fully checked in with the fairlearn project over here.

kevinrobinson commented 4 years ago

@koaning Awesome, thanks for the scikit-learn links! These are super helpful to read through, and I hadn't seen any of that when searching around online. I'm in strong agreement about it being unfortunate there are so many "grab data, run three lines of code" examples, and would love to explore tools that would encourage deeper ways to approach datasets. I also agree that this is fundamentally an education problem, and love your drawings and stories in the Goodhart, Bad Metric post (although those links all pointed to the same post, I read around some of your other great talks too).

I hear that you would like to remove the Boston housing dataset and aren’t interested in working on it further, so I’ll close the issue. I'm watching the repo, so if there are other ways you decide to move forward, I'll be curious to see if there are ways to collaborate. 👍

Separately, I think we're interpreting "sociotechnical context" in different ways, so I thought I'd share a bit about that. It might be that this is different from what you're interested in exploring in this project, and folks have different perspectives on what matters to them about fairness :) I hear you using "sociotechnical context" in the sense of "the history of how engineers have come to use the dataset, why it became commonly used, etc." That is interesting for sure! And I agree there are a bunch of methodological issues encoded in the dataset (eg, Gilley and Pace 1996; Pace and Gilley 1997; Bivand 2015; Bivand 2017), and it might be interesting to explore the sociology of how such a dataset has spread. But I mean "sociotechnical context" in the sense of "fairness is not a narrow technical question," in a way that I think is similar to the spirit of a lot of your talks :) So when I'm talking about tools that embrace sociotechnical complexity, I mean exploring how tools could support people doing that work.

One path might be exploring ways to more explicitly model the sociotechnical context, dynamics, and feedback loops (https://github.com/google/ml-fairness-gym, https://github.com/zykls/whynot), and use those tools to explore "applications, not models". Another path, the one I'm thinking about here, is more about how tools could help people without domain knowledge and expertise understand the social and historical context of the domain they're working in. In the same way you've shared some amazing stories in The Profession of Solving the Wrong Problem, are there ways to build guardrails or on-ramps that support exploring more about the sociotechnical context the dataset itself is embedded within? I see this as in the spirit of what Selbst et al. (2019) describe in longer form. Of course, none of these tools are as good as things like building knowledge, gaining experience, listening to talks with memorable cautionary tales, or collaborating with senior colleagues modeling good practice :)

Anyway, I'd be happy to brainstorm and chat more if you're interested, and I'm glad to have discovered this repo and your interesting talks and slides. I'll keep following along, and thanks! :)

koaning commented 4 years ago

@kevinrobinson to clarify, I think removing boston from scikit-learn is a good thing, but I intend to keep it in scikit-fairness/scikit-lego. It is a dataset that serves well as a fairness example and is therefore useful for some basic benchmarking of fairness methods.

I'll re-open the issue :) but we should still wait until the discussions with the fairlearn project have matured.

koaning commented 4 years ago

fairness is not a narrow technical question

Ahh, you meant that angle. This is certainly valid too. I suppose my attitude here is that all the tools we provide can be part of a solution, but the tools themselves will not understand the problem for you. Especially in the case of fairness issues, I wouldn't want to suggest they are more than a remedy either. The "are we solving the right problem" and "are we introducing a new problem by solving the original problem in a certain way" issues will never be settled by a few mere lines of Python code.

kevinrobinson commented 4 years ago

@koaning gotcha! Yeah, I'm trying to discover whether there are open source communities working on fairness that have an interest in exploring more interdisciplinary approaches. There is a lot of social and historical context around housing in Boston in the 1970s that is very relevant to fairness questions. Of course this context is more complex than a few lines of code :)

This writing resonated with me, particularly:

It takes a team a lot of time to realise what problem it is actually solving. Getting this right is hard. It’s even harder when the majority of the team consists of the same people. Worse: these people are trained in machine learning and prefer to keep themselves to the algorithmic part of their work. ... If we want this field to survive a winter (or an economic downturn) it might help if the field gets better at recognising that we need to worry less about the algorithm and more about its application in a system. To do this we need more than mere “data scientists”. We need engineers, managers, user interface designers, web programmers, domain experts and people who can come up with ways to improve a business by thinking creatively.

The future of our applied field is not in our new algorithms. It is in remembering all the old things we used to do with algorithms before there was hype.

So as a thought experiment, what critiques might that interdisciplinary team make of https://scikit-fairness.netlify.app/fairness_boston_housing.html? I understand if that's not a direction you want to work in, and no worries if there is no overlap :)

koaning commented 4 years ago

Nice! I am very happy to hear that you enjoyed that part in particular. I think it's a key point.

I actually think that these discussions can be summarised and hosted in the documentation. I'd argue, though, that at some point you could also write a book on the topic. Certain stories should be shared in the docs, while others are better served by a book.

So ... let's indeed brainstorm on that team. Imagine that we'd do the boston housing dataset, but that there were not just computer science graduates contributing but also:

It's not too hard to imagine that the application of the machine learning model improves if you have different backgrounds contributing. There are just way fewer blind spots.

kevinrobinson commented 4 years ago

Yeah! This is kind of the direction I'm trying to think on, thanks for playing along :) I'm sort of wondering if we can manually sketch out what that kind of team would do, embracing that work being interdisciplinary, sociotechnical, and not rooted in Python code alone. And then look back and make some super productive critiques where we ask "how meaningfully would tool X contribute to this team's work?" I'm hoping that can be 1-3 pages max, similar in length to a Jupyter notebook tutorial or short blog post.

I also think that the particular historical and social context of the Boston area in the 1970s is very much not a neutral example :) There was national legislation on housing discrimination passed during that time period, constitutional challenges, consent decrees aimed at residential desegregation, and protests and violence. It was all very contested.

If we were going to make a scenario with that dataset, I also think we'd need to figure out what the "team" is asked to do in that scenario. The original Harrison and Rubinfeld (1978) paper uses the dataset to estimate the price of air quality. In all the tutorials I've seen, the dataset is used to predict the median value of owner-occupied homes in a census tract, and I've always guessed that must have been similar to how the dataset was used in "4.4 Robust Estimation of a Hedonic Housing-Price Equation" in the Belsley et al. textbook (1980). But I can't access that source, so I'm not sure. I'm not sure we need to stick with the historical usage of the dataset, but I do think it needs to be a real scenario if we want to provide a meaningful alternative to "lol 3 lines and I can haz ML" :)

If we were actually going to estimate house prices, I think the scenario would need to motivate why we are doing this at a level that's aggregated to census tracts (bizarrely, no tutorial I've seen even addresses this basic question, although some academic papers do). It would also need to articulate where we hope to use this model (eg, is the aim to make a model that generalizes to 1980 census data?). Of course, by asking those questions about application and purpose, we introduce a lot of messy details :) And it's likely that our interdisciplinary team would end up recommending that we not use this dataset, or take another approach altogether and use different data sources, etc. I think it's fine if that does happen, as it might make for a truly valuable cautionary tale, in the style of what I see in your talks. And I haven't even gotten into how such an interdisciplinary team might contest what's happening in terms of fairness more broadly.
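
As a small aside for anyone following along, even just looking at what the rows and the target actually are makes the aggregation question visible. This is a sketch using the older scikit-learn loader, which has since been removed; the column meanings are paraphrased from the dataset's own bundled description.

```python
# Inspect the dataset before framing a prediction task: each row is a
# Boston-area census tract from the 1970 census, and the target MEDV is
# the median value of owner-occupied homes in that tract (in $1000s).
import pandas as pd
from sklearn.datasets import load_boston  # deprecated and later removed

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target

print(boston.DESCR)     # the bundled description of each variable
print(df.describe())    # summary statistics, one underlying row per tract
```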

Separately, we could then ask if there are ways to explore tools that would help surface the kinds of issues that our interdisciplinary team highlighted (eg, Madaio et al. 2020; Raji et al. 2020). The scenario serves as a "test case" for those different methodologies. Of course this is all a bit of a simulation or fiction, and that is quite strange :) It would be much better to do this in collaboration with real interdisciplinary teams, but unfortunately I don't yet know of groups that are actually working that way :) So this is sort of trying to prototype concretely what that kind of interdisciplinary collaboration would look like, as a way to pave a path for that work to become more possible, a kind of rigorous speculative fiction perhaps :)

Anyway, I'm happy to brainstorm more either on the substance or format. As a terrible draft, we could:

  1. pick a problem scenario
  2. make a fictional interdisciplinary team like you outlined
  3. write out what that interdisciplinary team would have to do to truly inspire us about excellent work in ML fairness
  4. critique what we see: how well existing tools would serve their needs, what opportunities there are for new kinds of fairness tools, how well this aligns with HCI research.
  5. iterate on it, or try another kind of scenario

I'm happy to try to help with something more concrete if you have a different idea in mind as well :) Thanks! đź‘Ť

koaning commented 4 years ago

You raise some interesting things here. I did not know that the dataset was originally meant for measuring clean air. That really puts things in perspective.

A few things that pop to mind.

Out of interest, are you aware of models in production, built by people you personally know, that make use of fairness methods? How many data teams do you know that are truly interdisciplinary? I've been a consultant for six years (granted, in the Netherlands) and have never met a truly mixed team. Instead, I (or sometimes team members) always tactically hung out at the watercooler to ask for other people's opinions (which they always gladly gave) as a proxy.

koaning commented 4 years ago

It might be relevant to add this as context: I host a second blog over at calmcode.io and I've been meaning to add fairness content to it. Before starting a proper chapter on scikit-learn there, I'd love to be able to refer to fairness tools. I'll gladly mention any blog post that discusses the historical context.

kevinrobinson commented 4 years ago

@koaning Awesome! Yeah I think the point you raise that "housing models based on US census data are unlikely to generalize to other places outside the US" sounds exactly like something I would be so excited to see in an example of ML fairness work :)

I also think your point on how to frame the thoroughness of the example is important, and more generally that it's important that examples can productively model how practitioners can take incremental steps towards raising the rigor and quality of their work. In the same way that some engineering work doesn't really need tests, but some other engineering work needs exhaustive testing, it's helpful to illustrate a range of approaches and how practitioners can move between them.

I think this speaks to your point about whether any data teams are truly interdisciplinary, and of course my understanding is that this is very rare :) In my own experience I have done this in limited ways, but that has been on work done outside of typical industry settings. For me the point of this kind of work in developing tools is to help folks raise the quality and rigor of their work. Part of that is meeting their stated needs, but part of that is also demonstrating a new way of working, along with a memorable story and iterative steps that seem possible to take right away.

kevinrobinson commented 4 years ago

Anyway, I'll put together a draft on the Boston and Ames housing datasets, in the spirit of what we've chatted about here. If you're up for reviewing, that would be super helpful and awesome! 👍

koaning commented 4 years ago

Sure thing, ping me when you get around to it :)