Fix speeddating - Githubissues

janvanrijn commented 6 years ago

As raised by @mfeurer and discussed in the Skype call this morning, the speed dating dataset should be fixed.

Assigned myself for obvious reasons.

janvanrijn commented 6 years ago

The more I read in the Word document describing the features, the less certain I am that this is actually a classification dataset. Some thoughts:

The match attribute is barely described.
There seems to be no clear correspondence between the attributes in the version on OpenML and the version on Kaggle. Yes, some attributes are the same, but the names in the OpenML version are usually more descriptive and there are fewer of them (195 in the original, 123 in the OpenML version).
Most features come from a survey that is filled in by a single participant.
The participant also rates their 'dates' on these criteria. This seems to be the outcome studied in the paper (see abstract below).
From the description, it seems to me that each observation represents a single person filling in several questionnaires, and that there is no clear target to model.
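
One way to make the attribute mismatch concrete is to compare the attribute names of the two versions directly. The following is a minimal sketch, assuming the column names of both versions have already been loaded into plain Python lists. The function name and the sample attribute names are hypothetical and used only for illustration; they are not taken from either dataset.

```python
def compare_attribute_names(original, openml):
    """Return (shared, only_in_original, only_in_openml) as sorted lists.

    Names are compared case-insensitively after stripping whitespace,
    so purely cosmetic differences do not count as mismatches.
    """
    def normalize(name):
        return name.strip().lower()

    orig_set = {normalize(n) for n in original}
    openml_set = {normalize(n) for n in openml}
    return (
        sorted(orig_set & openml_set),   # present in both versions
        sorted(orig_set - openml_set),   # only in the original version
        sorted(openml_set - orig_set),   # only in the OpenML version
    )


# Hypothetical attribute names, for illustration only.
original_cols = ["iid", "gender", "match", "attr1_1"]
openml_cols = ["gender", "match", "attractive_important"]

shared, only_original, only_openml = compare_attribute_names(
    original_cols, openml_cols
)
print("shared:", shared)
print("only in original:", only_original)
print("only in OpenML:", only_openml)
```

An exact name comparison like this only finds attributes whose names survived unchanged, so renamed attributes (which appear to be common here) would still need to be matched by hand against the Word document.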

Abstract from "GENDER DIFFERENCES IN MATE SELECTION: EVIDENCE FROM A SPEED DATING EXPERIMENT"

We study dating behavior using data from a Speed Dating experiment where we generate random matching of subjects and create random variation in the number of potential partners. Our design allows us to directly observe individual decisions rather than just final matches. Women put greater weight on the intelligence and the race of partner, while men respond more to physical attractiveness. Moreover, men do not value women’s intelligence or ambition when it exceeds their own. Also, we find that women exhibit a preference for men who grew up in affluent neighborhoods. Finally, male selectivity is invariant to group size, while female selectivity is strongly increasing in group size.

Based on this, it seems that this dataset has been used for analytical purposes rather than classification, and I would propose to drop it.

mfeurer commented 6 years ago

I agree with your proposal to drop this dataset.

openml / benchmark-suites

Fix speeddating #30