I found the speed dating problem really intriguing. After examining the data, I am convinced that it is sufficiently messy (different scales, mixed types, and missing values).
3 Things I like:
You hypothesized that certain factors will automatically influence compatibility. It would be interesting to see which factors are truly important for a speed date.
You proposed ways to pre-process the data, such as rounding to the nearest 5. It's important to pre-process the data to normalize results and minimize noise.
You brought up the fact that the data might not be randomly selected. I think it's a valid point that this data was not collected from a fully representative sample.
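To check that I understood the rounding step mentioned above, here is how I pictured it, as a minimal Python sketch. The function name `round_to_step` and the choice to round halves up are my own assumptions, not something taken from your proposal.

```python
def round_to_step(value, step=5):
    """Round a numeric value to the nearest multiple of `step`.

    Halves round up (e.g. 12.5 -> 15 for step=5). This tie-breaking
    rule is an assumption; the proposal does not specify one.
    """
    return int((value + step / 2) // step) * step


# Example: the same raw scores bucketed at two granularities.
scores = [7, 8, 12.5, 23]
by_five = [round_to_step(s, 5) for s in scores]
by_ten = [round_to_step(s, 10) for s in scores]
```

Keeping the step as a parameter would make it easy to try several bucket sizes on the same column.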
3 Things I noticed:
I was wondering whether the metrics in the data are comprehensive enough to support a valid prediction. There are objective factors, but people are also irrational about relationships.
How would you pre-process and normalize the categorical data? Some entries do not seem to be consistent (e.g. NYC vs New York).
How can you evaluate the design choices for pre-processing (e.g. will rounding to the nearest 5 be better than rounding to the nearest 10)?
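On the categorical question above, one approach I could imagine is trimming and lowercasing each value and then mapping known variants to a single label. This is only a sketch: the `ALIASES` table and the `normalize_city` name are hypothetical, and the real list of variants would have to come from inspecting your data.

```python
# Hypothetical alias table; the actual variants must be collected from the dataset.
ALIASES = {
    "nyc": "new york",
    "new york city": "new york",
}


def normalize_city(raw):
    """Return a canonical lowercase label for a free-text city entry."""
    if raw is None:
        return None  # leave missing values for a separate imputation step
    # Trim, lowercase, and collapse repeated whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return ALIASES.get(key, key)


cleaned = [normalize_city(c) for c in [" NYC ", "New  York", "Boston", None]]
```

Values that are not in the alias table pass through in their cleaned form, so unmapped variants stay visible and can be added to the table later.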
Overall, I really liked this problem. I'm very excited to see the results!
Thank you for your feedback! We are definitely facing some data consistency issues. We're addressing them on a new copy of the data and documenting each change along with its justification.