sampathweb opened this issue 9 years ago
Ramesh,
Going to post the data here tomorrow. Is a sample OK? I think I read Alex mentioning that a sample csv in github is fine if the data set is large.
Breaking it down into a 4-part problem based on your feedback. If I can succeed in the first two, I'd consider it a success, with the latter two being extra credit in case I happen to blow through it really quickly.
1) Define clusters -- definitely agree on using PCA and other techniques to reduce the # of dimensions first.
The primary challenge for me is drawing these clusters in a lat-long space, but having the clusters be defined by a couple of other dimensions like price per sq foot and # of bedrooms. Is this feasible? Some sort of projection of clusters that live in n-dimensional space onto a lat-long plane? I'm also curious about how best to draw these clusters... would GeoJSON work for a typical mapping project?
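GeoJSON should work fine for this: most web mapping libraries (Leaflet, Mapbox, etc.) consume it directly. A minimal sketch of emitting one listing as a GeoJSON Feature, where the record's field names are invented for illustration:

```python
import json

# Hypothetical listing record; these field names are assumptions, not an actual schema.
listing = {"lat": 37.7793, "lng": -122.4193, "price_per_sqft": 950,
           "bedrooms": 2, "cluster": 3}

feature = {
    "type": "Feature",
    # GeoJSON coordinates are [longitude, latitude], in that order.
    "geometry": {"type": "Point", "coordinates": [listing["lng"], listing["lat"]]},
    # Non-spatial dimensions (and the cluster label) go in "properties",
    # so a map layer can color points by cluster.
    "properties": {k: listing[k] for k in ("price_per_sqft", "bedrooms", "cluster")},
}

collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection))
```

One file like this per cluster (or one file with a cluster property, as above) is enough for a styled map layer.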
Also, does this mean if I find, say, the 'Atherton' (near Palo Alto) and 'Piedmont' (in Oakland) clusters are similar, would they be 'Cluster 1A and Cluster 1B', or would they be part of the same 'Cluster 1', but the projection into lat-long space results in non-adjacent clusters? I would like to find out if k-Means can produce a cluster made up of several non-adjacent sub-clusters.
Basically, I'm considering either using lat and long as additional dimensions for the clustering (i.e. geographical proximity/clustering is important) or purely using home features only (# bedrooms, price per sq foot, lot size etc) for the clustering. In either case, the end-result is drawing the clusters in a lat-long plane.
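On the k-Means question above: if lat and long are excluded from the feature matrix, nothing ties a cluster to a contiguous geographic area, so one cluster label can absolutely appear as several non-adjacent patches when projected onto a lat-long plane. A small synthetic sketch with scikit-learn (the areas and numbers are made up to illustrate the effect):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two geographically distant areas (think Atherton vs. Piedmont) with similar
# home features, plus one cheaper area in between. Columns:
# lat, lng, price_per_sqft, bedrooms.
def area(lat, lng, ppsf, n=50):
    return np.column_stack([
        rng.normal(lat, 0.01, n), rng.normal(lng, 0.01, n),
        rng.normal(ppsf, 50, n), rng.integers(2, 6, n),
    ])

homes = np.vstack([area(37.46, -122.20, 1500),   # "Atherton"-like
                   area(37.82, -122.23, 1500),   # "Piedmont"-like
                   area(37.70, -122.10, 500)])   # cheaper area between them

# Cluster on home features ONLY (columns 2-3); lat/long are kept for plotting.
X = StandardScaler().fit_transform(homes[:, 2:])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The two expensive areas share one label despite being ~40 km apart.
# To draw it: plt.scatter(homes[:, 1], homes[:, 0], c=labels)
```

Including lat/long as (possibly down-weighted) extra columns is the other option mentioned above; that pulls clusters toward geographic contiguity.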
Yes, primary intent is an inference exercise, to identify clusters in data. I could probably convert it to a supervised learning classification problem by training the data with user-defined neighborhoods/clusters (i.e. these two/three/n homes are in the same neighborhood because users tend to search for those two/three/n together) if this sounds much more doable than the clustering problem?
2) Text-mine listing descriptions and/or geo-tagged social media posts to describe clusters -- I'm fascinated by how apps like Yelp, Glassdoor, Amazon, LinkedIn etc are able to pull out the 'key point/insight' in customer reviews, and would like to replicate that.
3) Use restaurant & bar data as another dimension in determining the clusters and/or as the labels instead in a classification exercise -- I basically want opening hours and type of establishment (luxury, dive etc). I could pull this from Yelp, Google Maps or OpenTable's API if available, or just our company-purchased restaurant map layers if that becomes too gnarly.
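For the API route, a sketch of what a Yelp pull might look like, modeled on Yelp's Fusion (v3) business search; the endpoint, parameter names, and `API_KEY` should all be verified against Yelp's current docs before relying on them:

```python
import requests

API_KEY = "YOUR_YELP_API_KEY"  # placeholder credential

def build_search_request(lat, lng, radius_m=1000):
    """Build (but don't send) a Yelp Fusion business-search request
    for restaurants and bars near a point."""
    return requests.Request(
        "GET",
        "https://api.yelp.com/v3/businesses/search",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"latitude": lat, "longitude": lng,
                "radius": radius_m, "categories": "restaurants,bars"},
    ).prepare()

req = build_search_request(37.7749, -122.4194)
# Send with requests.Session().send(req); the response JSON has a "businesses"
# list carrying categories and price level; hours require a follow-up detail call.
```

Rate limits and terms of use are the usual gotchas here, which is part of why sticking to company-purchased map layers may be the saner scope.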
4) Understand how the clusters evolve. If whatever I did for 1) didn't produce recommendations or a similarity distance metric, then also do this (i.e. 'if you like homes in Nob Hill cluster, you'll also like this random neighborhood in San Jose').
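One cheap route to that "if you like cluster X, try cluster Y" recommendation is distances between cluster centroids in the scaled feature space: the nearest other centroid is the suggestion. A sketch with toy centroids (the values and cluster identities are invented):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy centroids in a standardized feature space
# (e.g. price per sq foot, bedrooms, lot size).
centroids = np.array([
    [1.2, 0.5, -0.3],    # cluster 0, a "Nob Hill"-like profile
    [1.1, 0.6, -0.2],    # cluster 1, a similar profile elsewhere
    [-1.5, -0.8, 1.0],   # cluster 2, a very different profile
])

dist = cdist(centroids, centroids)   # pairwise Euclidean distances
np.fill_diagonal(dist, np.inf)       # ignore self-distance
nearest = dist.argmin(axis=1)        # most similar *other* cluster
print(nearest)  # cluster 0's best match is cluster 1, and vice versa
```

The same distance matrix doubles as a similarity metric for tracking how clusters drift between time periods.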
I'm very comfortable and have some fair experience with regression and classification techniques, and hence prefer NOT to scope this project around those since I can do a regression/classification project independently. I view this project as a challenge to explore what I DON'T dare to do: clustering and dimension-reduction in 1), text-mining in 2), API pulls in 3). I feel these were and are my blind spots entering the class, and I hope to knock out at least 2 of the 3.
This all sounds good. Pull some sample data and explore using plots. Thanks for detailed notes on where you want to go with the project.
Thanks, Ramesh Sampath
On Feb 10, 2015, at 12:22 AM, selwyth notifications@github.com wrote:
Ramesh, did this count as my final proposal, or am I supposed to do something else? I didn't see anything else to submit. Created a new repo for the project: https://github.com/selwyth/neighborhood
Your Project Desc: Using my company's real estate listings data/history for San Francisco and East Bay, attempt to define 'neighborhoods' in terms of similar-price clusters. Augment this with text mining based on listing descriptions or geo-tagged social media posts to describe these clusters and allow humans to make sense of them (e.g. "Oh what we think of as Nob Hill should have its western boundaries extended if we think in terms of home price affordability"). Then, for extra credit, attempt to predict future gentrification, trending neighborhoods and/or pockets of home price growth with the help of map layers (e.g. restaurants, crime, walkability, bike score, bars, demographics) from 3rd-party APIs or company-purchased data.
Initial Feedback: David, Looks like you have the data from work and the text data to learn more about the properties. Integrating that with third-party APIs is generally tricky and time-consuming. So, I would probably focus primarily on what data you already have and mix that with the text analysis on descriptions of the properties.
Unsupervised learning (clustering) is a hard problem in high dimensions. We can certainly try PCA and other dimension-reduction techniques so that we can visualize the data points in a 2-D space. Unsupervised learning also makes it difficult to test your model. Do you have any supervised learning you want to do with the data, or is the intent primarily to identify clusters in the data?
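The PCA-to-2-D step mentioned here is a few lines in scikit-learn; a sketch on synthetic stand-in data (scaling first matters, since price and bedroom counts live on very different scales):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a homes-by-features matrix.
X = rng.normal(size=(200, 6))

# Standardize, then project to the top two principal components.
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(X2.shape)  # (200, 2) -- ready for a 2-D scatter plot
# plt.scatter(X2[:, 0], X2[:, 1], c=cluster_labels) would then show the
# clusters in the reduced space (cluster_labels from a prior clustering step).
```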