sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Group 14 - Project description #78

Closed andreaslangholz closed 8 years ago

andreaslangholz commented 8 years ago

How to get a place in Copenhagen?

An analysis of the market for rented apartments and houses in Copenhagen including topics as

The analysis will be based on a panel data set scraped from www.boligportalen.dk containing the variables rent, date, accommodation type, m^2, location, description, title and a dummy indicating whether or not the apartment has been reserved. If possible, this data could be merged with an administrative data set containing information about the individual buildings (year of construction, extensive renovations etc.), or with a data set containing information on distance to public goods (transportation, recreational areas etc.), which might affect the price. Data.kk.dk has several data sets on public goods (playgrounds, cultural centers, lakes, green areas, use of streets).
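One way the "distance to public goods" variable could be built is sketched below: a great-circle (haversine) distance from each listing to the nearest amenity. The coordinates are made up for illustration, and the function names `haversine_km` and `nearest_km` are our own; this assumes both the listings and the Data.kk.dk records can be geocoded to latitude/longitude.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_km(listing, amenities):
    """Distance from one (lat, lon) listing to the closest (lat, lon) amenity."""
    return min(haversine_km(listing[0], listing[1], lat, lon) for lat, lon in amenities)

# Hypothetical coordinates roughly inside Copenhagen, for illustration only.
listing = (55.6761, 12.5683)
playgrounds = [(55.6900, 12.5500), (55.6770, 12.5700)]
distance_to_playground = nearest_km(listing, playgrounds)
```

The resulting distance could then be attached to each listing as one more explanatory variable.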

2) Data limitations:

The apartments advertised on boligportalen.dk are obviously not a random sample. Can we argue that this data is nevertheless representative of the apartment market in Copenhagen? Moreover, we don’t know the actual rent; we only know the asking price. However, as we scrape the data on different days and build a panel data set, we observe which apartments have been removed from the page and which have been marked “reserved”. This might give us an indicator of which apartments are quickly off the market (that is, advertised at a price that renters are willing to pay) and which apartments might be too expensive.
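A minimal sketch of how the daily scrapes could be turned into such an indicator, assuming each scrape yields one record per listing with a stable listing id (the tuple layout, the ids and the dates below are hypothetical):

```python
from datetime import date

# Hypothetical scrape snapshots: (scrape_date, listing_id, reserved_flag).
snapshots = [
    (date(2015, 11, 1), "A", False),
    (date(2015, 11, 1), "B", False),
    (date(2015, 11, 2), "A", True),
    (date(2015, 11, 2), "B", False),
    (date(2015, 11, 3), "B", False),
]

def summarise(snapshots):
    """Per listing: days observed, ever marked reserved, gone before the last scrape."""
    last_scrape = max(d for d, _, _ in snapshots)
    out = {}
    for d, listing_id, reserved in snapshots:
        rec = out.setdefault(listing_id, {"first": d, "last": d, "reserved": False})
        rec["first"] = min(rec["first"], d)
        rec["last"] = max(rec["last"], d)
        rec["reserved"] = rec["reserved"] or reserved
    for rec in out.values():
        rec["days_observed"] = (rec["last"] - rec["first"]).days + 1
        # A listing counts as removed if it is missing from the most recent scrape.
        rec["removed"] = rec["last"] < last_scrape
    return out

summary = summarise(snapshots)
```

Listings that are reserved or removed after few observed days would be candidates for "priced at what renters are willing to pay"; listings still present at the last scrape are censored, which the analysis would need to account for.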

3) Data cleaning:

As rent and kvm (size in m^2) are numeric variables, and location and accommodation type are drop-down lists on the web page, some of the data cleaning will be straightforward. If we choose to include a textual analysis of the title and description, the process will obviously be more complicated. If we want to merge the data from boligportal.dk with some kind of administrative data, we are going to spend some time making sure that the street names are spelled the same way in both data sets.
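The street-name matching could start from a normalisation step like the sketch below, applied to both data sets before merging. The specific rules (treating "aa" as "å", expanding "Gl." to "Gammel", dropping the accent in "Allé") are our assumptions about likely spelling differences, not rules taken from either data source, and would need checking against the real names.

```python
import re

def normalise_street(name):
    """Lower-case a street name and unify a few assumed Danish spelling variants."""
    s = name.strip().lower()
    s = s.replace("aa", "å")                  # assumed: older "aa" spelling equals "å"
    s = s.replace("é", "e")                   # "Allé" vs "Alle"
    s = re.sub(r"\bgl\.?\s+", "gammel ", s)   # "Gl. Kongevej" vs "Gammel Kongevej"
    s = re.sub(r"[.,]", "", s)                # drop stray punctuation
    s = re.sub(r"\s+", " ", s)                # collapse repeated whitespace
    return s

key = normalise_street("  Gl. Kongevej ")
```

Merging on the normalised key rather than the raw string should reduce unmatched rows; names that still fail to match would need manual inspection.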

4) Data Visualization:

Maps of Copenhagen showing differences in price and turnover seem an obvious choice. Here we aim to map both general price levels and affordability, and possibly also our modelled predictions of price changes.

Furthermore, different plot types covering both the characteristics of the data and the forecasts of the prediction models will be used to tell a comprehensive story of the rental market.

5) Statistical Learning:

Using a limited number of variables from boligportal.dk (postal code, kvm, location) we could set up a relatively simple prediction model designed to infer the relationship between the asking rent and this labeled data. If we manage to attach information on distance to public goods or the condition of the building, these variables could be included.
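As a sketch of the simplest version of such a model, the snippet below fits asking rent on kvm alone by ordinary least squares. The numbers are invented for illustration, and `fit_line` is our own helper; a real model would add postal code, location and the other variables.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical listings: size in kvm and asking rent in DKK per month.
kvm = [40, 55, 70, 90]
rent = [6000, 7500, 9000, 11000]

intercept, slope = fit_line(kvm, rent)
predicted_rent_60_kvm = intercept + slope * 60
```

Here `slope` is the estimated extra rent per additional square metre; comparing predicted and asking rent per listing would be one way to flag apartments that look expensive for their size.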

A textual analysis of the relationship between the wording of the description and the rent could also be carried out. Here we would examine whether specific wording is used in the descriptions of high-priced rental apartments compared to lower-priced ones, when other factors are controlled for, using supervised learning algorithms such as Ridge and Lasso.
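For reference, one possible formulation of the two estimators is given below. The notation is our own assumption: $r_i$ is the asking rent of listing $i$, $w_i$ a vector of word counts from its description, $z_i$ the control variables (kvm, postal code etc.), and $\lambda \ge 0$ a penalty chosen e.g. by cross-validation. In this sketch only the word coefficients $\beta$ are penalised.

```latex
\hat{\beta}^{\text{ridge}}, \hat{\gamma}
  = \arg\min_{\beta,\gamma} \sum_{i=1}^{n}
    \left( r_i - w_i^{\top}\beta - z_i^{\top}\gamma \right)^2
    + \lambda \sum_{j=1}^{p} \beta_j^{2}

\hat{\beta}^{\text{lasso}}, \hat{\gamma}
  = \arg\min_{\beta,\gamma} \sum_{i=1}^{n}
    \left( r_i - w_i^{\top}\beta - z_i^{\top}\gamma \right)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Because the Lasso penalty can set coefficients exactly to zero, the words with non-zero $\beta_j$ would be the candidates for wording associated with higher or lower rent.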

Further, other learning methods might help to describe the market, e.g. a decision tree depicting what determines the rent level of an apartment, or, as an unsupervised method, a dimensional collapse into an index for rental markets using principal components.