sebastianbarfort / sds

Social Data Science, course at University of Copenhagen
http://sebastianbarfort.github.io/sds/

Project description - group 7 #59

Closed neiljg closed 8 years ago

neiljg commented 8 years ago

---
title: "Project description"
output: html_document
---

Project idea

In this project we have chosen to consider the highly important economic factor of unemployment. A classic problem with traditional unemployment statistics is the lack of a dynamic perspective on the current unemployment situation. In Denmark, we are limited by the fact that the AUK unemployment data[^1] is published only once every three months, at the end of each quarter, which impairs any finer-grained analysis of immediate unemployment.

In this context, it is worth asking whether the enormous amounts of information available on the internet can be harnessed to build a better and more up-to-date model for forecasting unemployment. The potential of such statistical enhancement is particularly interesting in times of economic crisis, when traditional flows of information are too slow (lagged) to provide a sound basis for policy-makers implementing decisions. As mentioned, official statistical data is published only infrequently, but a further problem is that it also fails to accurately reflect structural changes in the economy.

In our project, we will focus on applying Google search data to construct more informative unemployment statistics. We hope that these variables will be useful for improving the unemployment forecast.

This approach has previously been employed by Nikos Askitas and Klaus F. Zimmermann in the paper "Google Econometrics and Unemployment Forecasting". Our contribution will be to examine whether forecasts of monthly unemployment can be improved by applying social data science tools and methods, facilitating the inclusion of up-to-date data sources such as Google search data, in a Danish context.

[^1]: Danmarks Statistik (Statistics Denmark).

Data collection

The project has two main data sources. The first one is Statistics Denmark, where the data on unemployment will be gathered using their API.
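As a concrete starting point, the unemployment series can be requested over plain HTTP from the StatBank API. The sketch below is in Python for self-containment (the same POST request can be made from R with e.g. httr); the table id `AUS07` is a placeholder assumption, and the actual table name for the AUK data should be looked up on statistikbanken.dk:

```python
import json
from urllib import request

# Hypothetical table id -- replace with the actual AUK unemployment
# table name from statistikbanken.dk.
TABLE_ID = "AUS07"
API_URL = "https://api.statbank.dk/v1/data"

def build_query(table_id, fmt="CSV"):
    """Build the JSON body for a StatBank data request.

    Requests all time periods; further variable selections
    (region, seasonal adjustment, ...) can be appended.
    """
    return {
        "table": table_id,
        "format": fmt,
        "variables": [
            {"code": "Tid", "values": ["*"]},  # Tid = the time dimension
        ],
    }

def fetch_unemployment(table_id=TABLE_ID):
    """POST the query and return the raw CSV response as text."""
    body = json.dumps(build_query(table_id)).encode("utf-8")
    req = request.Request(
        API_URL, data=body,
        headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

The returned CSV can then be parsed into the monthly series used in the models below.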

The second source is Google Trends, which can be queried using the R package "RGoogleTrends". There seem to be many different ways to approach the data collection, so this is just one suggestion.
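One practical detail is that Google Trends reports weekly values, while the unemployment series is monthly, so the search data must be aggregated before it can enter the model. A minimal sketch of one possible aggregation rule (the weekly mean per month is our own assumption, not something prescribed by Google Trends), again in Python for self-containment:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def monthly_index(weekly):
    """Collapse a weekly Google Trends series into a monthly index.

    `weekly` maps the Monday of each reporting week (datetime.date)
    to a search-intensity value (0-100, as Google Trends scales it).
    Each week is assigned to the month its Monday falls in, and the
    monthly index is the mean of those weekly values.
    """
    buckets = defaultdict(list)
    for day, value in weekly.items():
        buckets[(day.year, day.month)].append(value)
    return {ym: mean(vals) for ym, vals in sorted(buckets.items())}

# Example with made-up values:
weekly = {date(2015, 1, 5): 40, date(2015, 1, 12): 60,
          date(2015, 2, 2): 80}
index = monthly_index(weekly)   # {(2015, 1): 50, (2015, 2): 80}
```

Other rules (taking the last week of the month, or weighting weeks that straddle month boundaries) are equally defensible and could be compared.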

Furthermore, it might be useful to look at data from social networks. This could mean scraping a Facebook group for people searching for jobs, the Twitter feed for the hashtag #jobdk, or perhaps something from LinkedIn.

Statistical method and models

The main line of the statistical approach will be to use a time series $\{y_t\}_{t=1}^T$ of traditional unemployment statistics supplied by Danmarks Statistik. For the series under consideration we will choose an ARMA(p,q) model and use this model to forecast the series. For simplicity, let us imagine that the chosen model is AR(1):

$$y_t = \mu + \rho y_{t-1} + \epsilon_t$$

Using Google Trends data we will then construct an index series $\{x_t\}_{t=1}^T$, which we will use as an external regressor in an expansion of the ARMA(p,q) model. In the AR(1) case we then have the model:

$$y_t = \mu + \rho y_{t-1} + \beta x_{t-1} + \epsilon_t$$

The statistical models will then be compared with respect to forecasting ability, for example using a Diebold-Mariano test. Another possible extension would be to include a variance model in order to capture possible ARCH effects and see how this influences the comparison of the model forecasts.
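The planned comparison can be sketched end-to-end on simulated data. The toy example below, in pure Python so it runs without any packages (in the project itself this would be done with R's time-series tooling), fits both the AR(1) and the AR(1)-plus-index regressions by OLS and compares their one-step root mean squared forecast errors; all numbers are simulated, not real unemployment data:

```python
import random
from math import sqrt

def ols(X, y):
    """Solve the normal equations (X'X)b = X'y by Gaussian elimination."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for c in range(k):                      # elimination with pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv], b[c], b[piv] = A[piv], A[c], b[piv], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for cc in range(c, k):
                A[r][cc] -= f * A[c][cc]
            b[r] -= f * b[c]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):          # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return beta

def rmse_one_step(y, x=None, train=0.8):
    """Fit on the first `train` share, one-step forecast RMSE on the rest."""
    n = len(y)
    rows = [[1.0, y[t - 1]] + ([x[t - 1]] if x is not None else [])
            for t in range(1, n)]
    targets = [y[t] for t in range(1, n)]
    split = int(train * (n - 1))
    beta = ols(rows[:split], targets[:split])
    errs = [targets[t] - sum(b * v for b, v in zip(beta, rows[t]))
            for t in range(split, n - 1)]
    return sqrt(sum(e * e for e in errs) / len(errs))

# Simulate y_t = 1 + 0.5 y_{t-1} + 0.8 x_{t-1} + eps_t
random.seed(1)
T = 400
x = [random.gauss(0, 1) for _ in range(T)]
y = [0.0] * T
for t in range(1, T):
    y[t] = 1.0 + 0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 0.3)

print(rmse_one_step(y))       # AR(1) only
print(rmse_one_step(y, x))    # AR(1) plus the index
```

Because the simulated index really does drive the series, the augmented model should achieve the lower forecast RMSE here; on real data, establishing whether that gap is statistically significant is exactly what the Diebold-Mariano test is for.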