sixhobbits / yelp-dataset-2017

Submission to the Yelp Dataset Challenge 2017

Yelp Dataset Challenge: Our paper, tutorials, and blog posts

This repository and related resources, to which we link below, form our submission to the 2017 Yelp Dataset Challenge.

With new advances in machine learning and artificial intelligence, there has been a surge of talk about democratizing AI (see, for example, https://news.microsoft.com/features/democratizing-ai/). It is important that everyone benefits from advances in this area, not only huge corporations.

As part of this process, we believe that new research should be as accessible as possible, to as many people as possible. Therefore, instead of presenting our work in a single format targeted at a specific audience, we present it in several formats for different audiences, including academics, beginners, and programmers.

Much of the presented research is related to authorship attribution. This is an interesting task with many practical applications (for example, detecting fake reviews, or deanonymising criminals online). However, many of the methods we use are generalizable to other text classification tasks, and, more broadly, to most machine learning tasks. We therefore present work not only related to authorship attribution, our field of interest, but also some introductory materials on machine learning and data visualisation in general.

Specifically, our submission consists of the following:

The code we used for the experiments in the paper can also be found in the other Jupyter Notebook files in this repository. However, these files are neither well structured nor well documented, and we do not recommend them as learning material.

We would like to thank Yelp and everyone involved in the Dataset Challenge for providing this opportunity and dataset, with which we have had a lot of fun over the last several months.