shogun-toolbox / shogun

Shōgun
http://shogun-toolbox.org
BSD 3-Clause "New" or "Revised" License

Process small datasets for recommendation systems #1982

Open emtiyaz opened 10 years ago

emtiyaz commented 10 years ago

This task is for the "Variational Learning for Recommendations with Big Data" project idea: http://shogun-toolbox.org/page/Events/gsoc2014_ideas#variational_learning

In this task, our goal is to get familiar with reading and processing data, which will be useful before we move on to big datasets. We will also use these small datasets for debugging. You can find example MATLAB code here: https://github.com/emtiyaz/recommedationDatasets The task is to write something similar within Shogun.

Please let us know if you are working on it, and feel free to ask @karlnapf or me any questions.

karlnapf commented 10 years ago

An IPython notebook is the ideal output of this task. We can also start to play around with this data using the algorithms that already exist in Shogun.

k29 commented 10 years ago

May I work on this? At the moment I am trying to implement a C++ equivalent of the MATLAB code. How do I make it more generic? What do we mean by "write something similar within Shogun"? Also, what should the IPython notebook cover? Kindly help me out. Cheers!!

emtiyaz commented 10 years ago

By "something similar", I mean that you need to write it in C++ inside Shogun, so that later we can use algorithms implemented in Shogun on this data. This task is supposed to give you an idea of the kind of datasets we will be using in this project. For simplicity, I have chosen small datasets, but the actual dataset that we want to use in the project will be much bigger than these. I hope that is clear. Also, there is no need to make it generic!

karlnapf commented 10 years ago

@emtiyaz Can we add those datasets as examples in our repository? Then the task could be to do that and to visualise them in a notebook, since we do not really want C++ code that can only load one particular dataset. We have loads of readers for different file formats, so it would be better to bring the data into a form that we can process later.

emtiyaz commented 10 years ago

Agreed, we should. I am writing some code for this data, which basically implements a simple recommendation system using GPs. The new task will be up soon!

karlnapf commented 10 years ago

Nice, that will be useful!

k29 commented 10 years ago

@emtiyaz, @karlnapf Hey, I have worked on this, but I couldn't figure out where it belongs in the source tree (as per our discussion on IRC). Here's a link to my repo: https://github.com/k29/recommendationDataset Do let me know the next steps, where to send the pull request, and what the IPython notebook should contain. Cheers!!

emtiyaz commented 10 years ago

That's great! @karlnapf is perhaps a better person to comment on further steps. However, you might also want to write code to process the MovieLens-100k data (see my repository for that). This will be useful for the next task, where we will play with this data using simple recommender systems.
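For illustration only (this is not code from either repository linked in this thread), here is a minimal Python sketch of what processing MovieLens-100k ratings could look like. It assumes the `u.data` layout of tab-separated `user id`, `item id`, `rating`, `timestamp` with 1-based ids; the function names `load_ratings` and `to_dense` and the sample rows are hypothetical, and a real dataset of this kind would normally be kept sparse rather than dense.

```python
import csv
import io


def load_ratings(lines):
    """Parse tab-separated (user, item, rating, timestamp) rows.

    Returns a list of (user, item, rating) triples; the timestamp is dropped.
    """
    ratings = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        ratings.append((int(row[0]), int(row[1]), float(row[2])))
    return ratings


def to_dense(ratings, missing=0.0):
    """Build a users-by-items matrix (list of lists) from rating triples.

    Ids are assumed to be 1-based, so user u maps to row u - 1.
    Unobserved entries are filled with `missing`.
    """
    n_users = max(user for user, _, _ in ratings)
    n_items = max(item for _, item, _ in ratings)
    matrix = [[missing] * n_items for _ in range(n_users)]
    for user, item, rating in ratings:
        matrix[user - 1][item - 1] = rating
    return matrix


# Made-up sample rows in the assumed u.data layout (not real MovieLens data).
SAMPLE = "1\t1\t5\t100\n1\t3\t3\t101\n2\t2\t4\t102\n"

ratings = load_ratings(io.StringIO(SAMPLE))
matrix = to_dense(ratings)
```

To read the real file instead of the sample, the same function could be called as `load_ratings(open("u.data", newline=""))`, where the path is a placeholder for wherever the dataset was unpacked.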

k29 commented 10 years ago

@emtiyaz On it!!

karlnapf commented 10 years ago

@k29 Thanks for the link. I am a little confused though: you wrote generic C++ code to read those files into STL data structures. First, we do not use those data structures in Shogun, but our own. Second, as those files are just ASCII, you can use the existing IO classes, such as CSVFile. With that class, loading the files takes just a couple of lines.

An IPython notebook where you load the files would have been a bit more useful, since this is what is needed to process the data further later on. Also, the ML here is being done from a notebook.

k29 commented 10 years ago

@karlnapf I have taken your suggestions above into account and have come up with https://github.com/k29/recommendationDataset/blob/master/sushi_try.cpp, which handles the user and item metadata for the sushi3 dataset. Kindly have a look and let me know. Cheers

karlnapf commented 10 years ago

Hi @k29, you wrote code that imports a CSVFile using Shogun. The goal, however, is to do ML on this data from an IPython notebook.

k29 commented 10 years ago

Yes, sure. I will move on to that. I was just checking whether I am on the right track in reading and processing the data with the Shogun IO classes and data structures you mentioned, as opposed to my initial generic STL code.

karlnapf commented 10 years ago

Yes, that's how we do it, but this is nothing new: you are just calling a Shogun class (we have examples for that, in fact). But yeah, that's like the first two lines of the notebook that this task is about.