Add libsvm format bag of words dataset

lazywei commented 9 years ago

Hi,

I'm trying to implement some other classifiers, as mentioned previously (e.g. #2205). I've now created a dataset in libsvm format (training data from samples/, testing data from public gists). The reasons are:

- The libsvm format is efficient for sparse data in terms of file size.
- A general dataset is useful for implementing and evaluating new classifiers. (The current samples.json is restricted to the Naive Bayesian classifier.)
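To illustrate the sparse layout, here is a minimal sketch of how per-file token lists could be written as libsvm lines (`<label> <index>:<count> ...`, with indices ascending and starting at 1). The helper names, the toy token lists, and the language-to-integer mapping are hypothetical; this is not the code from the gist or from Linguist.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a 1-based feature index, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_libsvm_line(label_id, tokens, vocab):
    """Render one file as '<label> <index>:<count> ...' with ascending indices."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    features = " ".join(f"{idx}:{cnt}" for idx, cnt in sorted(counts.items()))
    return f"{label_id} {features}".rstrip()


# Hypothetical toy data: (language, tokens found in one sample file).
samples = [
    ("Ruby", ["def", "end", "puts", "end"]),
    ("Python", ["def", "print", "import"]),
]
label_ids = {"Ruby": 0, "Python": 1}  # assumed mapping from language to integer label

vocab = build_vocabulary(tokens for _, tokens in samples)
lines = [to_libsvm_line(label_ids[lang], tokens, vocab) for lang, tokens in samples]
print("\n".join(lines))
```

Only tokens that actually occur in a file appear on its line, which is why the format stays small when the vocabulary is large.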

However, I'm not sure if you guys are OK with the following changes, so I'd like to open an issue for discussion before I make the PR:

1. I've created two files, training.libsvm and testing.libsvm, that can be used for testing -- should I commit these two files, or should I commit the script that generates them? (I saw that samples.json is ignored, though.)
2. I've made the rake test:classifier task able to save the public gists it uses, so we can use them to build the bag of words for (1). Should I commit this change? Or should I leave the "downloading public gists" task to you guys, as you can use internal access?

Thanks.

pchaigno commented 9 years ago

I'm not sure I understand what you did here. Can the two files be used instead of samples.json with the current Ruby code?

lazywei commented 9 years ago

> I'm not sure I understand what you did here.

Basically, I'm trying to create the data files in a more general format that can be used not only for Naive Bayes but also for other machine learning algorithms.

> Can the two files be used instead of samples.json with the current Ruby code?

Yes, they surely can. They are more general in the sense that they have explicit token counts per file (samples.json contains token counts per language instead of per file).
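A small sketch of why per-file counts are the more general form: per-language totals can always be recomputed by summing the per-file rows, but not the other way around. The row layout below is a hypothetical stand-in and does not reproduce the real structure of samples.json.

```python
from collections import Counter, defaultdict


def aggregate_by_language(per_file_rows):
    """Sum per-file token counts into per-language token counts."""
    totals = defaultdict(Counter)
    for language, counts in per_file_rows:
        totals[language].update(counts)  # adds counts token by token
    return {lang: dict(c) for lang, c in totals.items()}


# Hypothetical per-file rows: (language, {token: count in that file}).
per_file_rows = [
    ("Ruby", {"def": 1, "end": 2}),
    ("Ruby", {"def": 3, "puts": 1}),
    ("Python", {"def": 2, "import": 1}),
]

per_language = aggregate_by_language(per_file_rows)
print(per_language)
```

Once the rows are summed, the information about which file each count came from is gone, which is what classifiers that train on individual examples would need.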

pchaigno commented 9 years ago

Sounds like a good idea for your research on alternatives to NB.

But I don't understand why you want to add them to this repository. Would it improve the memory usage?

lazywei commented 9 years ago

> But I don't understand why you want to add them to this repository. Would it improve the memory usage?

That's exactly what I'm wondering. Should I commit the data files to the repo, or should I commit the program that generates them? It will not directly improve the memory usage unless we implement algorithms other than NB. And to implement other algorithms, we will need these files, as samples.json cannot provide enough information for us to train a general classifier.

pchaigno commented 9 years ago

> It will not directly improve the memory usage

In that case, I don't think you should add either the files or the program that generates them to this repository. They are only useful for your research. Only the final result of your research (i.e. the new algorithm, if you find a better one) should be implemented in Linguist.

lazywei commented 9 years ago

Without these files or the program that generates them, other developers can't implement new algorithms. In fact, this is not restricted to my research. And even if it were only for my research, other developers would not be able to run the algorithm without these files. It's just like needing samples.json to run NB.

pchaigno commented 9 years ago

Well, could you make a gist with the program and add a link to it in this issue? You should probably also mention them and add a link in your application for GSoC.

I really think it's a good idea and a necessary step for your research but we shouldn't have research scripts in the production gem.

lazywei commented 9 years ago

Wow, that's a good idea, thanks. Here is a draft of the code that builds the bag of words; I haven't tested it carefully, though. I also rewrote Samples.each a little to support custom samples folders. The gist is here: https://gist.github.com/lazywei/bedeeb1d71d2ad9f21ba

I think I need to rephrase a little. This file should be able to serve the same role as samples.json; in other words, we can use it for NB. With this file, other developers can also try to implement new algorithms. I agree we shouldn't put experiment-related files into the production gem. However, I also think we should provide a convenient way for developers to develop new algorithms. What do you think?
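As a rough illustration of using such a file for NB, here is a sketch that parses libsvm lines and trains a small multinomial Naive Bayes with add-one smoothing. It is a generic textbook version written for this example, not Linguist's classifier; the function names, the toy lines, and the vocabulary size of 5 are all assumptions.

```python
import math
from collections import Counter, defaultdict


def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    features = {}
    for item in parts[1:]:
        idx, value = item.split(":")
        features[int(idx)] = float(value)
    return int(parts[0]), features


def train_nb(rows, vocab_size, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per label."""
    doc_counts = Counter()
    feat_counts = defaultdict(Counter)
    for label, feats in rows:
        doc_counts[label] += 1
        feat_counts[label].update(feats)
    total_docs = sum(doc_counts.values())
    model = {}
    for label in doc_counts:
        denom = sum(feat_counts[label].values()) + alpha * vocab_size
        model[label] = {
            "log_prior": math.log(doc_counts[label] / total_docs),
            "log_lik": {
                i: math.log((feat_counts[label][i] + alpha) / denom)
                for i in range(1, vocab_size + 1)
            },
        }
    return model


def predict(model, feats):
    """Return the label with the highest log posterior score."""
    def score(label):
        m = model[label]
        # Feature indices outside the training vocabulary are ignored.
        return m["log_prior"] + sum(
            v * m["log_lik"][i] for i, v in feats.items() if i in m["log_lik"]
        )
    return max(model, key=score)


# Hypothetical training lines: label 0 and label 1, five features in total.
training_lines = ["0 1:1 2:2 3:1", "1 1:1 4:1 5:1"]
rows = [parse_libsvm_line(line) for line in training_lines]
model = train_nb(rows, vocab_size=5)
print(predict(model, {2: 1.0, 3: 1.0}), predict(model, {4: 2.0}))
```

The same parsed rows could be handed to any other classifier, which is the point of keeping the data per file.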

Thanks!

lazywei commented 9 years ago

Implemented in mockingbird.

github-linguist / linguist

Add libsvm format bag of words dataset #2237