Closed by lazywei 9 years ago
I'm not sure I understand what you did here. Can the two files be used instead of samples.json with the current Ruby code?
> I'm not sure I understand what you did here.
Basically, I'm trying to create the data files in a more general format that can be used not only with Naive Bayes but also with other machine learning algorithms.
> Can the two files be used instead of samples.json with the current Ruby code?
Yes, they can. They are more general in the sense that they have explicit token counts per file (samples.json contains token counts per language instead of per file).
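To illustrate the difference, here is a minimal Ruby sketch using made-up data. The file names, tokens, and the helper `per_language_counts` are hypothetical; this is neither Linguist's API nor the actual layout of samples.json. It only shows that per-file token counts can be summed into per-language counts, whereas per-language totals cannot be split back into per-file counts.

```ruby
# Hypothetical per-file token counts: one entry per sample file.
PER_FILE = [
  { language: "Ruby",   file: "a.rb", tokens: { "def" => 2, "end" => 2 } },
  { language: "Ruby",   file: "b.rb", tokens: { "def" => 1, "puts" => 3 } },
  { language: "Python", file: "c.py", tokens: { "def" => 4, "print" => 1 } }
].freeze

# Collapse per-file counts into per-language totals
# (the kind of aggregate a Naive Bayes model works from).
def per_language_counts(samples)
  samples.each_with_object({}) do |sample, totals|
    counts = (totals[sample[:language]] ||= Hash.new(0))
    sample[:tokens].each { |token, n| counts[token] += n }
  end
end

p per_language_counts(PER_FILE)
```

Going in this direction loses the file boundaries, which is why the per-file data is the more general of the two.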
Sounds like a good idea for your research on alternatives to NB.
But I don't understand why you want to add them to this repository. Would it improve the memory usage?
> But I don't understand why you want to add them to this repository. Would it improve the memory usage?
That's exactly what I'm wondering. Should I commit the data files into the repo, or should I commit the program that generates them?
It will not directly improve the memory usage unless we implement algorithms other than NB. And to implement other algorithms, we will need these files, as samples.json cannot provide enough information for us to train a general classifier.
> It will not directly improve the memory usage
In that case, I don't think you should add the files or the program that generates them to this repository. They are only useful for your research. Only the final result of your research (i.e. the new algorithm, if you find a better one) should be implemented in Linguist.
Without these files or the program generating them, other developers can't implement new algorithms. In fact, this is not restricted to my research.
Also, even if this is for my research, other developers will not be able to execute the algorithm if they don't have these files. It's just like how you need samples.json to execute NB.
Well, could you make a gist with the program and add a link to it in this issue? You should probably also mention them and add a link in your application for GSoC.
I really think it's a good idea and a necessary step for your research but we shouldn't have research scripts in the production gem.
Wow, that's a good idea, thanks. Here is a draft of code that builds a bag of words. I haven't tested it carefully, though. Also, I rewrote Samples.each a little bit to support custom samples folders.
The gist is here: https://gist.github.com/lazywei/bedeeb1d71d2ad9f21ba
I think I might need to rephrase a little bit. This file should be able to serve the same role as samples.json. In other words, we can use this file for NB. Also, with this file, other developers can try to implement new algorithms.
I agree we shouldn't put experiment-related files into the production gem. However, I also think we should provide a convenient way for developers to develop new algorithms. What do you think?
Thanks!
Implemented in mockingbird.
Hi,
I'm trying to implement some other classifiers as mentioned previously (e.g. #2205). Now I've created a dataset in libsvm format (training from samples/, testing from public gists). The reasons are:
- samples.json is restricted to the Naive Bayesian Classifier
However, I'm not sure if you guys are OK with the following changes, so I'd like to open an issue for discussion before I make the PR:
1. training.libsvm and testing.libsvm that can be used for testing -- should I commit these two files, or should I commit the script that generates these two files? (I saw samples.json is ignored, though.)
2. rake test:classifier should be able to save the public gists it used -- so we can use them to build bag-of-words for (1). Should I commit this change? Or should I leave the "downloading public gists" task to you guys, as you can use internal access?
Thanks.
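For readers unfamiliar with the libsvm format mentioned above, each line has the form `<label> <index>:<value> ...`, with feature indices in ascending order. The following Ruby sketch shows one way per-file token counts could be written in that form. It is not the code from the gist: the helper `to_libsvm`, the sample data, and the choice of 1-based indices assigned in first-seen order are all assumptions made for illustration.

```ruby
# Convert hypothetical per-file token counts into libsvm-style lines.
# Returns the lines plus the token->index and language->label maps,
# so the same mappings can be reused for a testing set.
def to_libsvm(samples)
  vocab  = {} # token    -> 1-based feature index (assumed scheme)
  labels = {} # language -> integer class label   (assumed scheme)

  lines = samples.map do |sample|
    label = (labels[sample[:language]] ||= labels.size + 1)
    features = sample[:tokens].map do |token, count|
      [(vocab[token] ||= vocab.size + 1), count]
    end
    # libsvm expects feature indices in ascending order.
    body = features.sort_by(&:first).map { |index, count| "#{index}:#{count}" }
    ([label] + body).join(" ")
  end

  [lines, vocab, labels]
end

lines, _vocab, _labels = to_libsvm([
  { language: "Ruby",   tokens: { "def" => 2, "end" => 2 } },
  { language: "Python", tokens: { "print" => 1, "def" => 4 } }
])
puts lines
```

A real script would also need to apply the training vocabulary to the testing files rather than building a new one, so that feature indices agree between training.libsvm and testing.libsvm.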