github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Possible integration with Mockingbird #2505

Closed lazywei closed 8 years ago

lazywei commented 9 years ago

Hi linguist community!

As some of you might know, I'm currently doing GSoC with @arfon, @vmg, and @bkeepers. We are rewriting the Naive Bayesian classifier in Go here: https://github.com/lazywei/mockingbird

The first, very simple and crude version of mockingbird is now done. It provides:

  1. Command Line Interface
  2. Collect Rosetta Code as training samples from here
  3. Convert the collected samples to LIBSVM format (details: http://ntucsu.csie.ntu.edu.tw/~cjlin/libsvm/)
  4. Train a Naive Bayesian classifier on a LIBSVM-format dataset and save the model in gob format
  5. Predict labels for a LIBSVM-format dataset using the trained model
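
For anyone unfamiliar with the two formats in steps 3 and 4, here is a minimal sketch. It is not mockingbird's actual code: `FormatLibsvmLine`, `Model`, `EncodeModel`, and `DecodeModel` are hypothetical names made up for illustration. A LIBSVM line is `<label> <index>:<value> ...` with ascending feature indices, and gob is Go's standard binary serialization (`encoding/gob`).

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"sort"
	"strings"
)

// FormatLibsvmLine renders one sample as "<label> <index>:<value> ...".
// Indices are written in ascending order and zero-valued features are
// omitted, since the format is sparse.
func FormatLibsvmLine(label int, features map[int]float64) string {
	indices := make([]int, 0, len(features))
	for idx, v := range features {
		if v != 0 {
			indices = append(indices, idx)
		}
	}
	sort.Ints(indices)

	var b strings.Builder
	fmt.Fprintf(&b, "%d", label)
	for _, idx := range indices {
		fmt.Fprintf(&b, " %d:%g", idx, features[idx])
	}
	return b.String()
}

// Model is a hypothetical stand-in for a trained model (assumption).
type Model struct {
	Labels    []string
	LogPriors []float64
}

// EncodeModel serializes m with encoding/gob.
func EncodeModel(m Model) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(m); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// DecodeModel reverses EncodeModel.
func DecodeModel(data []byte) (Model, error) {
	var m Model
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&m)
	return m, err
}

func main() {
	// Feature 4 has value 0 and is therefore dropped from the line.
	fmt.Println(FormatLibsvmLine(3, map[int]float64{7: 2, 1: 1, 4: 0})) // prints "3 1:1 7:2"

	data, err := EncodeModel(Model{Labels: []string{"Go", "Ruby"}, LogPriors: []float64{-0.5, -1.5}})
	if err != nil {
		panic(err)
	}
	m, err := DecodeModel(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Labels)
}
```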

I'd like to start a discussion on two questions:

  1. Is it possible to integrate mockingbird with linguist? We're trying to reduce the memory usage of linguist's classifier.
  2. Since we have a LIBSVM format converter, we now have the flexibility to implement more classification algorithms. Are there any suggestions or preferences for which algorithms to implement in mockingbird?
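
To make the "flexibility" point concrete, here is a rough sketch of how several algorithms could sit behind one interface once every sample is a sparse feature map. This is only an assumption about a possible design, not mockingbird's API: `Sample`, `Classifier`, and `NaiveBayes` are hypothetical names, and the implementation is a plain multinomial Naive Bayes with add-one smoothing.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Sample is one labeled row, as it would be read from a LIBSVM file.
type Sample struct {
	Label    int
	Features map[int]float64 // feature index -> count
}

// Classifier is the hypothetical interface each algorithm would implement.
type Classifier interface {
	Train(samples []Sample)
	Predict(features map[int]float64) int
}

// NaiveBayes is a multinomial Naive Bayes classifier with add-one smoothing.
type NaiveBayes struct {
	classCounts   map[int]float64         // label -> number of samples
	featureCounts map[int]map[int]float64 // label -> feature -> summed count
	featureTotals map[int]float64         // label -> sum over all features
	vocab         map[int]struct{}        // every feature index seen in training
	total         float64                 // number of training samples
}

func NewNaiveBayes() *NaiveBayes {
	return &NaiveBayes{
		classCounts:   make(map[int]float64),
		featureCounts: make(map[int]map[int]float64),
		featureTotals: make(map[int]float64),
		vocab:         make(map[int]struct{}),
	}
}

func (nb *NaiveBayes) Train(samples []Sample) {
	for _, s := range samples {
		nb.classCounts[s.Label]++
		nb.total++
		if nb.featureCounts[s.Label] == nil {
			nb.featureCounts[s.Label] = make(map[int]float64)
		}
		for idx, v := range s.Features {
			nb.featureCounts[s.Label][idx] += v
			nb.featureTotals[s.Label] += v
			nb.vocab[idx] = struct{}{}
		}
	}
}

// Predict returns the label with the highest log-posterior score.
// Labels are visited in ascending order so ties resolve deterministically.
// It returns 0 if the classifier has not been trained.
func (nb *NaiveBayes) Predict(features map[int]float64) int {
	labels := make([]int, 0, len(nb.classCounts))
	for l := range nb.classCounts {
		labels = append(labels, l)
	}
	sort.Ints(labels)

	best, bestScore := 0, math.Inf(-1)
	vocabSize := float64(len(nb.vocab))
	for _, l := range labels {
		score := math.Log(nb.classCounts[l] / nb.total)
		for idx, v := range features {
			p := (nb.featureCounts[l][idx] + 1) / (nb.featureTotals[l] + vocabSize)
			score += v * math.Log(p)
		}
		if score > bestScore {
			best, bestScore = l, score
		}
	}
	return best
}

func main() {
	var clf Classifier = NewNaiveBayes()
	clf.Train([]Sample{
		{Label: 1, Features: map[int]float64{1: 3, 2: 1}},
		{Label: 2, Features: map[int]float64{3: 4}},
	})
	fmt.Println(clf.Predict(map[int]float64{1: 2}))
	fmt.Println(clf.Predict(map[int]float64{3: 1}))
}
```

With a shape like this, another algorithm would only need to provide its own `Train` and `Predict`; the data loading and the CLI would stay the same.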

Thanks!

/cc @pchaigno

pchaigno commented 9 years ago

> The first, very simple and crude version of mockingbird is now done.

Awesome! I'm not familiar with Go but it seems like a good opportunity to learn :)

> We're trying to reduce the memory usage of linguist's classifier.

Were you able to quantify this? What is the current memory usage?

> Are there any suggestions or preferences for which algorithms to implement in mockingbird?

What's the current accuracy with the Naive Bayesian classifier? What are you using as test samples?

lazywei commented 9 years ago

> Were you able to quantify this? What is the current memory usage?

Nope, not yet. I'll benchmark the memory usage ASAP.
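
On the Go side, one rough way to get a number is to compare the live heap before and after loading the model. The sketch below is only an illustration under that assumption: `MeasureHeapGrowth` and `liveHeapBytes` are hypothetical helpers, and the 8 MiB buffer stands in for a real model. It uses `runtime.ReadMemStats` from the standard library, and it only sees the Go heap, so the Ruby classifier would have to be measured separately.

```go
package main

import (
	"fmt"
	"runtime"
)

// liveHeapBytes forces a garbage collection and reports the bytes of
// heap objects that are still allocated afterwards.
func liveHeapBytes() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

// MeasureHeapGrowth calls build and returns its result together with the
// change in live heap bytes. The result stays reachable during the second
// measurement because it is returned to the caller.
func MeasureHeapGrowth(build func() interface{}) (interface{}, int64) {
	before := liveHeapBytes()
	v := build()
	after := liveHeapBytes()
	return v, int64(after) - int64(before)
}

func main() {
	// Stand-in for loading a trained model (assumption): an 8 MiB buffer.
	model, growth := MeasureHeapGrowth(func() interface{} {
		return make([]byte, 8<<20)
	})
	fmt.Printf("live heap grew by about %d KiB\n", growth/1024)
	runtime.KeepAlive(model)
}
```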

> What's the current accuracy with the Naive Bayesian classifier? What are you using as test samples?

I haven't run an accuracy comparison. However, I have used some test samples to make sure that mockingbird and linguist's NB give the same results on the same training/testing data. (Plus, there is no randomness in NB.)
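
The kind of check I mean boils down to comparing the two prediction lists position by position. A minimal sketch, assuming both classifiers' predictions are already available as integer label slices (`Disagreements` is a hypothetical helper, not part of either project):

```go
package main

import "fmt"

// Disagreements returns the indices at which two prediction slices differ.
// It reports an error if the slices do not cover the same number of samples.
func Disagreements(a, b []int) ([]int, error) {
	if len(a) != len(b) {
		return nil, fmt.Errorf("length mismatch: %d vs %d predictions", len(a), len(b))
	}
	var diff []int
	for i := range a {
		if a[i] != b[i] {
			diff = append(diff, i)
		}
	}
	return diff, nil
}

func main() {
	// Made-up label IDs for illustration only.
	fromLinguist := []int{1, 2, 3}
	fromMockingbird := []int{1, 5, 3}

	diff, err := Disagreements(fromLinguist, fromMockingbird)
	if err != nil {
		panic(err)
	}
	fmt.Println(diff) // prints "[1]"
}
```

Since NB is deterministic, an empty result is the expected outcome when both implementations see the same data.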

As for the test samples, I'm using a subset of the Rosetta Code data.

pchaigno commented 9 years ago

> As for the test samples, I'm using a subset of the Rosetta Code data.

Would it be possible to use Linguist's samples in addition to Rosetta's? I know some of the Rosetta Code samples are very short, so adding samples from Linguist could make the training/test sample set more relevant. What do you think?

nemesiscodex commented 9 years ago

> Since we have a LIBSVM format converter, we now have the flexibility to implement more classification algorithms. Are there any suggestions or preferences for which algorithms to implement in mockingbird?

That sounds great! Hey @lazywei, what do you think about #2618?

arfon commented 8 years ago

Closing as stale.