sjwhitworth / golearn

Machine Learning for Go
MIT License

Some interfaces / dependence discussion #24

Open lazywei opened 10 years ago

lazywei commented 10 years ago

As mentioned in other issues, there are some decisions we need to make.

Please leave comments on the issues above; we should settle them first. @sjwhitworth @ifesdjeen @npbool @marcoseravalli @macmania

ifesdjeen commented 10 years ago

> Should our pairwise interface return a scalar or a vector? Detailed discussion is here: #20 (comment)

WRT scalar, I think it may be a good idea, but I've yet to see an exact use case. Do we already have any algorithm that requires it?

> How to organize third-party libraries? For example, there is a linear_models/liblinear_src in #23. We need to agree on a convention for how to include third-party libraries.

My suggestion (although arguable) would be to try to minimize external dependencies. If a dependency is absolutely necessary, I'd make a separate repo for that particular algorithm and then reference it. But that's speculative. It'd just be great if we had fewer compile-time and dependency resolution problems.

lazywei commented 10 years ago

> WRT scalar, I think it may be a good idea, but I've yet to see an exact use case. Do we already have any algorithm that requires it?

Not yet, I think. Users can achieve the same thing with a for loop or something like map or apply, but we may be able to do some optimized calculation if we implement this ourselves. I have no specific opinion on this one; both (scalar, vector) are fine with me.
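To make the trade-off concrete, here is a minimal sketch of a scalar-returning pairwise interface with the vector variant built on top by a plain loop. The names `PairwiseMetric`, `Euclidean` and `DistancesTo` are hypothetical and are not taken from the golearn code base.

```go
package main

import (
	"fmt"
	"math"
)

// PairwiseMetric is a hypothetical interface: Distance returns a single
// scalar for one pair of vectors, which are assumed to have equal length.
type PairwiseMetric interface {
	Distance(a, b []float64) float64
}

// Euclidean implements PairwiseMetric with the L2 distance.
type Euclidean struct{}

func (Euclidean) Distance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// DistancesTo builds the "vector" variant from the scalar one by looping
// over rows, i.e. the for-loop approach a user could write themselves.
func DistancesTo(m PairwiseMetric, rows [][]float64, query []float64) []float64 {
	out := make([]float64, len(rows))
	for i, r := range rows {
		out[i] = m.Distance(r, query)
	}
	return out
}

func main() {
	rows := [][]float64{{0, 0}, {3, 4}}
	fmt.Println(DistancesTo(Euclidean{}, rows, []float64{0, 0}))
}
```

A built-in vector version would only pay off if it could avoid the per-row call overhead or reuse intermediate results, which this sketch does not attempt.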

> My suggestion (although arguable) would be to try to minimize external dependencies. If a dependency is absolutely necessary, I'd make a separate repo for that particular algorithm and then reference it. But that's speculative. It'd just be great if we had fewer compile-time and dependency resolution problems.

I agree that we should minimize external dependencies and reduce compile-time and dependency resolution problems. However, sometimes we just need those external libraries. For example, libsvm is the best library for SVMs; almost all SVM implementations in other languages (Python, R, etc.) are built on libsvm. The same goes for liblinear. I'd prefer to just put external libraries in our repo, though. go get is poor at managing dependencies, and putting external libraries in a separate repo may introduce other problems.

sjwhitworth commented 10 years ago

I have no opinion on the scalar/vector issue. I'll let you guys decide. Seems like it's working fine as it is right now.

External libraries - if we use them, we have to make sure that they are easily installable across platforms. The numpy/scikit-learn stack in Python is notoriously difficult to install - I don't want that to be the case with our library. It probably makes sense that we include them within the repo, but within a subfolder like ext, to ensure that people don't go digging around in the wrong stuff.

We should move to biogo.matrix. It seems to be the same package, but with much better documentation. If anyone has any problems, please let me know, otherwise we'll migrate.

Dataframes: the only problem with static typing in Go is that we will have either strings or float64s as labels, for categorical/continuous outcomes. How do we propose to solve this for users, without lots of ugly type assertions? Also, why would we use a string_frame? I'm not sure that I see the use case at the moment. The current dataframe looks good to me.

ifesdjeen commented 10 years ago

:+1: @lazywei should I take over moving to biogo.matrix? If you're already familiar with it, I'd ask to let me do it, if possible.

lazywei commented 10 years ago

> I have no opinion on the scalar/vector issue. I'll let you guys decide. Seems like it's working fine as it is right now.

OK, then let's focus on a scalar-only return at this stage.

> External libraries - if we use them, we have to make sure that they are easily installable across platforms. The numpy/scikit-learn stack in Python is notoriously difficult to install - I don't want that to be the case with our library. It probably makes sense that we include them within the repo, but within a subfolder like ext, to ensure that people don't go digging around in the wrong stuff.

Totally agree with you! I hope our library can be installed easily! ext/ sounds good to me. We could provide something like make.go, so users can run go get + go run make.go to finish the installation.

> We should move to biogo.matrix. It seems to be the same package, but with much better documentation. If anyone has any problems, please let me know, otherwise we'll migrate.

OK, let's migrate to biogo.matrix

> Dataframes: the only problem with static typing in Go is that we will have either strings or float64s as labels, for categorical/continuous outcomes. How do we propose to solve this for users, without lots of ugly type assertions? Also, why would we use a string_frame? I'm not sure that I see the use case at the moment. The current dataframe looks good to me.

I think we can first assume the labels are strings, and then provide a function to convert string labels to float64 labels. In that case, I think a Label struct is necessary; I'll implement it. The reason I'd like to have a StringFrame is that each row in a dataset may have more than one label, e.g.:

12.2, 0.1, 3.4, positive, happy, relax
22.3, 3.1, 1.0, negative, sad, nervous

If that is the case, we can't just use []string to store labels (we need [][]string, which should be wrapped). That being said, I think the previously mentioned Label struct can resolve this problem, but the StringFrame would be more general. The question is: do we need StringFrame, or is Label alone enough?
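As a rough illustration of the multi-label case, the sketch below splits each CSV row into numeric features and a slice of string labels, giving the [][]string shape described above. `SplitRows` and its `nFeatures` parameter are hypothetical; this is not the project's ParseCSV.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// SplitRows is a hypothetical helper. It parses CSV text whose first
// nFeatures columns are numeric and whose remaining columns are labels.
func SplitRows(data string, nFeatures int) ([][]float64, [][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.TrimLeadingSpace = true // tolerate "a, b, c" style spacing
	records, err := r.ReadAll()
	if err != nil {
		return nil, nil, err
	}
	var features [][]float64
	var labels [][]string
	for _, rec := range records {
		if len(rec) < nFeatures {
			return nil, nil, fmt.Errorf("row has %d columns, want at least %d", len(rec), nFeatures)
		}
		row := make([]float64, nFeatures)
		for i := 0; i < nFeatures; i++ {
			v, err := strconv.ParseFloat(rec[i], 64)
			if err != nil {
				return nil, nil, err
			}
			row[i] = v
		}
		features = append(features, row)
		labels = append(labels, rec[nFeatures:]) // one []string per row
	}
	return features, labels, nil
}

func main() {
	data := "12.2, 0.1, 3.4, positive, happy, relax\n" +
		"22.3, 3.1, 1.0, negative, sad, nervous\n"
	x, y, err := SplitRows(data, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(x)
	fmt.Println(y)
}
```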

> @lazywei should I take over moving to biogo.matrix? If you're already familiar with it, I'd ask to let me do it, if possible.

@ifesdjeen OK, thanks for your effort! I'll focus on DataFrame then.

sjwhitworth commented 10 years ago

@lazywei - can you 'sketch' out an idea of what you'd want the StringFrame to look like, and how it would integrate in a training setting? It can be pseudocode - I'm just having a hard time visualising what you want it to be, at the moment.

@ifesdjeen - thanks for taking on the effort to migrate to biogo! Hopefully it should be as easy as just doing a find and replace ;)

ifesdjeen commented 10 years ago

np np, will take a closer look at it tonight.

lazywei commented 10 years ago

@sjwhitworth It could be just simple manipulations, like a string version of the matrix. The idea came up in ParseCSV: I'd like to be able to parse CSV with multiple labels.

sjwhitworth commented 10 years ago

But what if you have a dataset that is half floats, half strings? How do you do any learning based off of that?

lazywei commented 10 years ago

Oh... that's really a problem... Basically, we can train on each label separately. Of course, some algorithms need to consider all labels at the same time, but I think that might be out of our scope at the moment.

I think the best way is to force the labels to be all numeric. Classification labels can be converted to 0, 1, 2, etc. Regression labels can just be float64. So the question is: should we automatically convert classification labels to numeric in dataset I/O? How about something like

    type Label struct {
        values     *mat64.Dense
        categories map[int]map[int]string
    }

For example,

    values = [[0, 1, 3.12], [1, 0, 5.134], ...]
    categories = {
        0: {0: "happy", 1: "sad"},
        1: {0: "positive", 1: "negative"},
        2: "regression values"
    }

So we can have all labels in numeric form and still know what the values mean (which category, regression or classification, etc.).
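A minimal sketch of how such label encoding might work follows. It makes two assumptions to stay dependency-free: the values are held in a plain [][]float64 rather than *mat64.Dense, and a column with no entry in the categories map is treated as a regression column. `EncodeColumn` and `Decode` are hypothetical helpers.

```go
package main

import "fmt"

// Label mirrors the proposed struct. Assumption: Values is [][]float64
// here instead of *mat64.Dense so the sketch needs no external packages.
type Label struct {
	Values     [][]float64
	Categories map[int]map[int]string // column -> code -> name
}

// EncodeColumn is a hypothetical helper that assigns codes 0, 1, 2, ...
// to the distinct strings in col, in order of first appearance.
func EncodeColumn(col []string) ([]float64, map[int]string) {
	codes := make(map[string]int)
	names := make(map[int]string)
	out := make([]float64, len(col))
	for i, s := range col {
		c, ok := codes[s]
		if !ok {
			c = len(codes)
			codes[s] = c
			names[c] = s
		}
		out[i] = float64(c)
	}
	return out, names
}

// Decode returns the category name for a classification column, or the
// raw number as text for a column without a category table.
func (l Label) Decode(row, col int) string {
	v := l.Values[row][col]
	if names, ok := l.Categories[col]; ok {
		return names[int(v)]
	}
	return fmt.Sprint(v)
}

func main() {
	mood, moodNames := EncodeColumn([]string{"happy", "sad", "happy"})
	l := Label{
		Values:     [][]float64{{mood[0], 3.12}, {mood[1], 5.134}, {mood[2], 0.5}},
		Categories: map[int]map[int]string{0: moodNames},
	}
	fmt.Println(l.Decode(1, 0), l.Decode(1, 1))
}
```

Keeping the code-to-name table per column means the numeric matrix can go straight into a learner while predictions can still be mapped back to the original strings.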

sjwhitworth commented 10 years ago

Sounds good to me. Label encoding built in. Nice. :)

lazywei commented 10 years ago

OK! Let's GO!

Summary:

Any other suggestions? LGTM

sjwhitworth commented 10 years ago

Nope! Let's do it!

sjwhitworth commented 10 years ago

All agreed @ifesdjeen @npbool @marcoseravalli @macmania ?

Sentimentron commented 10 years ago

Wow, there's been a lot of activity on this since May 1st! I forked it off with a view to implementing some of the algorithms I struggle with (context: I'm revising for a course in Data Mining). If you're familiar with WEKA (as I am), they have (IMHO) a nice solution to this problem that I've implemented (see instances.go and attributes.go). Instances contains the underlying memory (still kept in a go.matrix) and a slice of Attributes, which impose structure on the data and convert the native float64 format into something meaningful. I've implemented two Attribute types: CategoricalAttribute, which you can use to hold binned values, class values, etc., and FloatAttribute, which maps directly to the underlying type. This is all unit-tested and ready to go. Also see the docs.
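The general shape of that design could look something like the sketch below. It is an illustration only: the method names, the unexported fields and the use of [][]float64 in place of a go.matrix are all assumptions, not the code in the fork.

```go
package main

import (
	"fmt"
	"strconv"
)

// Attribute gives meaning to one float64 column. This method set is an
// assumption for illustration, not the API of the fork described above.
type Attribute interface {
	Name() string
	ToString(v float64) string
	FromString(s string) (float64, error)
}

// FloatAttribute maps directly onto the underlying float64.
type FloatAttribute struct{ name string }

func (a *FloatAttribute) Name() string { return a.name }
func (a *FloatAttribute) ToString(v float64) string {
	return strconv.FormatFloat(v, 'g', -1, 64)
}
func (a *FloatAttribute) FromString(s string) (float64, error) {
	return strconv.ParseFloat(s, 64)
}

// CategoricalAttribute stores each distinct string once and represents
// it in the data by its index.
type CategoricalAttribute struct {
	name   string
	values []string
}

func (a *CategoricalAttribute) Name() string { return a.name }
func (a *CategoricalAttribute) ToString(v float64) string {
	i := int(v)
	if i < 0 || i >= len(a.values) {
		return "?"
	}
	return a.values[i]
}
func (a *CategoricalAttribute) FromString(s string) (float64, error) {
	for i, v := range a.values {
		if v == s {
			return float64(i), nil
		}
	}
	a.values = append(a.values, s) // first time this value is seen
	return float64(len(a.values) - 1), nil
}

// Instances pairs the raw numbers with the attributes that interpret them.
type Instances struct {
	Attributes []Attribute
	Data       [][]float64
}

// Cell renders one stored value through its column's attribute.
func (in *Instances) Cell(row, col int) string {
	return in.Attributes[col].ToString(in.Data[row][col])
}

func main() {
	mood := &CategoricalAttribute{name: "mood"}
	score := &FloatAttribute{name: "score"}
	happy, _ := mood.FromString("happy")
	sad, _ := mood.FromString("sad")
	in := &Instances{
		Attributes: []Attribute{score, mood},
		Data:       [][]float64{{12.2, happy}, {22.3, sad}},
	}
	fmt.Println(in.Cell(0, 0), in.Cell(1, 1))
}
```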

Advantages

Disadvantages

All in all, really promising project so far, let's hope I can save @lazywei some work.

lazywei commented 10 years ago

@Sentimentron Wow, that's really awesome. I think your Instance is basically the Label I want to implement. I have, however, some concerns:

@sjwhitworth do you have any suggestions or ideas on this matter?

Anyway, thanks for your effort. I really like your implementation, it saves my life :+1: By the way, a little off-topic: if you are familiar with WEKA and have time, could you help me implement the I/O functions for the ARFF format? Thanks.

Sentimentron commented 10 years ago

Let's address those concerns:

And edit: I'm also familiar with the ARFF format. It's super-simple and essentially CSV apart from the header, which specifies the type of each attribute explicitly. Because I already revised the CSV importer quite a lot to use the new Instances type, about 90% of the code needed to support ARFF already exists, and I have lots of ARFF files lying around (in various states of validity) for unit-testing.
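To show how little the header adds on top of CSV, here is a hedged sketch of reading the `@attribute` declarations up to `@data`. `ARFFAttribute` and `ParseARFFHeader` are hypothetical names, and the sketch deliberately ignores quoted attribute names and date formats, which real ARFF files may contain.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ARFFAttribute is a hypothetical record for one @attribute line.
type ARFFAttribute struct {
	Name    string
	Type    string   // "numeric", "nominal", or the declared type in lower case
	Nominal []string // allowed values when Type == "nominal"
}

// ParseARFFHeader reads declarations until the @data line. Simplification:
// attribute names are assumed to contain no spaces or quotes.
func ParseARFFHeader(text string) ([]ARFFAttribute, error) {
	var attrs []ARFFAttribute
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "%") {
			continue // blank line or comment
		}
		lower := strings.ToLower(line)
		if strings.HasPrefix(lower, "@data") {
			break // the rest of the file is CSV-like rows
		}
		if !strings.HasPrefix(lower, "@attribute") {
			continue // @relation and anything unrecognised
		}
		fields := strings.Fields(line)
		if len(fields) < 3 {
			return nil, fmt.Errorf("malformed attribute line: %q", line)
		}
		name := fields[1]
		decl := strings.Join(fields[2:], " ")
		if strings.HasPrefix(decl, "{") && strings.HasSuffix(decl, "}") {
			var vals []string
			for _, v := range strings.Split(decl[1:len(decl)-1], ",") {
				vals = append(vals, strings.TrimSpace(v))
			}
			attrs = append(attrs, ARFFAttribute{Name: name, Type: "nominal", Nominal: vals})
			continue
		}
		t := strings.ToLower(decl)
		if t == "real" || t == "integer" {
			t = "numeric" // treated here as plain numeric columns
		}
		attrs = append(attrs, ARFFAttribute{Name: name, Type: t})
	}
	return attrs, sc.Err()
}

func main() {
	header := "@relation mood\n% a comment\n@attribute score numeric\n" +
		"@attribute mood {happy, sad}\n@data\n1.5,happy\n"
	attrs, err := ParseARFFHeader(header)
	if err != nil {
		panic(err)
	}
	for _, a := range attrs {
		fmt.Printf("%s %s %v\n", a.Name, a.Type, a.Nominal)
	}
}
```

Once the attribute list is known, the rows after `@data` can be handed to the existing CSV path, which is presumably why most of the required code already exists.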

ifesdjeen commented 10 years ago

Hey guys, sorry for being gone for quite some time (we had a major release, so I completely failed to keep up with my OSS schedule). I'm back on track now and will take care of the matrix migration; hope it's still relevant.

Glad to still see some discussions and activity here.

hpxro7 commented 10 years ago

I had some concerns about using biogo.matrix.

In particular, it provides no support for eigendecomposition or singular value decomposition, which are important for a plethora of dimensionality reduction problems. Gonum's mat64 package, on the other hand, supports both. Additionally, the goal of the biogo.matrix library seems to be primarily to act as a supplement to the biogo bioinformatics project. I don't foresee the library evolving to include the flexibility that a linear algebra library such as mat64 would provide. However, I'm pretty certain that such flexibility will be beneficial for our project.

I understand that mat64 is somewhat lacking in adequate documentation, but in light of the features that biogo.matrix lacks perhaps we could rethink the migration.

Any insights on this?

lazywei commented 10 years ago

@hpxro7 I have no idea how much work would be needed if we choose to roll back to mat64. On the other hand, do you think it is possible to implement those eigen computations ourselves? If so, I think we can work on this together, while others focus on ML algorithms.

If there is already basic sparse/dense matrix arithmetic, I think it won't be too hard to implement something like Arnoldi iteration?
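For a sense of scale, the sketch below is not Arnoldi iteration but its much simpler relative, power iteration, which only estimates the single dominant eigenvalue. It uses plain [][]float64 rather than any matrix library, and it assumes the matrix has one eigenvalue strictly largest in magnitude and that the uniform starting vector is not orthogonal to its eigenvector.

```go
package main

import (
	"fmt"
	"math"
)

// matVec returns A*x for a square matrix stored as rows.
func matVec(a [][]float64, x []float64) []float64 {
	y := make([]float64, len(a))
	for i, row := range a {
		for j, v := range row {
			y[i] += v * x[j]
		}
	}
	return y
}

// norm returns the Euclidean length of x.
func norm(x []float64) float64 {
	s := 0.0
	for _, v := range x {
		s += v * v
	}
	return math.Sqrt(s)
}

// PowerIteration estimates the dominant eigenvalue of a and a unit
// eigenvector for it by repeatedly applying a and renormalising.
func PowerIteration(a [][]float64, iters int) (float64, []float64) {
	n := len(a)
	x := make([]float64, n)
	for i := range x {
		x[i] = 1 / math.Sqrt(float64(n)) // uniform unit start vector
	}
	lambda := 0.0
	for k := 0; k < iters; k++ {
		y := matVec(a, x)
		nrm := norm(y)
		if nrm == 0 {
			break // x was mapped to zero; nothing more to learn
		}
		for i := range y {
			y[i] /= nrm
		}
		x = y
		// Rayleigh quotient x^T A x for the unit vector x.
		ax := matVec(a, x)
		lambda = 0
		for i := range x {
			lambda += x[i] * ax[i]
		}
	}
	return lambda, x
}

func main() {
	a := [][]float64{{2, 0}, {0, 1}}
	lambda, v := PowerIteration(a, 100)
	fmt.Println(lambda, v)
}
```

A full eigendecomposition or SVD needs considerably more than this (deflation or Krylov subspaces, orthogonalisation, convergence checks), which is the part an existing library would save.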

Sentimentron commented 10 years ago

If we were to roll back to mat64, that's not a problem for me: I'd just have to revert the code which allocates and accesses the matrix.

hpxro7 commented 10 years ago

Looking at my fork off master, it seems like all of the matrix-related code sits atop mat64. I couldn't find any references to biogo. I'm assuming, then, that most of the code written against biogo has yet to be pulled into master?

@lazywei I think that might be a cool idea, but I'd guess that rolling back any biogo.matrix usage to mat64 would be far less challenging than re-implementing these somewhat involved linear algebra algorithms. My opinion is that since there is already a decent implementation of a matrix library, we could stick with that instead of replicating what has already been done.

lazywei commented 10 years ago

OK, I think switching back to gonum is an acceptable choice. If all of you guys think it is a good idea, then let's do it. I can work on the docs. Also, I think it would be a good idea to stick to gonum's other packages: https://github.com/gonum It may help, and I think it can save us much time. :+1: