Open lazywei opened 10 years ago
Should our pairwise interface return scalar or a vector? Detailed discussion is here: #20 (comment)
WRT scalar, I think it may be a good idea, but I have yet to see an exact use case. Do we already have any algorithm that requires it?
How to organize third-party libraries? For example, there is a linear_models/liblinear_src in #23. We need to agree on a convention for how to include third-party libraries.
My suggestion (although arguable) would be to try to minimize external dependencies. If there's an absolute necessity to make it, I'd make a separate repo with this particular algorithm and then provide a reference to it. But that's speculative. It'd just be great if we had less compile-time and dependency resolution problems.
WRT scalar, I think it may be a good idea, but I have yet to see an exact use case. Do we already have any algorithm that requires it?
Not yet, I think. Although users can achieve the same thing with a for loop or something like map or apply, we may be able to do some optimized calculation if we implement this ourselves. I have no specific opinion on this one; both (scalar, vector) are fine with me.
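To make the scalar-versus-vector question concrete, here is a minimal sketch. The PairwiseMetric interface, the Euclidean type and the DistancesTo helper are names assumed for illustration only; they are not the project's actual API. It shows a scalar-returning metric with the vector form built on top of it by a plain loop, which is the part a library-side implementation could optimize.

```go
package main

import (
	"fmt"
	"math"
)

// PairwiseMetric is a hypothetical interface for this sketch: it compares two
// equal-length vectors and returns a single scalar.
type PairwiseMetric interface {
	Distance(a, b []float64) float64
}

// Euclidean implements PairwiseMetric with the usual L2 distance.
type Euclidean struct{}

// Distance assumes len(a) == len(b).
func (Euclidean) Distance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// DistancesTo builds the vector form on top of the scalar form by looping
// over the rows; this loop is what the caller would otherwise write by hand.
func DistancesTo(m PairwiseMetric, query []float64, rows [][]float64) []float64 {
	out := make([]float64, len(rows))
	for i, r := range rows {
		out[i] = m.Distance(query, r)
	}
	return out
}

func main() {
	rows := [][]float64{{0, 0}, {3, 4}}
	fmt.Println(DistancesTo(Euclidean{}, []float64{0, 0}, rows))
}
```

With this split, the scalar method stays the only thing an algorithm has to implement, and a vector helper can be added later without changing the interface.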
My suggestion (although arguable) would be to try to minimize external dependencies. If there's an absolute necessity to make it, I'd make a separate repo with this particular algorithm and then provide a reference to it. But that's speculative. It'd just be great if we had less compile-time and dependency resolution problems.
I agree that we should minimize external dependencies, and that we should reduce compile-time and dependency resolution problems.
However, sometimes we simply need those external libraries. For example, libsvm is the best library for SVMs; almost all SVM implementations in other languages (Python, R, etc.) are built on libsvm. The same goes for liblinear.
I'd prefer to just put external libraries in our repo, though. go get is poor at managing dependencies. If we put external libraries in a separate repo, it may introduce other problems.
I have no opinion on the scalar/vector issue. I'll let you guys decide. Seems like it's working fine as it is right now.
External libraries - if we use them, we have to make sure that they are easily installable across platforms. The numpy/scikit-learn stack in Python is notoriously difficult to install - I don't want that to be the case with our library. It probably makes sense that we include them within the repo, but within a subfolder like ext, to ensure that people don't go digging around in the wrong stuff.
We should move to biogo.matrix. It seems to be the same package, but with much better documentation. If anyone has any problems, please let me know, otherwise we'll migrate.
Dataframes: the only problem with static typing in Go is that we will have either strings or float64s as labels, for categorical/continuous outcomes. How do we propose to solve this for users, without lots of ugly type assertions? Also, why would we use a string_frame? I'm not sure that I see the use case at the moment. The current dataframe looks good to me.
:+1: @lazywei should I take over moving to biogo.matrix? If you're already familiar with it, I'd ask to let me do it, if possible.
I have no opinion on the scalar/vector issue. I'll let you guys decide. Seems like it's working fine as it is right now.
OK, then let's focus on scalar-only return at this stage.
External libraries - if we use them, we have to make sure that they are easily installable across platforms. The numpy/scikit-learn stack in Python is notoriously difficult to install - I don't want that to be the case with our library. It probably makes sense that we include them within the repo, but within a subfolder like ext, to ensure that people don't go digging around in the wrong stuff.
Totally agree with you! I hope our library can be installed easily! ext/ sounds good to me. We could provide something like make.go, so users can go get + go run make.go to finish installation.
We should move to biogo.matrix. It seems to be the same package, but with much better documentation. If anyone has any problems, please let me know, otherwise we'll migrate.
OK, let's migrate to biogo.matrix
Dataframes: the only problem with static typing in Go is that we will have either strings or float64s as labels, for categorical/continuous outcomes. How do we propose to solve this for users, without lots of ugly type assertions? Also, why would we use a string_frame? I'm not sure that I see the use case at the moment. The current dataframe looks good to me.
I think we can first assume the labels are strings, and then provide a function to convert string labels to float64 labels.
In that case, I think a Label struct is necessary; I'll implement it.
The reason I'd like to have a StringFrame is that I think it's possible for each row in a dataset to have more than one label, e.g.:
12.2, 0.1, 3.4, positive, happy, relax
22.3, 3.1, 1.0, negative, sad, nervous
If that is the case, we can't just use []string to store labels (we need [][]string, which should be wrapped).
That being said, I think the previously mentioned Label struct can resolve this problem, but StringFrame would be more general. The question is: do we need StringFrame, or is Label alone enough?
@lazywei should I take over moving to biogo.matrix? If you're already familiar with it, I'd ask to let me do it, if possible.
@ifesdjeen OK, thanks for your effort! I'll focus on DataFrame then.
@lazywei - can you 'sketch' out an idea of what you'd want the StringFrame to look like, and how it would integrate in a training setting? It can be pseudocode - I'm just having a hard time visualising what you want it to be, at the moment.
@ifesdjeen - thanks for taking on the effort to migrate to biogo! Hopefully it should be as easy as just doing a find and replace ;)
np np, will take a closer look at it tonight.
@sjwhitworth It could be just simple manipulations, just like a string version of a matrix. The idea came up in ParseCSV. I'd like to be able to parse CSV with multiple labels.
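As a rough sketch of what parsing a CSV with multiple labels per row could look like, the following splits each record into numeric features and a slice of string labels. SplitCSV and its numFeatures parameter are hypothetical names for illustration; this is not the existing ParseCSV, whose signature is not shown in this thread. It assumes the first numFeatures columns are floats and everything after them is a label.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// SplitCSV (hypothetical) reads CSV text and returns the first numFeatures
// columns of each row as float64 features and the remaining columns as
// string labels, i.e. the [][]string a StringFrame would wrap.
func SplitCSV(data string, numFeatures int) ([][]float64, [][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.TrimLeadingSpace = true // tolerate "12.2, 0.1" style spacing
	records, err := r.ReadAll()
	if err != nil {
		return nil, nil, err
	}
	var features [][]float64
	var labels [][]string
	for n, rec := range records {
		if len(rec) < numFeatures {
			return nil, nil, fmt.Errorf("row %d: want at least %d columns, got %d", n, numFeatures, len(rec))
		}
		row := make([]float64, numFeatures)
		for i := 0; i < numFeatures; i++ {
			v, err := strconv.ParseFloat(rec[i], 64)
			if err != nil {
				return nil, nil, fmt.Errorf("row %d, column %d: %v", n, i, err)
			}
			row[i] = v
		}
		features = append(features, row)
		labels = append(labels, rec[numFeatures:])
	}
	return features, labels, nil
}

func main() {
	data := "12.2, 0.1, 3.4, positive, happy, relax\n22.3, 3.1, 1.0, negative, sad, nervous\n"
	features, labels, err := SplitCSV(data, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(features)
	fmt.Println(labels)
}
```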
But what if you have a dataset that is half floats, half strings? How do you do any learning based off of that?
Oh... that's really a problem... Basically, we can train each label separately. Of course some algorithms need to consider all labels at the same time, but I think that might be out of our scope at this moment.
I think the best way is to force the labels to be all numeric. Classification labels can be converted to 0, 1, 2 etc. Regression labels can just be float64.
So the problem is should we automatically convert classification labels to numeric in dataset I/O?
How about something like
    type Label struct {
        values     *mat64.Dense
        categories map[int](map[int]string)
    }
For example,
    values = [[0, 1, 3.12], [1, 0, 5.134], ...]
    categories = {
        0: {0: "happy", 1: "sad"},
        1: {0: "positive", 1: "negative"},
        2: "regression values"
    }
So we can have all labels in numeric, and we can still know what these values mean (which category, regression or classification etc.)
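A minimal, runnable sketch of this idea follows. It is an illustration under assumptions, not the implementation that was merged: plain [][]float64 stands in for *mat64.Dense so the example needs no external packages, and EncodeLabels is a hypothetical helper that assigns codes 0, 1, 2, ... per column in order of first appearance.

```go
package main

import "fmt"

// Label is a simplified stand-in for the struct proposed above: plain slices
// replace *mat64.Dense so the sketch has no external dependencies.
type Label struct {
	Values     [][]float64            // one row per sample, one column per label
	Categories map[int]map[int]string // column -> numeric code -> original string
}

// EncodeLabels (hypothetical) converts string labels to numeric codes,
// numbering each column's categories in order of first appearance.
func EncodeLabels(rows [][]string) Label {
	l := Label{Categories: map[int]map[int]string{}}
	codes := map[int]map[string]int{} // column -> string -> code
	for _, row := range rows {
		vals := make([]float64, len(row))
		for col, s := range row {
			if codes[col] == nil {
				codes[col] = map[string]int{}
				l.Categories[col] = map[int]string{}
			}
			code, seen := codes[col][s]
			if !seen {
				code = len(codes[col])
				codes[col][s] = code
				l.Categories[col][code] = s
			}
			vals[col] = float64(code)
		}
		l.Values = append(l.Values, vals)
	}
	return l
}

func main() {
	l := EncodeLabels([][]string{
		{"positive", "happy"},
		{"negative", "sad"},
		{"positive", "sad"},
	})
	fmt.Println(l.Values)
	// Decode a value back to its original string.
	fmt.Println(l.Categories[1][int(l.Values[2][1])])
}
```

Keeping the reverse mapping next to the numeric values means predictions can be decoded back to the original strings without the user tracking the encoding separately.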
Sounds good to me. Label encoding built in. Nice. :)
Nope! Let's do it!
All agreed @ifesdjeen @npbool @marcoseravalli @macmania ?
Wow, there's been a lot of activity on this since May 1st! I forked it off with a view to implementing some of the algorithms I struggle with (context: I'm revising for a course in Data Mining). If you're familiar with WEKA (as I am), they have (IMHO) a nice solution to this problem that I've implemented (see instances.go and attributes.go). Instances contains the underlying memory (still kept in a go.matrix) and a slice of Attributes, which impose structure on the data and convert the native float64 format into something meaningful. I've implemented two Attribute types: CategoricalAttribute, which you can use to hold binned values, class values etc., and FloatAttribute, which maps directly to the underlying type. This is all unit-tested and ready to go. Also see the docs.
Advantages
Disadvantages
All in all, really promising project so far, let's hope I can save @lazywei some work.
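For readers who have not used WEKA, the general shape of the attribute idea can be sketched as below. This is an illustration written for this thread, not the code in instances.go or attributes.go: the method names (Name, Format, RowString) and field names are assumptions, and plain slices stand in for the go.matrix storage.

```go
package main

import (
	"fmt"
	"strings"
)

// Attribute is a hypothetical interface: it turns the float64 held in the
// underlying storage into a human-readable value.
type Attribute interface {
	Name() string
	Format(v float64) string
}

// FloatAttribute maps directly to the underlying float64.
type FloatAttribute struct{ name string }

func (a FloatAttribute) Name() string            { return a.name }
func (a FloatAttribute) Format(v float64) string { return fmt.Sprintf("%.2f", v) }

// CategoricalAttribute stores an index into a list of category names.
type CategoricalAttribute struct {
	name   string
	values []string
}

func (a CategoricalAttribute) Name() string { return a.name }
func (a CategoricalAttribute) Format(v float64) string {
	i := int(v)
	if i < 0 || i >= len(a.values) {
		return "?" // unknown category index
	}
	return a.values[i]
}

// Instances pairs raw float64 storage with one Attribute per column.
type Instances struct {
	Attributes []Attribute
	Rows       [][]float64
}

// RowString renders row i using each column's Attribute.
func (inst Instances) RowString(i int) string {
	parts := make([]string, len(inst.Attributes))
	for col, attr := range inst.Attributes {
		parts[col] = attr.Format(inst.Rows[i][col])
	}
	return strings.Join(parts, " ")
}

func main() {
	inst := Instances{
		Attributes: []Attribute{
			FloatAttribute{name: "x"},
			CategoricalAttribute{name: "mood", values: []string{"happy", "sad"}},
		},
		Rows: [][]float64{{12.2, 0}, {22.3, 1}},
	}
	fmt.Println(inst.RowString(0))
	fmt.Println(inst.RowString(1))
}
```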
@Sentimentron Wow, that's really awesome. I think your Instance is basically the Label I want to implement.
I have, however, some concerns:
Instance for labels is good, but would it be overkill for storing features? I mean, in most cases, features are just numeric values. Of course, sometimes features may be categorical. I don't have much experience training on categorical data, so I'm just wondering whether we really need to deal with those values in our library. That being said, if the cost is cheap (in terms of memory usage, CPU usage etc.), I have no opinion on this :-)
Instance seems a little ambiguous. In my experience, instances usually refers to training data (more specifically, the training features). However, in your code, it seems that Instance is more general than that. It seems that we can use Instance for both training features and training labels, or even other data structures. Therefore, I think it would be good if we could come up with a more meaningful name. (This is really a minor concern, though.)
@sjwhitworth do you have any suggestions or ideas on this matter?
Anyway, thanks for your effort. I really like your implementation, it saves my life :+1: By the way, a little off-topic: if you are familiar with WEKA, and if you have time, could you help me implement the I/O functions for the ARFF format? Thanks.
Let's address those concerns:
And edit: I'm also familiar with the ARFF format. It's super-simple and essentially CSV apart from the header, which specifies the type of each attribute explicitly. Because I already revised the CSV importer quite a lot to use the new Instances type, about 90% of the code needed to support ARFF already exists, and I have lots of ARFF files lying around (in various states of validity) for unit-testing.
Hey guys sorry for being gone for quite some time (we had a major release, so I completely failed to keep up with OSS schedule), back on track now, gotta take care of matrix migration, hope it's still relevant.
Glad to still see some discussions and activity here.
I had some concerns about using biogo.matrix.
In particular, it provides no support for eigen or singular value decomposition, which are important for a plethora of dimensionality reduction problems. Gonum's mat64 package, on the other hand, supports both. Additionally, the goals of the biogo.matrix library seem to be primarily to act as a supplement to the biogo bioinformatics project. I don't foresee the library evolving to include the flexibility that a linear algebra library such as mat64 would provide. However, I'm pretty certain that such flexibility will be beneficial for our project.
I understand that mat64 is somewhat lacking in adequate documentation, but in light of the features that biogo.matrix lacks perhaps we could rethink the migration.
Any insights on this?
@hpxro7 I have no idea how much work would need to be done if we choose to roll back to mat64. On the other hand, do you think it is possible to implement those eigen computations ourselves? If so, I think we can work on this together, while others can focus on ML algorithms.
If basic sparse/dense matrix arithmetic is already there, I think it won't be too hard to implement something like Arnoldi iteration?
If we were to roll back to mat64, that's not a problem for me: I just have to revert the code which allocates and accesses the matrix.
Looking at my fork off master, it seems like all of the matrix related code sits atop mat64. I couldn't find any references to biogo. I'm assuming then that most of the code written in biogo has yet to be pulled into master?
@lazywei I think that might be a cool idea, but I'd fathom that rolling back any biogo.matrix instances to mat64 would be far less challenging compared to re-implementing these somewhat involved linear algebra algorithms. My opinion is that since there is already a decent implementation of a matrix library, we could stick to using that instead of replicating what has already been done.
OK, I think switching back to gonum is an acceptable choice. If all of you guys think it is a good idea, then let's do it. I can work on the docs. Also, I think it would be a good idea to stick to gonum's other packages: https://github.com/gonum It may help, and I think it can save us much time. :+1:
As mentioned in other issues, there are some decisions we need to make.
mat64 lacks docs, but the author replies to issues very fast, and memory usage is optimized.
biogo.matrix docs are quite good, but I have no experience using it.
base package, since it is related to many other packages in golearn.
linear_models/liblinear_src in https://github.com/sjwhitworth/golearn/pull/23. We need to agree on a convention for how to include third-party libraries.
Please leave comments about the above issues. We should settle these issues first. @sjwhitworth @ifesdjeen @npbool @marcoseravalli @macmania