sethjuarez / numl

Machine Learning for .NET
http://numl.net
MIT License
430 stars 104 forks source link

Complex Feature type with Property Path support #39

Open bdschrisk opened 8 years ago

bdschrisk commented 8 years ago

When using the description method in numl the Descriptor object throws an exception when a Feature or Label attribute is defined on a complex type property. Properties can be complex types from external libraries thus preventing numl attribute usage.

For example given the below type: public class Foo { [Feature] public Bar One { get; set; } [Label] public bool IsOK { get; set; } }

public class Bar { public int A { get; set; } public int B { get; set; } public int C { get; set; } }

The descriptor would throw an error on converting the type Bar to a double. This is the same for nullable type properties also.

A suggested implementation is to use property path(s) in the Feature attribute. This would allow the Descriptor to extract one or more sub properties from the complex type.

Suggested Implementation: public class Foo { [FeatureSelector("Bar.A", Bar.B", "Bar.C")] public Bar One { get; set; } [Label] public bool IsOK { get; set; } }

sethjuarez commented 8 years ago

In this case it is better to create a custom property. You can inherit from the Property class to do your own multi valued conversion. This can then be added to the feature collection.

ghost commented 8 years ago

I can give the custom property a try in numlsample and see how that goes. On Nov 11, 2015 9:40 PM, "Seth Juarez" notifications@github.com wrote:

In this case it is better to create a custom property. You can inherit from the Property class to do your own multi valued conversion. This can then be added to the feature collection.

— Reply to this email directly or view it on GitHub https://github.com/sethjuarez/numl/issues/39#issuecomment-155987565.

ghost commented 8 years ago

I've been looking into the innards of the descriptor building mechanism for this, and I have a question.

Discrete is defined as a property on the Property class:

public bool Discrete { get; set; }

This works for both normal features/properties as well as enumerable ones, because enumerable properties are all of the same type. But if we have a complex type with discrete and non-discrete values, then how would this be resolved?

As far as I can tell, Discrete is only used for logic flow in two places (currently in portable): line 225 in DecisionTreeGenerator.cs and line 35 in NaiveBayesGenerator.cs.

I would think the NaiveBayesGenerator can be ignored I would think, because it is a label and I wouldn't think labels should be complex types (though perhaps I'm wrong?).

But with the DecisionTreeGenerator (and possibly other future algorithms), we definitely have logic based on this.

One possible change to address this could be changing the Discrete from a bool property to a IsDiscrete(int i) virtual method, and maybe giving it a default value to show its usual use: IsDiscrete(int i = 0). This would also need changes in the serialization aspects as well I think.

Thoughts?

sethjuarez commented 8 years ago

Discrete is an interesting artifact of assuming that there would always be a one-to-one correspondence between a property and a single (double) value. Under this assumption there were two categories of numbers (in my original opinion):

  1. Discrete - binary and small subset of integers
  2. Everything else

This worked well under the above mentioned assumption - not so sure when there is a one-to-many correspondence between a property and the generated numbers.

An additional assertion - the descriptor was never intended to be used to flatten object graphs into vectors. My hope was that there would first be a projection onto a flattened structure that would subsequently be described by the descriptor. I initially went down this path but quickly realized I would be re-implementing a lot of what already existed in AutoMapper. This is why I put work into the descriptor working even with anonymous types (it does). You could theoretically flatten into an anonymous type and couple it with a weak descriptor.

What do you think?

ghost commented 8 years ago

Hmm...I think the discrete/everything else split was a good decision.

As for your additional assertion, it has been my understanding that the descriptor is the OOP-to-math bridge, enabling the wrapping/consumption/simplification of the ML complexities in object terms (the main reason I chose numl in the first place). As such, I personally like the object graph flattening being encapsulated in the descriptor and I appreciate your distinctions of strong/weak descriptors in the comments of your blog. So I think that was a good decision also.

And using those terms, this issue is simply an implementation detail of a strong descriptor, and I think it can be done as you mentioned by creating custom Property and Attribute classes (which is what I've done so far). I might just have to tweak the Discrete property, possibly by making it a virtual property with (obviously) its current implementation as the base implementation. This would avoid a breaking change and also allow the mapping process to handle the one-to-many relationship of properties to matrix numbers. If I can, then it's conceivable I could do an attribute as is mentioned by @bdschrisk to provide some kind of syntactic sugar for this.

I don't know for sure though, I just wanted to get your thoughts on this before going ahead. I'll go ahead and try to implement this Discrete tweak and see how it looks. Though I haven't dug into your process of math mapping back to the descriptor (label) just yet, it should work (famous last words)...I've been meaning to get some kind of T-shirt and/or coffee mug made:

"That should work! 8-O"

ghost commented 8 years ago

Well I've committed the changes to my local forks (numlsample, and numl itself, but I am unsure how to proceed with the pull request. I say this, because there are actually two would-be pull requests that are related: numlsample and numl itself.

I had to slightly tweak Property to get the Discrete mapping correct in numl, and so the numlsample pull request would be dependent upon the numl pull request. Also, I haven't implemented unit tests that I want to, as I need some guidance on how to fit it in to your unit testing structure.