nh2 closed this issue 9 years ago.
Yes, this is possible. Look at the `Replicate` type family here: https://github.com/mikeizbicki/vector-heterogenous/blob/master/src/Data/Vector/Heterogenous/HList.hs
In your case, the code would look something like:
```haskell
type NB = Bayes TH_label (Multivariate Point
  '[ MultiCategorical '[Output]
   , Independent Normal (Replicate 200 Double)
   ]
   Double
   )
```
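For intuition, here is a minimal, self-contained sketch of what such a `Replicate` type family can look like. The names `Nat1`, `Zero`, and `Succ` mirror the Peano encoding used by vector-heterogenous, but this is a simplified stand-in, not the library's actual definition:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE PolyKinds #-}

import Data.Type.Equality ((:~:) (..))

-- Peano naturals, lifted to the type level via DataKinds
data Nat1 = Zero | Succ Nat1

-- Replicate n a reduces to a type-level list of n copies of a
type family Replicate (n :: Nat1) (a :: k) :: [k] where
  Replicate 'Zero     a = '[]
  Replicate ('Succ n) a = a ': Replicate n a

-- Compile-time check that the family reduces as claimed:
-- two copies of Double is exactly '[Double, Double]
_check :: Replicate ('Succ ('Succ 'Zero)) Double :~: '[Double, Double]
_check = Refl
```

So `Replicate 200 Double` in the type above stands for a 200-element type-level list of `Double`s, without writing them out by hand.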
Unfortunately, Haskell's support for such complicated types is not very good yet. For example, if you want to use a vector to store all those `Double`s, you won't be able to use the Template Haskell functions for making the type indexing; you'll have to do it manually. If you show me the code you're using, I'll write the instances for you myself if you want.
I'm also planning on rewriting the type lens system I have. The current system has a limit where it takes linear time to access the type indices. In the new system this will take constant time. This should make the whole thing go considerably faster, especially in your case with 200 attributes. I'll move this to the top of the queue of things to do. Hopefully it should be done Tuesday or Wednesday next week.
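To see why access is linear, here is a hypothetical sketch (not HLearn's actual code) of how a Peano-indexed lookup into a type-level list works:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE PolyKinds #-}

import Data.Type.Equality ((:~:) (..))

data Nat1 = Zero | Succ Nat1

-- Indexing by a Peano number walks one 'Succ per element, so the type
-- checker performs O(n) reduction steps for every access at index n.
type family HIndex (n :: Nat1) (ts :: [k]) :: k where
  HIndex 'Zero     (t ': ts) = t
  HIndex ('Succ n) (t ': ts) = HIndex n ts

-- e.g. index 1 of '[Bool, Double, Char] reduces to Double
_check :: HIndex ('Succ 'Zero) '[Bool, Double, Char] :~: Double
_check = Refl
```

With 200 attributes, each such lookup forces up to 200 reduction steps at compile time, which is where the slowdown comes from.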
Regarding your first answer: I don't see how that works quite yet. So far I thought you could only use `Replicate` to go over explicitly named entries in your record, not over the elements within an entry itself.
This is what I'm trying, and it doesn't compile:
```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE MultiParamTypeClasses #-}

module HlearnBayes where

import HLearn.Algebra
import HLearn.Models.Distributions
import HLearn.Models.Classifiers
import HLearn.Models.Classifiers.Bayes
import HLearn.Models.Classifiers.Common

data Output = A | B
  deriving (Read, Show, Eq, Ord)

data Point = Point
  { _output   :: Output
  , _features :: [Double]
  } deriving (Read, Show, Eq, Ord)

makeTypeLenses ''Point

instance Labeled Point where
  type Label Point      = Output
  type Attributes Point = Point
  getLabel = _output
  getAttributes p = p

type NB = Bayes TH_output (Multivariate Point
  '[ MultiCategorical '[Output]
   , Independent Normal (Replicate 2 Double)
   ]
   Double
   )

p1 = Point A [1,2]
p2 = Point A [2,3]
p3 = Point B [3,4]
p4 = Point B [2,1]

ps = [p1, p2, p3, p4]

toClassify = Point A [2,2]

-- Train
classifier1 = train ps :: NB

x = classify classifier1 (getAttributes toClassify)
```
Should it work like this?
What's going on is that the `makeTypeLenses` function makes a way to access the whole `_features` list, but not a way to access individual elements within the list. You'll have to do that yourself by writing an instance of `Trainable` by hand. In particular, the `Trainable` class looks like:
```haskell
class Trainable t where
  type GetHList t
  getHList :: t -> GetHList t
```
You'll have to make your `GetHList Point` look like `'[Output, Double, Double]`; what the Template Haskell creates is a `'[Output, [Double]]`.
See http://hackage.haskell.org/packages/archive/HLearn-distributions/1.0.0.1/doc/html/HLearn-Models-Distributions-Multivariate-Internal-TypeLens.html for an example of what the Template Haskell creates.
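The flat-versus-nested distinction can be shown with a minimal heterogeneous list. This is a simplified stand-in for the `HList` that vector-heterogenous provides, just to illustrate the two shapes:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE KindSignatures #-}

import Data.Kind (Type)

-- A heterogeneous list indexed by the type-level list of its element types
data HList (ts :: [Type]) where
  HNil  :: HList '[]
  (:::) :: t -> HList ts -> HList (t ': ts)
infixr 5 :::

data Output = A | B deriving (Show, Eq)

-- What the Template Haskell derives for the record: the feature list is
-- one single entry of type [Double] ...
nested :: HList '[Output, [Double]]
nested = A ::: [1.0, 2.0] ::: HNil

-- ... versus the flat shape the classifier needs, with one entry per feature
flat :: HList '[Output, Double, Double]
flat = A ::: 1.0 ::: 2.0 ::: HNil
```

The hand-written `Trainable` instance has to produce the flat shape, which is why a `list2hlist`-style conversion from `[Double]` is needed.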
Thank you, I got it working:
```haskell
-- This works with:
--   HLearn-algebra-1.0.1.1
--   HLearn-classification-1.0.1.1
--   HLearn-distributions-1.0.0.1

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE UndecidableInstances #-}

module HlearnBayes where

import HLearn.Algebra
import HLearn.Models.Distributions
import HLearn.Models.Classifiers.Bayes
import HLearn.Models.Classifiers.Common

data Output = A | B
  deriving (Read, Show, Eq, Ord)

data Point = Point
  { _output   :: Output
  , _features :: [Double]
  } deriving (Read, Show, Eq, Ord)

instance Labeled Point where
  type Label Point      = Output
  type Attributes Point = Point
  getLabel = _output
  getAttributes p = p

instance Trainable Point where
  -- getHList :: Point -> GetHList t
  type GetHList Point = HList (Output ': Replicate 2 Double) -- needs UndecidableInstances
  -- which is equivalent to:
  --   type GetHList Point = HList '[Output, Double, Double]
  getHList Point{ _output, _features } = _output ::: list2hlist _features

-- The following are also equivalent:
--   type GetHList Point = HList ('[Output] ++ '[Double, Double])
--   type GetHList Point = HList ('[Output] ++ Replicate 2 Double)
--   type GetHList Point = HList (Output ': '[Double, Double]) -- only this one does not need UndecidableInstances

data TH_output   = TH_output
data TH_features = TH_features

instance TypeLens TH_output where
  type TypeLensIndex TH_output = Nat1Box Zero

-- Not sure if this is correct for my manual instance: is this multiple features in one box?
instance TypeLens TH_features where
  type TypeLensIndex TH_features = Nat1Box (Succ Zero)

type NB = Bayes TH_output (Multivariate Point
  '[ MultiCategorical '[Output]
   , Independent Normal (Replicate 2 Double) -- same as '[Double, Double]
   ]
   Double
   )

type MyDist = Multivariate Point
  '[ MultiCategorical '[Output]
   , Independent Normal (Replicate 2 Double)
   ]
   Double

-- Not needed (and also not sure if entirely correct):
--   instance MultivariateLabels Point where
--     getLabels dist = ["TH_output", "TH_features"]

p1 = Point A [1,2]
p2 = Point A [2,3]
p3 = Point B [3,4]
p4 = Point B [2,1]

ps = [p1, p2, p3, p4]

toClassify = Point A [2,2]

-- Train
classifier1 = train ps :: NB
-- dist = train ps :: MyDist

res = classify classifier1 (getAttributes toClassify)
```
It would be great to have a small explanation about this (how to make your own data points that can contain any container, like my list) in the haddocks.
Another question: all of this requires fairly heavy type-level programming (e.g. `Repeat 200 Double`). Would it be possible (and do you plan) to make another interface to HLearn that relies less on type-level programming and allows running the same set of algorithms?

And another question:
If in my example I want to change the `Attributes` to be just the `[Double]`s (which makes more sense), using:

```haskell
type Attributes Point = [Double]
[...]
getAttributes = _features
[...]
toClassify = [2,2]
res = classify classifier1 toClassify
```
then I get:
```
/home/niklas/src/hs/hlearn-bayes-variable.hs: line 90, column 7:
    Couldn't match type `[Double]' with `Point'
    In the expression: classify classifier1 toClassify
    In an equation for `res': res = classify classifier1 toClassify
```
What's going on here? I am not even sure where in the code the type mismatch actually occurs.
And third: my example with 200 `Double`s doesn't work at all, since vector-heterogenous only allows `Replicate` up to 20! :o

I defined some more instances myself, of the form `type instance ToNat1 ... = Succ (ToNat1 ...)`.
Oh well, looks like type-level programming isn't quite yet capable of dealing with large inputs:

```
/home/niklas/src/haskell/hemokit/apps/Learning.hs: line 33, column 38:
    Context reduction stack overflow; size = 201
    Use -fcontext-stack=N to increase stack size to N
```
I am now using `-fcontext-stack=500` to compile a `Replicate 450 Double`.
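As an aside, the same flag can also be set per module with an `OPTIONS_GHC` pragma, so it doesn't have to be passed on every command line (note that GHC 8.0 and later replaced this flag with `-freduction-depth`):

```haskell
{-# OPTIONS_GHC -fcontext-stack=500 #-}
```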
It doesn't compile: my simple module above has been compiling for 10 minutes now, and I'm out of (8 GB of) memory.
I've redone all of the code for indexing into data types to make it more efficient. It's much better now, but still not very good on hundreds of attributes. For some reason, GHC's constraint solver is taking quadratic (or worse) time to check whether the `Multivariate` types are valid. I have no idea why it's doing that, and have opened up a Stack Overflow question about it. I'm hoping that it's just one of the type families being declared wrong, and not a bug in GHC.
About being less type-safe: everything in the library can be done with less type safety. The disadvantage is that it pushes all of the safety checks to run time. All machine learning libraries are essentially doing it this way right now; this is actually how the earliest versions of HLearn were implemented too. My eventual goal is to use `Data.Dynamic` to make run-time type checks a possibility too. I'm still experimenting with ideas to make the interface as easy as possible.
> I have no idea why it's doing that, and have opened up a stack overflow question about it.

I think you should also post to haskell-cafe to get some attention.
I narrowed down the cause to being just that type families are ridiculously slow. I've posted a message to Haskell-cafe asking about it.
I've reported a bug in GHC about the slow type families also.
I benchmarked this a bit with different sizes and got a clearly quadratic curve. Hopefully the GHC bug will be fixed. Can you post a link?
Also, most of the type families in HLearn are hacks around the fact that true type level Nats aren't natively supported in GHC 7.6. This is slated to be fixed in the 7.8 release: http://ghc.haskell.org/trac/ghc/ticket/4385.
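With native type-level naturals from `GHC.TypeLits` (available from GHC 7.8 on), a count like 200 becomes a plain numeric literal in the type instead of a 200-deep Peano encoding. A small illustrative sketch (`FeatureVec` and `mkFeatureVec` are hypothetical names, not part of HLearn):

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}

import GHC.TypeLits (Nat, KnownNat, natVal)
import Data.Proxy (Proxy (..))

-- A list of Doubles whose intended length is tracked in the type
newtype FeatureVec (n :: Nat) = FeatureVec [Double] deriving Show

-- Length-checked smart constructor: the expected length is read off the
-- type-level Nat at run time via natVal
mkFeatureVec :: forall n. KnownNat n => [Double] -> Maybe (FeatureVec n)
mkFeatureVec xs
  | fromIntegral (length xs) == natVal (Proxy :: Proxy n) = Just (FeatureVec xs)
  | otherwise                                             = Nothing
```

For example, `mkFeatureVec [1,2] :: Maybe (FeatureVec 2)` succeeds, while the same list at type `FeatureVec 200` yields `Nothing`.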
I have a boolean classification problem where the input features consist of 200 doubles.
Can I have a datapoint like
or do all fields of an HLearn datapoint have to be listed explicitly in the record?
(I know that the features will always have a given length - the problem is that I want to avoid writing a length 200 record.)
In case this is supported, how would I type a classifier for this, e.g.
Thank you!