mikeizbicki / HLearn

Homomorphic machine learning
1.63k stars · 138 forks

Can I use variable-length things as data points? #13

Closed (nh2 closed this issue 9 years ago)

nh2 commented 11 years ago

I have a boolean classification problem where the input features consist of 200 doubles.

Can I have a datapoint like

data Point = Point
  { _label :: Bool
  , _features :: [Double]
  } 

or are all HLearn datapoints required to have each feature explicitly listed as a record field?

(I know that the features will always have a fixed length; the problem is that I want to avoid writing out a 200-field record.)

In case this is supported, how would I type a classifier for this, e.g.

type NB = Bayes TH_label (Multivariate Point
                            '[ MultiCategorical   '[Output]
                             , Independent Normal '[ ??? all the _features ]
                             ]
                             Double
                          )

Thank you!

mikeizbicki commented 11 years ago

Yes, this is possible. Look at the Replicate type family here: https://github.com/mikeizbicki/vector-heterogenous/blob/master/src/Data/Vector/Heterogenous/HList.hs
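For readers unfamiliar with it, Replicate builds a type-level list by repeating one element type n times. A minimal self-contained sketch of such a type family follows; note this is an illustration using GHC.TypeLits, not the library's actual definition (vector-heterogenous uses Peano-style naturals internally):

```haskell
{-# LANGUAGE DataKinds, TypeFamilies, TypeOperators, UndecidableInstances #-}
import GHC.TypeLits (Nat, type (-))

-- Repeat the type x, n times, producing a type-level list.
type family Replicate (n :: Nat) (x :: *) :: [*] where
  Replicate 0 x = '[]
  Replicate n x = x ': Replicate (n - 1) x

-- Replicate 3 Double reduces to '[Double, Double, Double]
```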

In your case, the code would look something like:

type NB = Bayes TH_label (Multivariate Point
                        '[ MultiCategorical   '[Output]
                         , Independent Normal (Replicate 200 Double)
                         ]
                         Double
                      )

Unfortunately, Haskell's support for such complicated types is not very good yet.

mikeizbicki commented 11 years ago

Also, if you want to use a vector to store all those Doubles, you won't be able to use the Template Haskell functions that generate the type indexing; you'll have to write it manually. If you show me the code you're using, I'll write the instances for you myself if you want.

I'm also planning on rewriting the type lens system. The current system has a limitation: accessing the type indices takes linear time. In the new system this will take constant time, which should make the whole thing considerably faster, especially in your case with 200 attributes. I'll move this to the top of my to-do queue; hopefully it will be done Tuesday or Wednesday next week.

nh2 commented 11 years ago

Regarding your first answer: I don't quite see how that works yet. So far I thought you could only use Replicate over explicitly named entries in your record, not over the elements within a single entry.

This is what I'm trying, and it doesn't compile:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE MultiParamTypeClasses #-}

module HlearnBayes where

import HLearn.Algebra
import HLearn.Models.Distributions
import HLearn.Models.Classifiers
import HLearn.Models.Classifiers.Bayes
import HLearn.Models.Classifiers.Common

data Output = A | B
            deriving (Read, Show, Eq, Ord)

data Point = Point
  { _output   :: Output
  , _features :: [Double]
  } deriving (Read, Show, Eq, Ord)

makeTypeLenses ''Point

instance Labeled Point where
  type Label      Point = Output
  type Attributes Point = Point
  getLabel = _output
  getAttributes p = p

type NB = Bayes TH_output (Multivariate Point
                            '[ MultiCategorical   '[Output]
                             , Independent Normal (Replicate 2 Double)
                             ]
                             Double
                          )

p1 = Point A [1,2]
p2 = Point A [2,3]
p3 = Point B [3,4]
p4 = Point B [2,1]

ps = [p1, p2, p3, p4]

toClassify = Point A [2,2]

-- Train
classifier1 = train ps :: NB

x = classify classifier1 (getAttributes toClassify)

Should it work like this?

mikeizbicki commented 11 years ago

What's going on is that the makeTypeLenses function generates a way to access the whole _features list, but not a way to access individual elements within the list. You'll have to do that yourself by writing an instance of Trainable by hand.

In particular, the Trainable class looks like:

class Trainable t where
    type GetHList t 
    getHList :: t -> GetHList t

You'll have to make your GetHList Point look like '[Output, Double, Double]. What the Template Haskell creates is '[Output, [Double]].

See http://hackage.haskell.org/packages/archive/HLearn-distributions/1.0.0.1/doc/html/HLearn-Models-Distributions-Multivariate-Internal-TypeLens.html for an example of what the template haskell creates.

nh2 commented 11 years ago

Thank you, I got it working:

-- This works with:
--     HLearn-algebra-1.0.1.1
--     HLearn-classification-1.0.1.1
--     HLearn-distributions-1.0.0.1

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE UndecidableInstances #-}

module HlearnBayes where

import HLearn.Algebra
import HLearn.Models.Distributions
import HLearn.Models.Classifiers.Bayes
import HLearn.Models.Classifiers.Common

data Output = A | B
            deriving (Read, Show, Eq, Ord)

data Point = Point
  { _output   :: Output
  , _features :: [Double]
  } deriving (Read, Show, Eq, Ord)

instance Labeled Point where
  type Label      Point = Output
  type Attributes Point = Point
  getLabel = _output
  getAttributes p = p

instance Trainable Point where
  -- getHList :: Point -> GetHList Point
  type GetHList Point = HList (Output ': Replicate 2 Double) -- needs UndecidableInstances
  -- which is equivalent to:
  -- type GetHList Point = HList '[Output, Double, Double]
  getHList Point{ _output, _features } = _output ::: list2hlist _features
  -- The following are also equivalent:
  -- type GetHList Point = HList ('[Output] ++ '[Double, Double])
  -- type GetHList Point = HList ('[Output] ++ Replicate 2 Double)
  -- type GetHList Point = HList (Output ': '[Double, Double]) -- only this one does not need UndecidableInstances

data TH_output   = TH_output
data TH_features = TH_features

instance TypeLens TH_output where
  type TypeLensIndex TH_output = Nat1Box Zero
-- Not sure if this is correct for my manual instance: is it OK that this is multiple features in one box?
instance TypeLens TH_features where
  type TypeLensIndex TH_features = Nat1Box (Succ Zero)

type NB = Bayes TH_output (Multivariate Point
                            '[ MultiCategorical   '[Output]
                             , Independent Normal (Replicate 2 Double) -- same as '[Double, Double]
                             ]
                             Double
                          )

type MyDist = Multivariate Point
                            '[ MultiCategorical   '[Output]
                             , Independent Normal (Replicate 2 Double)
                             ]
                             Double

-- Not needed (and also not sure if entirely correct)
-- instance MultivariateLabels Point where
--   getLabels dist = ["TH_output", "TH_features"]

p1 = Point A [1,2]
p2 = Point A [2,3]
p3 = Point B [3,4]
p4 = Point B [2,1]

ps = [p1, p2, p3, p4]

toClassify = Point A [2,2]

-- Train
classifier1 = train ps :: NB

-- dist = train ps :: MyDist

res = classify classifier1 (getAttributes toClassify)

It would be great to have a small explanation in the Haddocks about how to make your own data points that can contain any container, like my list.


nh2 commented 11 years ago

And another question:

If in my example I want to change the Attributes to be just the [Double]s (which makes more sense), using:

type Attributes Point = [Double]
[...] getAttributes = _features
[...]
toClassify = [2,2]
res = classify classifier1 toClassify

then I get:

/home/niklas/src/hs/hlearn-bayes-variable.hs: line 90, column 7:
  Couldn't match type `[Double]' with `Point'
  In the expression: classify classifier1 toClassify
  In an equation for `res': res = classify classifier1 toClassify

What's going on here? I am not even sure where in the code the type mismatch actually occurs.

nh2 commented 11 years ago

And third: My example with 200 Doubles doesn't work at all, since vector-heterogenous only allows Replicate up to 20! :o

I worked around it by defining some more type instance ToNat1 ... = Succ (ToNat1 ...) lines myself.
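For context: vector-heterogenous converts literal Nats to Peano-style naturals via a ToNat1 family, with one instance written out per literal, which is why it stops at a fixed bound. A hedged sketch of what such hand-written extensions look like (names follow the library; the exact shape of its instances is an assumption):

```haskell
-- Sketch, assuming the library's Nat1 / ToNat1 definitions roughly like:
--   data Nat1 = Zero | Succ Nat1
--   type family ToNat1 (n :: Nat) :: Nat1
type instance ToNat1 21 = Succ (ToNat1 20)
type instance ToNat1 22 = Succ (ToNat1 21)
-- ...and so on, one instance per literal, up to the size you need.
```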

nh2 commented 11 years ago

Oh well, looks like type-level programming isn't quite yet capable of dealing with large inputs:

/home/niklas/src/haskell/hemokit/apps/Learning.hs: line 33, column 38:
  Context reduction stack overflow; size = 201
  Use -fcontext-stack=N to increase stack size to N

I am now using -fcontext-stack=500 to compile a Replicate 450 Double.

It doesn't compile, though: my simple module above has been compiling for 10 minutes now, and I'm out of (8 GB of) memory.

mikeizbicki commented 11 years ago

I've redone all of the code for indexing into data types to make it more efficient. It's much better now, but still not very good with hundreds of attributes. For some reason, GHC's constraint solver takes quadratic (or worse) time checking whether the Multivariate types are valid. I have no idea why, and have opened a Stack Overflow question about it. I'm hoping it's just that one of the type families is declared wrong, and not a bug in GHC.

About being less type-safe: everything in the library can be done with less type safety. The disadvantage is that it pushes all of the safety checks to run time, which is essentially how all machine learning libraries work right now; it's also how the earliest versions of HLearn were implemented. My eventual goal is to use Data.Dynamic to make run-time type checks a possibility too. I'm still experimenting with ideas to make the interface as easy as possible.
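As an illustration of the run-time-checks idea (standard Data.Dynamic from base, not HLearn code):

```haskell
import Data.Dynamic (Dynamic, toDyn, fromDynamic)

-- Wrap a value, erasing its static type:
d :: Dynamic
d = toDyn (3.14 :: Double)

-- The type check now happens at run time and can fail gracefully:
asDouble :: Maybe Double
asDouble = fromDynamic d   -- Just 3.14

asInt :: Maybe Int
asInt = fromDynamic d      -- Nothing: wrong type
```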

nh2 commented 11 years ago

> I have no idea why it's doing that, and have opened up a stack overflow question about it.

I think you should also post to haskell-cafe to get some attention.

mikeizbicki commented 11 years ago

I narrowed down the cause to being just that type families are ridiculously slow. I've posted a message to Haskell-cafe asking about it.

mikeizbicki commented 11 years ago

I've reported a bug in GHC about the slow type families also.

nh2 commented 11 years ago

I benchmarked that a bit with different sizes and got a clearly quadratic curve. Hopefully the GHC bug will be solved. Can you post a link?

mikeizbicki commented 11 years ago

http://ghc.haskell.org/trac/ghc/ticket/8095

mikeizbicki commented 11 years ago

Also, most of the type families in HLearn are hacks around the fact that true type-level Nats aren't natively supported in GHC 7.6. This is slated to be fixed in the 7.8 release: http://ghc.haskell.org/trac/ghc/ticket/4385.
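For illustration, with native type-level Nats in GHC.TypeLits the literal 200 is a first-class type with no Peano encoding. A minimal sketch:

```haskell
{-# LANGUAGE DataKinds #-}
import GHC.TypeLits (natVal)
import Data.Proxy (Proxy(..))

-- Reflect a type-level literal back to a run-time Integer;
-- the KnownNat constraint is solved automatically by GHC.
n :: Integer
n = natVal (Proxy :: Proxy 200)   -- 200
```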