mmaul / clml

Common Lisp Machine Learning Library

ARFF file format reader #32

Closed neil-lindquist closed 4 years ago

neil-lindquist commented 6 years ago

I wasn't sure whether this would fit better in this repository, in the clml.extras repository, or as a separate Quicklisp project/repo, so I created this issue to get feedback. If it were part of this repository, :arff could become another type for read-data-from-file.

ARFF is a file format that was created for use with Weka (a data mining program). It stores the column names and types in the header of the file, followed by the data in a CSV-like section. Other useful features include comments in the file and a specification for sparse data. I'm not sure how prevalent the ARFF format is, but the professor for my data mining class likes it, and ARFF readers exist for various languages (including R, Python, Java and C++).
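
To make the header/data split concrete, here is a small illustrative ARFF fragment for the iris data. It is a hand-written sketch, not necessarily the exact file used in the prototype; the attribute names are chosen to match the dimension names in the output further down.

```
% Lines starting with % are comments.
@relation iris

@attribute sepal-length numeric
@attribute sepal-width  numeric
@attribute petal-length numeric
@attribute petal-width  numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
```

Each @attribute line gives a column name and its type: numeric for the four measurements, and a brace-enclosed list of allowed values for the nominal class column. Because the types are declared in the file, a reader can presumably build a typed dataset directly, without a separate type spec as in the CSV case.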

ARFF specs: http://weka.wikispaces.com/ARFF%20%28stable%20version%29

I've started implementing this at https://github.com/neil-lindquist/clml-arff-prototype. Below is the output of loading the file csv-vs-arff.lisp with SBCL. The same data is loaded in ARFF and CSV formats. The decision trees are created with a pre-pruning epsilon of 0.05 to keep the output readable.

arff data: #<CLML.HJS.READ-DATA:NUMERIC-AND-CATEGORY-DATASET >
DIMENSIONS: sepal-length | sepal-width | petal-length | petal-width | class
TYPES:      NUMERIC | NUMERIC | NUMERIC | NUMERIC | CATEGORY
NUMBER OF DIMENSIONS: 5
CATEGORY DATA POINTS: 150 POINTS
NUMERIC DATA POINTS: 150 POINTS

csv data: #<CLML.HJS.READ-DATA:UNSPECIALIZED-DATASET >
DIMENSIONS: sepal-length | sepal-width | petal-length | petal-width | class
TYPES:      UNKNOWN | UNKNOWN | UNKNOWN | UNKNOWN | UNKNOWN
NUMBER OF DIMENSIONS: 5
DATA POINTS: 150 POINTS

arff decision tree:
[3.0d0 <= petal-length?]((Iris-setosa . 50) (Iris-versicolor . 50)
                         (Iris-virginica . 50))
   Yes->[1.7999999523162842d0 <= petal-width?]((Iris-virginica . 50)
                                               (Iris-versicolor . 50))
      Yes->((Iris-versicolor . 1) (Iris-virginica . 45))
      No->[5.0d0 <= petal-length?]((Iris-versicolor . 49) (Iris-virginica . 5))
         Yes->[1.600000023841858d0 <= petal-width?]((Iris-virginica . 4)
                                                    (Iris-versicolor . 2))
            Yes->[7.199999809265137d0 <= sepal-length?]((Iris-versicolor . 2)
                                                        (Iris-virginica . 1))
               Yes->((Iris-virginica . 1))
               No->((Iris-versicolor . 2))
            No->((Iris-virginica . 3))
         No->((Iris-virginica . 1) (Iris-versicolor . 47))
   No->((Iris-setosa . 50))

csv decision tree:
[3.0d0 <= petal-length?]((Iris-virginica . 50) (Iris-versicolor . 50)
                         (Iris-setosa . 50))
   Yes->[1.7999999523162842d0 <= petal-width?]((Iris-versicolor . 50)
                                               (Iris-virginica . 50))
      Yes->((Iris-virginica . 45) (Iris-versicolor . 1))
      No->[5.0d0 <= petal-length?]((Iris-virginica . 5) (Iris-versicolor . 49))
         Yes->[1.600000023841858d0 <= petal-width?]((Iris-versicolor . 2)
                                                    (Iris-virginica . 4))
            Yes->[7.0d0 <= sepal-length?]((Iris-virginica . 1)
                                          (Iris-versicolor . 2))
               Yes->((Iris-virginica . 1))
               No->((Iris-versicolor . 2))
            No->((Iris-virginica . 3))
         No->((Iris-versicolor . 47) (Iris-virginica . 1))
   No->((Iris-setosa . 50))

mmaul commented 6 years ago

Great idea!!!!! Let me know when you get your clml-arff reader finished with doc, examples and tests if you don't mind. Then I will pull it into clml.

guicho271828 commented 6 years ago

By the way, regarding file readers: hdf5-cffi was added to Quicklisp recently, so someone could work on an HDF5 reader too.