mbillingr / arff

ARFF file format serializer and deserializer
https://docs.rs/arff/
Apache License 2.0

Example of OpenML arff deserialization #5

Open rth opened 5 years ago

rth commented 5 years ago

Thanks for this crate!

I'm trying to load the MNIST dataset from OpenML in ARFF format, and so far the following program

extern crate serde;
extern crate arff;

use std::fs;
use serde::Deserialize;

fn main() {
    let contents = fs::read_to_string("/tmp/mnist_784.arff").expect("Error!");

    let unnamed_data: Vec<(f32,)> = arff::from_str(&contents).unwrap();
    println!("{:?}", unnamed_data);
}

panics due to column dtype validation,

   Compiling arff-rust-parser v0.1.0 (/home/rth/projects/scikit-learn/arff-rust-parser)
    Finished dev [unoptimized + debuginfo] target(s) in 0.54s
     Running `target/debug/arff-rust-parser`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidColumnType(TextPos { line: 11, column: 18 }, "\tREAL")', src/libcore/result.rs:999:5
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

possibly because the metadata says that the columns are real, while the values are integers:

@ATTRIBUTE pixel1       real
@ATTRIBUTE pixel2       real
@ATTRIBUTE pixel3       real

@DATA
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 [..] 0,0,5

It would be nice if there were an example of loading a dataset from OpenML. I am aware of openml-rust, but I'm looking for just a fast ARFF parser that I could use as a replacement for liac-arff.

mbillingr commented 5 years ago

Thank you for reporting that issue!

The problem is caused by the \t character between attribute name and data type, which is not parsed as whitespace. A quick fix for you would be to replace these tabs with spaces.

Unfortunately, the ARFF format is rather informally specified, so it is not clear whether the problem lies with the file or with my implementation. However, I will try to make the parser more tolerant because loading data from OpenML is my most important use case :)

I agree that a few examples would be nice to have. You got the code almost right, though.

This lets you load the data into nested Vecs:

let unnamed_data: Vec<Vec<f32>> = arff::from_str(&contents).unwrap();

(Note that this fragments the data in memory: each row (an image in this case) is contiguous, but the rows are allocated separately.)

Alternatively, if you want to load the whole data set into a contiguous block of memory:

let unnamed_data: Vec<f32> = arff::flat_from_str(&contents).unwrap();

rth commented 5 years ago

Thanks for your response!

I can confirm that adding a

let contents = contents.replace("\t", " ");

fixes the issue.
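A slightly narrower variant of this workaround, in case the data section might contain tabs that should be preserved, is to replace tabs only in the header. The sketch below assumes the header ends at the first line starting with `@DATA` (case-insensitive); `normalize_header_tabs` is a hypothetical helper, not part of the arff crate.

```rust
/// Hypothetical helper (not part of the arff crate): replace tabs with
/// spaces in the ARFF header only, leaving the data section untouched.
/// Assumes the header ends at the first line starting with "@data"
/// (compared case-insensitively).
fn normalize_header_tabs(contents: &str) -> String {
    let mut out = String::with_capacity(contents.len());
    let mut in_data = false;
    // split_inclusive keeps the line terminators, so the output is
    // otherwise byte-for-byte identical to the input.
    for line in contents.split_inclusive('\n') {
        if in_data {
            out.push_str(line);
        } else {
            out.push_str(&line.replace('\t', " "));
            if line.trim().to_ascii_lowercase().starts_with("@data") {
                in_data = true;
            }
        }
    }
    out
}
```

It would be used in place of the plain `replace` call, e.g. `let contents = normalize_header_tabs(&contents);`, before passing `&contents` to `arff::from_str`.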

On my laptop this loads MNIST in 4.6 s, compared to 17 s with liac-arff, which is quite nice (though I guess some additional overhead is to be expected from converting the output to a contiguous array).
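For reference, the conversion mentioned here could look roughly like the following when starting from the nested `Vec<Vec<f32>>` result. This is only a sketch using the standard library; `flatten_rows` is a hypothetical name, and it assumes every row has the same length.

```rust
/// Hypothetical helper: copy row-wise nested data into one contiguous,
/// row-major buffer. Returns the buffer and the number of columns per row,
/// or None if the rows have differing lengths.
fn flatten_rows(rows: &[Vec<f32>]) -> Option<(Vec<f32>, usize)> {
    let n_cols = rows.first().map_or(0, |r| r.len());
    // The total size is known up front, so a single allocation suffices.
    let mut flat = Vec::with_capacity(rows.len() * n_cols);
    for row in rows {
        if row.len() != n_cols {
            return None;
        }
        flat.extend_from_slice(row);
    }
    Some((flat, n_cols))
}
```

With the result `(flat, n_cols)`, the value at row `r` and column `c` is `flat[r * n_cols + c]`. The copy is one extra pass over the data, which is the overhead in question.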

mbillingr commented 5 years ago

Good to hear about the performance difference. Your laptop is fast... This takes about 6s on my desktop machine.

I would have expected loading into a contiguous array to be faster, but it turns out to be slightly slower than the Vec<Vec<_>> variant. Perhaps it needs to reallocate more often because the size is not known in advance. I'm not familiar enough with Serde internals to say for sure.
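To illustrate the general effect being guessed at here (this says nothing about what Serde or the arff crate actually does internally), the sketch below counts how often a `Vec` changes capacity while it is filled, once starting empty and once pre-sized. `count_capacity_changes` is a hypothetical helper, and a capacity change is only a proxy for a reallocation.

```rust
/// Hypothetical helper: push `n` values into `v` and count how often its
/// capacity changes. Each change corresponds to the Vec growing its buffer,
/// which may involve copying the existing elements.
fn count_capacity_changes(mut v: Vec<f32>, n: usize) -> usize {
    let mut changes = 0;
    let mut capacity = v.capacity();
    for i in 0..n {
        v.push(i as f32);
        if v.capacity() != capacity {
            changes += 1;
            capacity = v.capacity();
        }
    }
    changes
}
```

Calling `count_capacity_changes(Vec::new(), n)` reports at least one growth step for any `n > 0`, while `count_capacity_changes(Vec::with_capacity(n), n)` reports none, since a `Vec` does not reallocate until its length exceeds its capacity.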