Open rth opened 5 years ago
Thank you for reporting that issue!
The problem is caused by the \t character between attribute name and data type, which is not parsed as whitespace. A quick fix for you would be to replace these tabs with spaces.
Unfortunately, the ARFF format is rather informally specified so it is not clear if that is a problem with the file or with my implementation. However, I will try to make the parser more tolerant because loading data from OpenML is my most important use case :)
I agree that a few examples would be nice to have. You got the code almost right, though.
This lets you load the data into nested Vecs:
let unnamed_data: Vec<Vec<f32>> = arff::from_str(&contents).unwrap();
(Note that this fragments the data in memory: columns (images in this case) are contiguous but rows are not.)
Alternatively, if you want to load the whole data set into a contiguous block of memory:
let unnamed_data: Vec<f32> = arff::flat_from_str(&contents).unwrap();
Thanks for your response!
I can confirm that adding a
let contents = contents.replace("\t", " ");
fixes the issue.
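For anyone who wants a slightly more conservative version of this workaround, here is a minimal sketch in plain Rust. The helper name normalize_header_tabs is hypothetical and not part of the arff crate. It assumes that only header lines (those starting with '@') need fixing and that no quoted attribute name contains a tab; data rows are left untouched.

```rust
/// Hypothetical helper (not part of the arff crate): replace tabs with
/// spaces, but only on header lines (those starting with '@'), so that
/// tabs inside data values are left alone.
///
/// Note: line endings are normalized to '\n' and a trailing newline,
/// if present, is dropped.
fn normalize_header_tabs(contents: &str) -> String {
    contents
        .lines()
        .map(|line| {
            if line.trim_start().starts_with('@') {
                // Header line such as "@attribute pixel1\treal".
                line.replace('\t', " ")
            } else {
                // Data row, comment, or blank line: keep as-is.
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

The result can then be passed to the parser in place of the original contents string.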
So on my laptop this loads MNIST in 4.6s as compared to 17s with liac-arff, which is quite nice (though I guess there is additional overhead to be expected in converting the output to a contiguous array).
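To give an idea of what that conversion step could look like, here is a minimal sketch in plain Rust. The helper name flatten_rows is hypothetical and not part of the arff crate; it assumes every inner vector has the same length and returns None otherwise.

```rust
/// Hypothetical helper (not part of the arff crate): copy nested vectors
/// into one contiguous buffer. Returns the flat data and the length of
/// each inner vector, or None if the inner vectors differ in length.
fn flatten_rows(rows: &[Vec<f32>]) -> Option<(Vec<f32>, usize)> {
    let width = rows.first().map_or(0, |r| r.len());
    if rows.iter().any(|r| r.len() != width) {
        return None;
    }
    // Reserve the full size up front so the buffer is allocated once.
    let mut flat = Vec::with_capacity(rows.len() * width);
    for row in rows {
        flat.extend_from_slice(row);
    }
    Some((flat, width))
}
```

This costs one extra pass and one extra copy of the data on top of the parse itself.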
Good to hear about the performance difference. Your laptop is fast... This takes about 6s on my desktop machine.
I would have expected loading into a contiguous array to be faster but it turns out to be slightly slower than the Vec<Vec<_>> variant. Perhaps it needs to reallocate more often as the size is not known in advance. I'm not familiar enough with Serde internals to say for sure.
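The reallocation guess can be illustrated in isolation with a small sketch in plain Rust. It says nothing about what Serde or the crate actually does internally; it only shows the general effect of growing a Vec without knowing the final size. The function name count_capacity_changes is made up for this example.

```rust
/// Push `n` values into `v` and count how often the capacity changes,
/// which serves as a proxy for how often the buffer was reallocated.
fn count_capacity_changes(mut v: Vec<f32>, n: usize) -> usize {
    let mut changes = 0;
    let mut cap = v.capacity();
    for i in 0..n {
        v.push(i as f32);
        if v.capacity() != cap {
            // The buffer grew, so the Vec had to reallocate.
            changes += 1;
            cap = v.capacity();
        }
    }
    changes
}
```

Calling it with Vec::with_capacity(n) reports no capacity changes, while starting from Vec::new() reports several. Since Vec growth is amortized, the number of reallocations stays small even for large inputs, so this alone may not explain the difference.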
Thanks for this crate!
I'm trying to load the MNIST dataset from OpenML in Arff format, and so far
panics due to column dtype validation,
possibly because the metadata says that columns are real, while they are integers.
It would be nice if there was an example of loading a dataset from OpenML. I am aware of openml-rust but I'm looking for just a fast ARFF parser that I could use as a replacement for liac-arff.