Open Devourian opened 1 year ago
For uint8 you need to check that you've enabled the rust feature flag for that data type
Could you see which problems persist if you use the python api?
Thanks @kylebarron, I will check it out. As far as I understand, I need to use the dtype-u8 feature flag?
Do you know where I can find documentation about polars feature flags? I was only able to find the list of all polars feature flags, but without explanation.
Hi @ritchie46, no problem, I will check it out, as well as @kylebarron's suggestion regarding UInt8 support.
I will probably do it today at 8 PM CET (UTC+1).
Thank you both for the support.
For uint8 you need to check that you've enabled the rust feature flag for that data type
Sadly, it still doesn't work even after adding the dtype-u8 feature flag.
My polars entry in the Cargo.toml file:
polars = { version="0.26.1", features=["dtype-u8"] }
I have created a unit test for this behaviour and it fails, so now anyone can reproduce it:
#[cfg(test)]
mod test_polars_csv_reader {
use std::{
fs,
path::Path,
};
use polars::{
prelude::*,
datatypes::DataType::{UInt8, Float64},
};
fn write_file(path: &Path, data: &str) {
return fs::write(path, data).expect("Unable to write file");
}
fn prepare_schema() -> Schema {
let mut schema: Schema = Schema::new();
schema.with_column(String::from("MPG"), Float64);
schema.with_column(String::from("Cylinders"), UInt8);
schema.with_column(String::from("Displacement"), Float64);
schema.with_column(String::from("Horsepower"), Float64);
schema.with_column(String::from("Weight"), Float64);
schema.with_column(String::from("Acceleration"), Float64);
schema.with_column(String::from("Model Year"), UInt8);
schema.with_column(String::from("Origin"), UInt8);
return schema;
}
#[test]
fn test_uint8_fields_are_correctly_loaded() {
// Arrange
let data_to_save = "\
18.0,8,307.0,130.0,3504.,12.0,70,1\n\
15.0,8,350.0,165.0,3693.,11.5,70,1\n\
";
let dataset_file_path = Path::new("auto-mpg.data");
write_file(dataset_file_path, data_to_save);
let schema = prepare_schema();
let expected_data: DataFrame = df!(
"MPG" => &[18.0f64, 15.0f64],
"Cylinders" => &[8u8, 8u8],
"Displacement" => &[307.0f64, 350.0f64],
"Horsepower" => &[130.0f64, 165.0f64],
"Weight" => &[3504.0f64, 3693.0f64],
"Acceleration" => &[12.0f64, 11.5f64],
"Model Year" => &[70u8, 70u8],
"Origin" => &[1u8, 1u8],
).unwrap();
// Act
let data = CsvReader::from_path(dataset_file_path)
.unwrap()
.has_header(false)
.with_delimiter(b',')
.with_schema(&schema)
.finish()
.unwrap();
// Assert
assert_eq!(expected_data, data);
}
}
Now everyone can reproduce the bug. It results in the following error:
thread 'test_polars_csv_reader::test_uint8_fields_are_correctly_loaded' panicked at 'called `Result::unwrap()` on an `Err` value: ComputeError(Owned("Unsupported data type UInt8 when reading a csv"))', src/main.rs:69:14
stack backtrace:
0: rust_begin_unwind
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
2: core::result::unwrap_failed
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/result.rs:1791:5
3: core::result::Result<T,E>::unwrap
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/result.rs:1113:23
4: fuel_efficiency_predictor::test_polars_csv_reader::test_uint8_fields_are_correctly_loaded
at ./src/main.rs:62:20
5: fuel_efficiency_predictor::test_polars_csv_reader::test_uint8_fields_are_correctly_loaded::{{closure}}
at ./src/main.rs:40:5
6: core::ops::function::FnOnce::call_once
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:251:5
7: core::ops::function::FnOnce::call_once
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test test_polars_csv_reader::test_uint8_fields_are_correctly_loaded ... FAILED
I have debugged it a little, and it looks like these lines in the module polars-io-0.26.1/src/csv/buffer.rs produce the error that makes the test panic:
let builder = match dtype {
&DataType::Boolean => Buffer::Boolean(BooleanChunkedBuilder::new(name, capacity)),
&DataType::Int32 => Buffer::Int32(PrimitiveChunkedBuilder::new(name, capacity)),
&DataType::Int64 => Buffer::Int64(PrimitiveChunkedBuilder::new(name, capacity)),
&DataType::UInt32 => Buffer::UInt32(PrimitiveChunkedBuilder::new(name, capacity)),
&DataType::UInt64 => Buffer::UInt64(PrimitiveChunkedBuilder::new(name, capacity)),
&DataType::Float32 => Buffer::Float32(PrimitiveChunkedBuilder::new(name, capacity)),
&DataType::Float64 => Buffer::Float64(PrimitiveChunkedBuilder::new(name, capacity)),
&DataType::Utf8 => Buffer::Utf8(Utf8Field::new(
name,
capacity,
str_capacity,
quote_char,
encoding,
ignore_errors,
)),
#[cfg(feature = "dtype-datetime")]
&DataType::Datetime(tu, _) => Buffer::Datetime {
buf: DatetimeField::new(name, capacity),
tu,
},
#[cfg(feature = "dtype-date")]
&DataType::Date => Buffer::Date(DatetimeField::new(name, capacity)),
#[cfg(feature = "dtype-categorical")]
&DataType::Categorical(_) => {
Buffer::Categorical(CategoricalField::new(name, capacity, quote_char))
}
other => {
return Err(PolarsError::ComputeError(
format!("Unsupported data type {other:?} when reading a csv").into(),
))
}
};
Ok(builder)
It seems that the UInt8 type goes into the other branch of the match statement, which results in the ComputeError.
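To make the fall-through concrete, here is a minimal, std-only sketch of the same pattern. It does not use polars: the DataType and Buffer enums and the init_buffer function are hypothetical stand-ins, reduced to a few variants. The only point is that a variant without its own match arm ends up in the catch-all arm and comes back as an error.

```rust
// Hypothetical, simplified stand-ins for the polars types quoted above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    UInt8,
    Float64,
}

#[derive(Debug, PartialEq)]
enum Buffer {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
}

// Mirrors the shape of the quoted match: known types get a buffer,
// everything else falls into the catch-all arm.
fn init_buffer(dtype: DataType, capacity: usize) -> Result<Buffer, String> {
    let buffer = match dtype {
        DataType::Int64 => Buffer::Int64(Vec::with_capacity(capacity)),
        DataType::Float64 => Buffer::Float64(Vec::with_capacity(capacity)),
        // There is no arm for UInt8, so it is reported as unsupported.
        other => {
            return Err(format!(
                "Unsupported data type {other:?} when reading a csv"
            ))
        }
    };
    Ok(buffer)
}
```

Calling init_buffer(DataType::UInt8, 4) returns the Err branch, while Float64 and Int64 return a buffer. In this sketch the fix would be an extra arm (and Buffer variant) for UInt8.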
Perhaps this is useful information:
Using .with_dtypes(Some(&schema)) instead of .with_schema() generated a different error:
`Err` value: NotFound(Owned("Cylinders"))'
(The .with_schema() docs say "It is recommended to use with_dtypes instead." - I'm not sure why?)
I changed the names to column_1 .. column_8:
schema.with_column(String::from("column_1"), Float64);
schema.with_column(String::from("column_2"), UInt8);
schema.with_column(String::from("column_3"), Float64);
schema.with_column(String::from("column_4"), Float64);
schema.with_column(String::from("column_5"), Float64);
schema.with_column(String::from("column_6"), Float64);
schema.with_column(String::from("column_7"), UInt8);
schema.with_column(String::from("column_8"), UInt8);
And it created the dataframe without error.
To rename the columns with .set_column_names() I had to make it mutable:
let mut data = ...;
data.set_column_names();
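As a rough illustration of the rename step, the sketch below pairs the default names with the intended ones. It is std-only and does not call polars. The rename_plan helper is hypothetical, and it assumes the reader names headerless columns column_1, column_2, and so on, as the working schema above suggests.

```rust
// Hypothetical helper: pairs each assumed default column name
// (column_1, column_2, ...) with the name it should be renamed to.
fn rename_plan(intended: &[&str]) -> Vec<(String, String)> {
    intended
        .iter()
        .enumerate()
        // Default names are assumed to be 1-based, hence i + 1.
        .map(|(i, name)| (format!("column_{}", i + 1), name.to_string()))
        .collect()
}
```

The first element of each pair is the key to use in the dtype schema, and the second is the name to apply afterwards.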
Using .with_dtypes_slice() resulted in the same ComputeError(Owned("Unsupported data type UInt8 when reading a csv")) error as .with_schema() - I guess it hits the same codepath.
In Python, read_csv() has a new_columns parameter, which also worked:
pl.read_csv(io.StringIO(csv), has_header=False, new_columns=list(schema), dtypes=schema)
https://github.com/pola-rs/polars/pull/7290 fixed the lack of support for 8- and 16-bit integers in the normal reader, but it still fails for read_csv_batched.
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
I have found multiple issues when trying to read the classic Auto MPG dataset from the TensorFlow "Basic regression: Predict fuel efficiency" tutorial. There you can see that this dataset was nicely loaded with the pandas.read_csv method.
I was trying to achieve the same behaviour using the Rust version of polars, but I was blocked by multiple issues:
1. Multiple-character delimiters
The columns of this dataset file are delimited by multiple spaces. I couldn't set multiple spaces as the delimiter, nor could I find a substitute for the pandas skipinitialspace argument in polars CsvReader, so I set a single space as the delimiter: .with_delimiter(b' '), but the file wasn't loaded properly, even without a schema. With a schema provided, it doesn't load at all (which is expected, I think). I thought that multi-character delimiters weren't implemented in polars, but I found this issue on this topic, so I think it should work in polars = "0.26.1", which I'm using.
After pre-parsing the dataset file (which I would like to avoid), changing the delimiters to a single , character, I was able to load the file correctly, but found other issues.
2. Comment char not working
The comment char .with_comment_char(Some(b'\t')) doesn't seem to work with this dataset, but it probably should.
3. UInt8 type for column values doesn't work
The schema doesn't seem to work for UInt8 typed values, even when the column contains values which should fit in UInt8, e.g. the origin column no. 7 (counting from 0).
It works when Int64 is specified as the type of this column in the schema, but doesn't work for UInt8 - as far as I understand, a lot of space could be saved if that worked.
Maybe these issues come from my misunderstanding of polars usage or my limited experience with the Rust language. I would be thankful if someone from the development team could confirm the issues or point me towards the result that I want to achieve.
Reproducible example
Paste the code below into main.rs and put it alongside the extracted auto-mpg.data dataset file, then run main.rs.
auto-mpg.zip
Expected behavior
The file should be loaded correctly without pre-parsing, as it is with pandas in the TensorFlow tutorial.
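For reference, the pre-parsing workaround mentioned in the issue description could look roughly like the std-only sketch below. The normalize_line and normalize helpers are hypothetical. The sketch assumes that each line holds whitespace-aligned numeric columns and that anything after the first tab (in this dataset, presumably the car name) can be dropped.

```rust
// Hypothetical helper: turns one whitespace-aligned line into a
// comma-separated line. Assumes everything after the first tab is dropped.
fn normalize_line(line: &str) -> String {
    // Keep only the part before the first tab, if there is one.
    let numeric_part = line.split('\t').next().unwrap_or("");
    // split_whitespace treats any run of spaces as one separator.
    numeric_part
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(",")
}

// Hypothetical helper: normalizes a whole file and skips blank lines.
fn normalize(data: &str) -> String {
    data.lines()
        .map(normalize_line)
        .filter(|line| !line.is_empty())
        .map(|line| line + "\n")
        .collect()
}
```

The output of normalize can be written to a temporary file and read with a single , delimiter, as in the unit test earlier in the thread.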
Installed versions