pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`polars::frame::csv::CsvReader` multiple issues #6276

Open Devourian opened 1 year ago

Devourian commented 1 year ago

Polars version checks

Issue description

I have found multiple issues when trying to read the classic Auto MPG dataset from the TensorFlow "Basic regression: Predict fuel efficiency" tutorial.

In that tutorial you can see that the dataset is loaded nicely with the pandas.read_csv method.

I was trying to achieve the same behaviour using the Rust version of polars, but I was blocked by multiple issues:

1. Multiple-character delimiters

The columns in this dataset file are delimited by multiple spaces. I couldn't set multiple spaces as the delimiter, nor could I find an equivalent of the pandas skipinitialspace argument in polars' CsvReader, so I set a single space as the delimiter (.with_delimiter(b' ')), but the file wasn't loaded properly, even without a schema.

With a schema provided, it doesn't load at all (which I think is expected). I thought that multi-character delimiters weren't implemented in polars, but I found this issue on the topic, so I think it should work in polars = "0.26.1", which I'm using.


After pre-processing the dataset file (which I would like to avoid) to change the delimiters to a single , character, I was able to load the file correctly, but I found other issues.
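
Roughly, the pre-processing step I mean looks like the sketch below (just an illustration, not exactly what I ran; file names are placeholders). It keeps only the space-delimited numeric part of each line and joins the fields with commas:

use std::fs;

fn main() -> std::io::Result<()> {
    let raw = fs::read_to_string("auto-mpg.data")?;
    let converted: String = raw
        .lines()
        .map(|line| {
            // keep the numeric fields before the tab (the quoted car name follows it),
            // then collapse the runs of spaces into single commas
            let numeric_part = line.split('\t').next().unwrap_or(line);
            numeric_part.split_whitespace().collect::<Vec<_>>().join(",")
        })
        .collect::<Vec<_>>()
        .join("\n");
    fs::write("auto-mpg.csv", converted)
}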

2. Comment char not working

The comment char option .with_comment_char(Some(b'\t')) does not seem to work with this dataset, but it probably should.

3. UInt8 type for column values doesn't work

The schema also does not seem to work for UInt8-typed values, even when the column contains values that should fit in UInt8, e.g. the Origin column, no. 7 (counting from 0).

It works when Int64 is specified as the type of this column in the schema, but it doesn't work for UInt8. As far as I understand, a lot of space could be saved if this worked, since UInt8 needs 1 byte per value versus 8 bytes for Int64.


Maybe these issues come from my misunderstanding of polars usage or my limited experience with the Rust language. I would be grateful if someone from the development team could confirm the issues or point me towards the result I want to achieve.

Reproducible example

Paste the code below into main.rs, place it alongside the extracted auto-mpg.data dataset file, then run it.

use std::path::Path;

use polars::prelude::*;
use polars::datatypes::DataType::{Int64, Float64};

fn main() {
    let dataset_file_path: &Path = Path::new("auto-mpg.data");
    let dataset: DataFrame = read_data_frame_from_csv(dataset_file_path);
    println!("{dataset:#?}");
}

fn read_data_frame_from_csv(
    csv_file_path: &Path,
) -> DataFrame {
    let mut schema: Schema = Schema::new();
    schema.with_column(String::from("MPG"), Float64);
    schema.with_column(String::from("Cylinders"), Int64);
    schema.with_column(String::from("Displacement"), Float64);
    schema.with_column(String::from("Horsepower"), Float64);
    schema.with_column(String::from("Weight"), Float64);
    schema.with_column(String::from("Acceleration"), Float64);
    schema.with_column(String::from("Model Year"), Int64);
    schema.with_column(String::from("Origin"), Int64);

    return CsvReader::from_path(csv_file_path)
        .unwrap()
        .has_header(false)
        .with_delimiter(b' ')
        // attempt to skip the quoted car-name text that follows a tab on each line (see issue 2)
        .with_comment_char(Some(b'\t'))
        // '?' marks missing values (e.g. in the Horsepower column) in this dataset
        .with_null_values(Some(NullValues::AllColumnsSingle(String::from("?"))))
        .with_schema(&schema)
        .finish()
        .unwrap();
}

auto-mpg.zip

Expected behavior

The file should be loaded correctly without pre-processing, as it is with pandas in the TensorFlow tutorial.

Installed versions

polars = "0.26.1"
kylebarron commented 1 year ago

For uint8 you need to check that you've enabled the rust feature flag for that data type

ritchie46 commented 1 year ago

Could you see which problems persist if you use the python api?

Devourian commented 1 year ago

Thanks @kylebarron, I will check it out. As far as I understand, I need to use the dtype-u8 feature flag?

Do you know where I can find documentation about the polars feature flags?

I was only able to find the list of all polars feature flags, but without explanations.


Hi @ritchie46, no problem, I will check it out, along with @kylebarron's suggestion regarding UInt8 support.

I will probably do it today at 8 PM CET (UTC+1).

Thank you both for the support šŸ¤

Devourian commented 1 year ago

For uint8 you need to check that you've enabled the rust feature flag for that data type

Sadly, it still doesn't work, even after adding the dtype-u8 feature flag. My polars entry in the Cargo.toml file:

polars = { version="0.26.1", features=["dtype-u8"] }

I have created a unit test for this behaviour and it fails, so anyone can reproduce it:

#[cfg(test)]
mod test_polars_csv_reader {
    use std::{
        fs,
        path::Path,
    };

    use polars::{
        prelude::*,
        datatypes::DataType::{UInt8, Float64},
    };

    fn write_file(path: &Path, data: &str) {
        return fs::write(path, data).expect("Unable to write file");
    }

    fn prepare_schema() -> Schema {
        let mut schema: Schema = Schema::new();

        schema.with_column(String::from("MPG"), Float64);
        schema.with_column(String::from("Cylinders"), UInt8);
        schema.with_column(String::from("Displacement"), Float64);
        schema.with_column(String::from("Horsepower"), Float64);
        schema.with_column(String::from("Weight"), Float64);
        schema.with_column(String::from("Acceleration"), Float64);
        schema.with_column(String::from("Model Year"), UInt8);
        schema.with_column(String::from("Origin"), UInt8);

        return schema;
    }

    #[test]
    fn test_uint8_fields_are_correctly_loaded() {
        // Arrange
        let data_to_save = "\
            18.0,8,307.0,130.0,3504.,12.0,70,1\n\
            15.0,8,350.0,165.0,3693.,11.5,70,1\n\
        ";
        let dataset_file_path = Path::new("auto-mpg.data");
        write_file(dataset_file_path, data_to_save);

        let schema = prepare_schema();
        let expected_data: DataFrame = df!(
            "MPG" => &[18.0f64, 15.0f64],
            "Cylinders" => &[8u8, 8u8],
            "Displacement" => &[307.0f64, 350.0f64],
            "Horsepower" => &[130.0f64, 165.0f64],
            "Weight" => &[3504.0f64, 3693.0f64],
            "Acceleration" => &[12.0f64, 11.5f64],
            "Model Year" => &[70u8, 70u8],
            "Origin" => &[1u8, 1u8],
        ).unwrap();

        // Act
        let data = CsvReader::from_path(dataset_file_path)
            .unwrap()
            .has_header(false)
            .with_delimiter(b',')
            .with_schema(&schema)
            .finish()
            .unwrap();

        // Assert
        assert_eq!(expected_data, data);
    }
}

Running it results in the following error:

thread 'test_polars_csv_reader::test_uint8_fields_are_correctly_loaded' panicked at 'called `Result::unwrap()` on an `Err` value: ComputeError(Owned("Unsupported data type UInt8 when reading a csv"))', src/main.rs:69:14
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: core::result::unwrap_failed
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/result.rs:1791:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/result.rs:1113:23
   4: fuel_efficiency_predictor::test_polars_csv_reader::test_uint8_fields_are_correctly_loaded
             at ./src/main.rs:62:20
   5: fuel_efficiency_predictor::test_polars_csv_reader::test_uint8_fields_are_correctly_loaded::{{closure}}
             at ./src/main.rs:40:5
   6: core::ops::function::FnOnce::call_once
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:251:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test test_polars_csv_reader::test_uint8_fields_are_correctly_loaded ... FAILED

I have debugged it a bit, and it looks like the module polars-io-0.26.1/src/csv/buffer.rs and these lines are causing the program to panic:

            let builder = match dtype {
                &DataType::Boolean => Buffer::Boolean(BooleanChunkedBuilder::new(name, capacity)),
                &DataType::Int32 => Buffer::Int32(PrimitiveChunkedBuilder::new(name, capacity)),
                &DataType::Int64 => Buffer::Int64(PrimitiveChunkedBuilder::new(name, capacity)),
                &DataType::UInt32 => Buffer::UInt32(PrimitiveChunkedBuilder::new(name, capacity)),
                &DataType::UInt64 => Buffer::UInt64(PrimitiveChunkedBuilder::new(name, capacity)),
                &DataType::Float32 => Buffer::Float32(PrimitiveChunkedBuilder::new(name, capacity)),
                &DataType::Float64 => Buffer::Float64(PrimitiveChunkedBuilder::new(name, capacity)),
                &DataType::Utf8 => Buffer::Utf8(Utf8Field::new(
                    name,
                    capacity,
                    str_capacity,
                    quote_char,
                    encoding,
                    ignore_errors,
                )),
                #[cfg(feature = "dtype-datetime")]
                &DataType::Datetime(tu, _) => Buffer::Datetime {
                    buf: DatetimeField::new(name, capacity),
                    tu,
                },
                #[cfg(feature = "dtype-date")]
                &DataType::Date => Buffer::Date(DatetimeField::new(name, capacity)),
                #[cfg(feature = "dtype-categorical")]
                &DataType::Categorical(_) => {
                    Buffer::Categorical(CategoricalField::new(name, capacity, quote_char))
                }
                other => {
                    return Err(PolarsError::ComputeError(
                        format!("Unsupported data type {other:?} when reading a csv").into(),
                    ))
                }
            };
            Ok(builder)

It seems that the UInt8 type falls into the other branch of the match statement, which results in the ComputeError.
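
As a stopgap (not a proper fix), a sketch of a workaround I can use in the meantime is to let the reader parse the column with a supported integer type and cast it down to UInt8 afterwards. This assumes the dtype-u8 feature is enabled and that the headerless, comma-delimited file (as in the test above) gets the default column_1..column_8 names:

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Parse the comma-delimited file with inferred/default dtypes first.
    let mut df = CsvReader::from_path("auto-mpg.data")?
        .has_header(false)
        .with_delimiter(b',')
        .finish()?;

    // Downcast the Origin column (default name "column_8") to UInt8 after loading.
    let origin_u8 = df.column("column_8")?.cast(&DataType::UInt8)?;
    df.replace("column_8", origin_u8)?;

    println!("{df:?}");
    Ok(())
}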

cmdlineluser commented 1 year ago

Perhaps this is useful information:

Using .with_dtypes(Some(&schema)) instead of .with_schema() generated a different error:

`Err` value: NotFound(Owned("Cylinders"))'

(The .with_schema() docs say "It is recommended to use with_dtypes instead." - I'm not sure why?)

I changed the names to column_1 .. column_8:

schema.with_column(String::from("column_1"), Float64);
schema.with_column(String::from("column_2"), UInt8);
schema.with_column(String::from("column_3"), Float64);
schema.with_column(String::from("column_4"), Float64);
schema.with_column(String::from("column_5"), Float64);
schema.with_column(String::from("column_6"), Float64);
schema.with_column(String::from("column_7"), UInt8);
schema.with_column(String::from("column_8"), UInt8);

And it created the dataframe without error.

To rename the columns with .set_column_names() I had to make it mutable:

let mut data = ...;
data.set_column_names();
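
Putting the pieces together, a rough end-to-end sketch of this workaround (the file name and reader options here are only for illustration, matching the comma-delimited file from the earlier comments):

use polars::prelude::*;
use polars::datatypes::DataType::{Float64, UInt8};

fn main() -> PolarsResult<()> {
    // Use the default column_1..column_8 names in the dtype schema so that
    // .with_dtypes() can find the columns of the headerless file.
    let mut schema: Schema = Schema::new();
    schema.with_column(String::from("column_1"), Float64);
    schema.with_column(String::from("column_2"), UInt8);
    schema.with_column(String::from("column_3"), Float64);
    schema.with_column(String::from("column_4"), Float64);
    schema.with_column(String::from("column_5"), Float64);
    schema.with_column(String::from("column_6"), Float64);
    schema.with_column(String::from("column_7"), UInt8);
    schema.with_column(String::from("column_8"), UInt8);

    let mut data = CsvReader::from_path("auto-mpg.data")?
        .has_header(false)
        .with_delimiter(b',')
        .with_dtypes(Some(&schema))
        .finish()?;

    // Rename the default column_* names to the real ones afterwards.
    data.set_column_names(&[
        "MPG", "Cylinders", "Displacement", "Horsepower",
        "Weight", "Acceleration", "Model Year", "Origin",
    ])?;

    println!("{data:?}");
    Ok(())
}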

Using .with_dtypes_slice() resulted in the same ComputeError(Owned("Unsupported data type UInt8 when reading a csv")) error as .with_schema() - I guess it hits the same codepath.

In Python, read_csv() has a new_columns parameter, which also worked:

pl.read_csv(io.StringIO(csv), has_header=False, new_columns=list(schema), dtypes=schema)
dpinol commented 11 months ago

https://github.com/pola-rs/polars/pull/7290 fixed the missing support for 8- and 16-bit integer types in the normal reader, but it still fails for read_csv_batched.