jblondin / csv-sniffer

CSV sniffer crate for Rust
MIT License

Invalid delimiter found #18

Open thomas9911 opened 1 year ago

thomas9911 commented 1 year ago

Using this dataset: https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-100000.csv

The sniffer reports ':' as the delimiter, while it is clearly ','.

Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,8cC6B5992C0309c,Acevedo LLC,https://www.donovan.com/,Holy See (Vatican City State),Multi-channeled bottom-line core,2019,Graphic Design / Web Design,7070
2,ec094061FeaF7Bc,Walls-Mcdonald,http://arias-willis.net/,Lithuania,Compatible encompassing groupware,2005,Utilities,8156
3,DAcC5dbc58946A7,Gregory PLC,http://www.lynch-hoover.net/,Tokelau,Multi-channeled intangible help-desk,2019,Leisure / Travel,6121
4,8Dd7beDa37FbeD0,"Byrd, Patterson and Knox",https://www.james-velez.net/,Netherlands,Pre-emptive national function,1982,Furniture,3494
5,a3b5c54AEC163e4,Mcdowell-Hopkins,http://fuentes.com/,Mayotte,Cloned bifurcated solution,2016,Online Publishing,36
6,fDfEBeFDaEb59Af,Hayden and Sons,https://www.shaw-mooney.info/,Belize,Persistent mobile task-force,1978,Insurance,7010
7,752ef90Eae1f7f5,Castro LLC,http://wilkinson.com/,Jamaica,Advanced value-added definition,2008,Outsourcing / Offshoring,2526

Code (similar to the example found in this repo):

extern crate csv_sniffer;

use std::path::Path;

use csv_sniffer::{SampleSize, Sniffer};

fn main() {
    let data_filepath = Path::new(file!())
        .parent()
        .unwrap()
        .join("../data.csv");
    let dialect = Sniffer::new()
        .sample_size(SampleSize::All)
        .sniff_path(data_filepath)
        .unwrap();
    println!("{:#?}", dialect);
}

Output:

Metadata {
    dialect: Dialect {
        delimiter: ':',
        header: Header {
            has_header_row: true,
            num_preamble_rows: 1,
        },
        quote: None,
        flexible: false,
    },
    num_fields: 2,
    types: [
        Text,
        Text,
    ],
}

Is this a known issue?

PS: Other sample_size settings also give the wrong delimiter.
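
A possible explanation, not checked against the crate's source: every data row contains exactly one ':' (inside the URL), while the header row contains none. A heuristic that looks for a character with a consistent per-line count could therefore read the file as two ':'-separated fields with one preamble row, which would match the num_fields: 2 and num_preamble_rows: 1 in the output above. The sketch below only counts characters on the sample rows from this report, ignoring quoting. The helper name counts_per_line is made up for illustration and is not part of csv_sniffer.

```rust
// Header plus the first four records, copied from the sample in this issue.
const SAMPLE: &str = r#"Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,8cC6B5992C0309c,Acevedo LLC,https://www.donovan.com/,Holy See (Vatican City State),Multi-channeled bottom-line core,2019,Graphic Design / Web Design,7070
2,ec094061FeaF7Bc,Walls-Mcdonald,http://arias-willis.net/,Lithuania,Compatible encompassing groupware,2005,Utilities,8156
3,DAcC5dbc58946A7,Gregory PLC,http://www.lynch-hoover.net/,Tokelau,Multi-channeled intangible help-desk,2019,Leisure / Travel,6121
4,8Dd7beDa37FbeD0,"Byrd, Patterson and Knox",https://www.james-velez.net/,Netherlands,Pre-emptive national function,1982,Furniture,3494"#;

/// Hypothetical helper: counts how often `delim` occurs on each line.
/// Quoting is deliberately ignored, so a comma inside a quoted field
/// is counted like any other comma.
fn counts_per_line(text: &str, delim: char) -> Vec<usize> {
    text.lines()
        .map(|line| line.chars().filter(|&c| c == delim).count())
        .collect()
}

fn main() {
    // Compare the real delimiter with the one the sniffer reported.
    for delim in [',', ':'] {
        println!("{:?}: {:?}", delim, counts_per_line(SAMPLE, delim));
    }
}
```

On these five lines the comma counts are 8, 8, 8, 8, 9 (the quoted "Byrd, Patterson and Knox" adds one), and the colon counts are 0, 1, 1, 1, 1. If quotes are not taken into account (the output shows quote: None), ':' looks more regular than ',' once the header is treated as preamble. Whether this is what actually happens inside the sniffer is an assumption.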