modelfoxdotdev / modelfox

ModelFox makes it easy to train, deploy, and monitor machine learning models.

Be able to specify the maximum number of unique values for an enum #91

Open spullara opened 2 years ago

spullara commented 2 years ago

I am getting text search instead of an enum by default for a column that has 117 unique values (out of the 18k or so samples provided).

isabella commented 2 years ago

Hi @spullara. You should definitely be able to configure the maximum number of unique values so that your column with 117 unique values is treated as an enum column. Currently, the only way to do that is to pass a config file with the column name, type, and a list of all of the variants. There are two potential implementations that would achieve what you want:

  1. In the config file, allow passing a JSON object that includes the CSV infer options:
    
    #[derive(Clone)]
    pub struct FromCsvOptions<'a> {
        pub column_types: Option<BTreeMap<String, TableColumnType>>,
        pub infer_options: InferOptions,
        pub invalid_values: &'a [&'a str],
    }

    impl<'a> Default for FromCsvOptions<'a> {
        fn default() -> FromCsvOptions<'a> {
            FromCsvOptions {
                column_types: None,
                infer_options: InferOptions::default(),
                invalid_values: DEFAULT_INVALID_VALUES,
            }
        }
    }

    #[derive(Clone, Debug)]
    pub struct InferOptions {
        pub enum_max_unique_values: usize,
    }

    impl Default for InferOptions {
        fn default() -> InferOptions {
            InferOptions { enum_max_unique_values: 100 }
        }
    }


  2. Allow passing the column name and type without forcing the user to list all of the unique variants.
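
To make the threshold in option 1 concrete, here is a minimal sketch of how an `enum_max_unique_values` limit could decide between an enum and a text column. It is not ModelFox's actual inference code: `InferredColumnType` and `infer_column_type` are hypothetical names, and the `<=` comparison is an assumption. Only `InferOptions` and its default of 100 come from the snippet above.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the column type chosen by inference.
#[derive(Debug, PartialEq)]
pub enum InferredColumnType {
    Enum { variants: Vec<String> },
    Text,
}

/// Same shape as the `InferOptions` struct quoted in this thread.
#[derive(Clone, Debug)]
pub struct InferOptions {
    pub enum_max_unique_values: usize,
}

impl Default for InferOptions {
    fn default() -> InferOptions {
        InferOptions { enum_max_unique_values: 100 }
    }
}

/// Assumed rule: a column is an enum when its number of distinct values
/// does not exceed `enum_max_unique_values`; otherwise it is text.
pub fn infer_column_type(values: &[&str], options: &InferOptions) -> InferredColumnType {
    // Collect the distinct values in sorted order.
    let unique: BTreeSet<&str> = values.iter().copied().collect();
    if unique.len() <= options.enum_max_unique_values {
        InferredColumnType::Enum {
            variants: unique.into_iter().map(String::from).collect(),
        }
    } else {
        InferredColumnType::Text
    }
}
```

Under this assumed rule, a column with 117 distinct values is inferred as text with the default limit of 100, and as an enum once the limit is raised to 117 or more.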

I think option 2 is probably closer to the interface you might be looking for? This way you get to configure the type per column but don't have to pass all of the variants (which is cumbersome for enums with a large number of options).

spullara commented 2 years ago

I think just labelling the column an enum without having to list the values would be great.
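
For illustration only, labelling a column as an enum without listing its values could look roughly like the sketch below. `ColumnConfig` and `resolve_variants` are hypothetical names, not part of ModelFox's config format. The assumption is that an omitted variant list is filled in from the distinct values found in the data.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-column config entry: the user declares the type,
/// and `variants` may be left out for enum columns.
#[derive(Clone, Debug, PartialEq)]
pub enum ColumnConfig {
    Enum { name: String, variants: Option<Vec<String>> },
    Text { name: String },
}

/// Returns the variants for an enum column: the listed ones if present,
/// otherwise every distinct value observed in `values` (sorted).
/// Returns `None` for columns that are not enums.
pub fn resolve_variants(column: &ColumnConfig, values: &[&str]) -> Option<Vec<String>> {
    match column {
        ColumnConfig::Enum { variants: Some(listed), .. } => Some(listed.clone()),
        ColumnConfig::Enum { variants: None, .. } => {
            let unique: BTreeSet<&str> = values.iter().copied().collect();
            Some(unique.into_iter().map(String::from).collect())
        }
        ColumnConfig::Text { .. } => None,
    }
}
```

With this shape, the column with 117 unique values would only need its name and the enum type in the config; the variants would be derived from the CSV itself.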