modelfoxdotdev / modelfox

ModelFox makes it easy to train, deploy, and monitor machine learning models.

Bag of words - what is the delimiter? #129

Open Overload119 opened 1 year ago

Overload119 commented 1 year ago

Consider a table:

| target | words |
|--------|-------|
| 1 | This, That, And The Other |
| 0 | This |
| 1 | And The Other, That |

Am I using the commas to infer the bag of words correctly?

isabella commented 1 year ago
The tokenizer will tokenize the string in the following way:

| words | tokens |
|-------|--------|
| This, That, And The Other | this , that , and the other |

It's not splitting text into tokens using a comma delimiter.
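To make the difference concrete, here is a rough Python sketch that reproduces the token sequence shown above. It is only an approximation for illustration: the regular expression and the `approximate_tokens` name are assumptions, not ModelFox's actual tokenizer.

```python
import re
from typing import List

# Assumed approximation: lowercase the text, then treat each run of word
# characters and each punctuation mark as its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def approximate_tokens(text: str) -> List[str]:
    """Return lowercase word and punctuation tokens for `text`."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = approximate_tokens("This, That, And The Other")
print(tokens)  # ['this', ',', 'that', ',', 'and', 'the', 'other']
```

Note that the comma comes out as a token of its own instead of acting as a separator, which is why "And The Other" ends up as three separate words.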

If you want three tokens instead ("This", "That", and "And The Other"), I suggest preprocessing that column and passing in text that has already been feature engineered.

Overload119 commented 1 year ago

Do you have an example of how that would work? How can I pass text in any other way in the column?

isabella commented 1 year ago

You would need to preprocess your CSV using another tool. Alternatively, you can use an enum column by using a custom config file as described here: https://www.modelfox.dev/docs/guides/train_with_custom_configuration.

In the example linked above, the "chest_pain" column is specified as type "enum" with four variants.

{
  "dataset": {
    "columns": [
      {
        "name": "chest_pain",
        "type": "enum",
        "variants": [
          "asymptomatic",
          "atypical angina",
          "non-angina pain",
          "typical angina"
        ]
      },
      ...
    ]
  }
}

For your dataset, you would specify that the words column is an enum with 3 variants: "This", "That", "And The Other".
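As a sketch of what that could look like, the Python snippet below builds such a config and prints it as JSON. It only uses the keys visible in the linked example ("dataset", "columns", "name", "type", "variants"); the `build_enum_config` helper is hypothetical, and any other columns in your dataset are left out here.

```python
import json


def build_enum_config(column_name, variants):
    """Build a config dict declaring one enum column (hypothetical helper)."""
    return {
        "dataset": {
            "columns": [
                {
                    "name": column_name,
                    "type": "enum",
                    "variants": list(variants),
                }
            ]
        }
    }


config = build_enum_config("words", ["This", "That", "And The Other"])

# Print the JSON so it can be saved to a file such as config.json.
print(json.dumps(config, indent=2))
```

Keep in mind that this treats each cell as a single value, so every cell in the words column would presumably need to match one of the variants exactly.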

Then, use the config file by passing --config path/to/config.json on the CLI.
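For the preprocessing route mentioned earlier, one possible approach is to split the comma-delimited column yourself before training and replace it with one 0/1 indicator column per phrase. The sketch below uses only Python's standard `csv` module; the `has_...` column naming and the sample data layout (with quoted cells) are assumptions for illustration, not something ModelFox requires.

```python
import csv
import io
import sys

# Sample data from the question. Cells that contain commas must be quoted.
SAMPLE_CSV = '''target,words
1,"This, That, And The Other"
0,This
1,"And The Other, That"
'''


def split_phrases(cell):
    """Split a comma-delimited cell into trimmed, non-empty phrases."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def indicator_name(phrase):
    """Hypothetical naming scheme for the generated columns."""
    return "has_" + phrase.lower().replace(" ", "_")


def expand_phrase_column(rows, column="words"):
    """Replace `column` with one 0/1 indicator column per distinct phrase."""
    rows = list(rows)
    phrases = sorted({p for row in rows for p in split_phrases(row[column])})
    expanded = []
    for row in rows:
        present = set(split_phrases(row[column]))
        new_row = {key: value for key, value in row.items() if key != column}
        for phrase in phrases:
            new_row[indicator_name(phrase)] = "1" if phrase in present else "0"
        expanded.append(new_row)
    return expanded


expanded_rows = expand_phrase_column(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Write the preprocessed table as CSV; redirect stdout to a file to save it.
writer = csv.DictWriter(sys.stdout, fieldnames=list(expanded_rows[0].keys()))
writer.writeheader()
writer.writerows(expanded_rows)
```

With this layout each phrase becomes its own feature, so a row such as "And The Other, That" is represented by two indicators being set, regardless of the order in which the phrases appear.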