Bag of words - what is the delimiter?

Overload119 commented 1 year ago

Consider a table:

target	words
1	This, That, And The Other
0	This
1	And The Other, That

Am I using the commas to infer the bag of words correctly?

isabella commented 1 year ago

The tokenizer will tokenize the string in the following way:

words	tokens
This, That, And The Other	`this` `,` `that` `,` `and` `the` `other`

It's not splitting text into tokens using a comma delimiter.
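
To see why the commas come out as separate tokens, here is a rough Python approximation of a word-and-punctuation tokenizer. It is not ModelFox's actual implementation; the regex and the `tokenize` name are assumptions chosen only to reproduce the tokens listed above.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text, then emit each run of word characters and
    each punctuation mark as its own token (hypothetical helper)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize("This, That, And The Other"))
# ['this', ',', 'that', ',', 'and', 'the', 'other']
```

Because whitespace and punctuation drive the split, "And The Other" can never survive as a single token under this kind of scheme.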

If you instead want three tokens (This, That, And The Other), I suggest preprocessing that column and passing in text that has already been feature engineered.

Overload119 commented 1 year ago

Do you have an example of how that would work? How can I pass text in any other way in the column?

isabella commented 1 year ago

You would need to preprocess your CSV using another tool. Alternatively, you can use an enum column by using a custom config file as described here: https://www.modelfox.dev/docs/guides/train_with_custom_configuration.

In the example linked above, the "chest_pain" column is specified as type "enum" with four variants.

{
  "dataset": {
    "columns": [
      {
        "name": "chest_pain",
        "type": "enum",
        "variants": [
          "asymptomatic",
          "atypical angina",
          "non-angina pain",
          "typical angina"
        ]
      },
      ...
    ]
  }
}

For your dataset, you would specify that the words column is an enum with 3 variants: "This", "That", "And The Other".
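
If you would rather generate that file than write it by hand, a minimal sketch might look like the following. The `build_config` helper is hypothetical, and the field names simply mirror the `chest_pain` example; this thread does not confirm how an enum column treats a cell that contains several comma-separated phrases.

```python
import json

def build_config(column: str, variants: list[str]) -> dict:
    """Build a config dict with one enum column (hypothetical helper;
    the layout mirrors the chest_pain example in this thread)."""
    return {
        "dataset": {
            "columns": [
                {"name": column, "type": "enum", "variants": variants},
            ]
        }
    }

config = build_config("words", ["This", "That", "And The Other"])
print(json.dumps(config, indent=2))
```

Redirect the printed JSON to a file such as `config.json`.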

Then, use the config file by passing --config path/to/config.json on the CLI.
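
For the preprocessing route mentioned earlier, one possible approach is to expand the comma-delimited column into one 0/1 indicator column per phrase before training. This is only a sketch using Python's standard `csv` module: the `expand_phrases` helper and the `has_...` column names are made up for illustration, and how ModelFox infers the type of the resulting columns is not covered in this thread.

```python
import csv
import io
import sys

def expand_phrases(rows, column="words", delimiter=","):
    """Replace one comma-delimited text column with 0/1 indicator columns."""
    rows = list(rows)
    # Split each cell on the delimiter and trim surrounding whitespace.
    split = [
        [p.strip() for p in row[column].split(delimiter) if p.strip()]
        for row in rows
    ]
    # Collect the distinct phrases in first-seen order.
    phrases = list(dict.fromkeys(p for cell in split for p in cell))
    out = []
    for row, cell in zip(rows, split):
        new_row = {k: v for k, v in row.items() if k != column}
        for phrase in phrases:
            key = "has_" + phrase.lower().replace(" ", "_")
            new_row[key] = int(phrase in cell)
        out.append(new_row)
    return out

# The example table from the top of this thread, tab-separated.
raw = (
    "target\twords\n"
    "1\tThis, That, And The Other\n"
    "0\tThis\n"
    "1\tAnd The Other, That\n"
)
rows = expand_phrases(csv.DictReader(io.StringIO(raw), delimiter="\t"))

writer = csv.DictWriter(sys.stdout, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
```

For the example table this produces the columns `target`, `has_this`, `has_that`, and `has_and_the_other`, so each phrase is kept whole instead of being broken into words and commas.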

modelfoxdotdev / modelfox

Bag of words - what is the delimiter? #129