Open Overload119 opened 1 year ago
The tokenizer will tokenize the string in the following way:

| words | tokens |
|---|---|
| This, That, And the Other | this , that , and the other |
It's not splitting text into tokens using a comma delimiter.
If you want the behavior to instead produce three tokens, `This`, `That`, and `And The Other`, I suggest preprocessing those columns and passing text that has already been feature engineered.
Do you have an example of how that would work? Is there any other way to pass text in the column?
You would need to pre-process your CSV using another tool. Alternatively, you can use an enum column by using a custom config file, as described here: https://www.modelfox.dev/docs/guides/train_with_custom_configuration.
In the example linked above, the "chest_pain" column is specified as type "enum" with four variants.
```json
{
  "dataset": {
    "columns": [
      {
        "name": "chest_pain",
        "type": "enum",
        "variants": [
          "asymptomatic",
          "atypical angina",
          "non-angina pain",
          "typical angina"
        ]
      },
      ...
    ]
  }
}
```
For your dataset, you would specify that the `words` column is an `enum` with 3 variants: "This", "That", "And The Other". Then, use the config file by passing `--config path/to/config.json` on the CLI.
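If you go the pre-processing route instead, here is a minimal sketch of what that could look like. The `words` column name and the file paths are assumptions, and joining multi-word phrases with underscores is just one convention for keeping each comma-separated phrase as a single whitespace-delimited token:

```python
import csv

def phrases_to_tokens(text):
    # "This, That, And The Other" -> "this that and_the_other"
    # Each comma-separated phrase becomes one token; spaces inside a
    # phrase are replaced with underscores so a whitespace tokenizer
    # keeps the phrase whole.
    phrases = [p.strip() for p in text.split(",")]
    return " ".join("_".join(p.lower().split()) for p in phrases)

def preprocess(in_path, out_path, column="words"):
    # Rewrite one column of a CSV in place, row by row.
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row[column] = phrases_to_tokens(row[column])
            writer.writerow(row)
```

After this step, the default tokenizer would see three distinct tokens per row instead of splitting on every space.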
Considering a table like the one above: am I using the commas to infer the bag of words correctly?
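As the thread notes above, commas are not treated as delimiters by the default tokenizer. A rough stand-in illustrating the difference (the lowercasing and punctuation handling here is an assumption about the default behavior, not ModelFox's exact implementation):

```python
import re

def default_style_tokens(text):
    # Rough stand-in for a whitespace/punctuation tokenizer: commas
    # become their own tokens rather than phrase boundaries.
    return re.findall(r"\w+|,", text.lower())

def comma_delimited_tokens(text):
    # What the question expects: one token per comma-separated phrase.
    return [p.strip() for p in text.split(",")]

print(default_style_tokens("This, That, And the Other"))
# ['this', ',', 'that', ',', 'and', 'the', 'other']
print(comma_delimited_tokens("This, That, And the Other"))
# ['This', 'That', 'And the Other']
```

So no: the commas end up as tokens themselves, and the bag of words is built from the individual lowercased words, not from the comma-separated phrases.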