Is there a way to do ngram estimation with custom token separation? The idea would be to get the following behavior:
Hi, this is a sentence. -> Hi, ,, this, is, a, sentence, .
My email is frodo@shire.com. -> My, email, is, frodo, @, shire, ., com, .
Another option could be to treat certain characters, such as the period (.), comma (,), or at sign (@), as additional whitespace (maybe through the --skip_symbols flag?), so that the ngram estimate is computed as if those characters were whitespace. Is there more documentation on that flag, or am I misunderstanding its use?
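To make the two behaviors concrete, here is a minimal sketch of both as a preprocessing step in Python, run on the text before it reaches the estimator. The function names `split_tokens` and `as_whitespace` are hypothetical and not part of any tool, and the regular expression is only one possible definition of a token: a run of word characters, or a single non-space symbol.

```python
import re

# Assumption: a token is either a run of word characters or a single
# non-whitespace, non-word character (punctuation, @, etc.).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_tokens(line):
    """Option 1: keep punctuation, but as separate tokens."""
    return TOKEN_RE.findall(line)


def as_whitespace(line, chars=".,@"):
    """Option 2: treat the given characters as extra whitespace."""
    table = str.maketrans({c: " " for c in chars})
    return line.translate(table).split()


if __name__ == "__main__":
    for text in ("Hi, this is a sentence.", "My email is frodo@shire.com."):
        # Space-joined output is what a whitespace-based estimator would read.
        print(" ".join(split_tokens(text)))
        print(" ".join(as_whitespace(text)))
```

Writing the space-joined output to a file and estimating on that file would give the first behavior without any tokenizer support in the estimation tool itself; whether a built-in flag can do the same is the open question.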