paracrawl / Domain_Adaptation

InDomain detection is a tool designed to extract in-domain data from a large collection of data.
GNU General Public License v3.0

User-specified tokenizer #7

Closed: kpu closed this issue 5 years ago

kpu commented 5 years ago

The standard in ParaCrawl is that the user can specify a tokenizer rather than having one hard-coded into the package. That way users can handle languages such as Chinese.
I suspect this will be moot once this is integrated into bitextor, since bitextor should be tokenizing for you anyway.

dionwiggins commented 5 years ago

Any tokenizer can be used. See https://github.com/paracrawl/Domain_Adaptation#tokenizer
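
For readers who want a picture of what "user-specified" can look like in practice, here is a minimal sketch in Python. It is not taken from this repository: the names `Tokenizer`, `whitespace_tokenizer`, `char_tokenizer`, `command_tokenizer`, and `tokenize_corpus` are hypothetical and exist only for illustration. The idea is simply that the pipeline accepts any callable (or any external command that reads stdin and writes space-separated tokens to stdout) instead of hard-coding one tokenizer.

```python
import subprocess
import sys
from typing import Callable, Iterable, List

# Hypothetical type alias: a tokenizer maps one line of text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(line: str) -> List[str]:
    """Default tokenizer: split on whitespace."""
    return line.split()


def char_tokenizer(line: str) -> List[str]:
    """Naive character-level tokenizer, a crude stand-in for a real
    Chinese word segmenter (assumption: illustration only)."""
    return [ch for ch in line if not ch.isspace()]


def command_tokenizer(argv: List[str]) -> Tokenizer:
    """Wrap an external command as a tokenizer.

    Assumes the command reads text on stdin and writes the tokenized
    text, with tokens separated by whitespace, on stdout.
    """
    def tokenize(line: str) -> List[str]:
        result = subprocess.run(
            argv,
            input=line + "\n",
            capture_output=True,
            text=True,
            encoding="utf-8",
            check=True,  # raise if the external tokenizer fails
        )
        return result.stdout.split()

    return tokenize


def tokenize_corpus(lines: Iterable[str],
                    tokenizer: Tokenizer = whitespace_tokenizer) -> List[List[str]]:
    """Tokenize every line with whichever tokenizer the caller supplies."""
    return [tokenizer(line) for line in lines]


if __name__ == "__main__":
    # Stand-in external "tokenizer": a Python one-liner that lowercases stdin.
    lowercase = command_tokenizer(
        [sys.executable, "-c", "import sys; print(sys.stdin.read().lower())"]
    )
    print(tokenize_corpus(["Hello World"], lowercase))
    print(tokenize_corpus(["我爱 北京"], char_tokenizer))
```

Spawning one process per line is slow on a large corpus; a real integration would more likely stream the whole file through the tokenizer once. The sketch only shows the pluggable interface.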