renepickhardt / generalized-language-modeling-toolkit

Generalized Language Modeling toolkit
http://glm.rene-pickhardt.de

How to treat reserved symbols in Training and Querying files #71

Open lschmelzeisen opened 9 years ago

lschmelzeisen commented 9 years ago

Currently the reserved symbols are _ (absolute skip), % (continuation skip), and / (token-POS separator).

IIRC the program fails if any of these are contained in training or querying files.

How do we cope with this issue?

lschmelzeisen commented 9 years ago

Commit 9e4c6a7e740eaa55183431a5748fe31e445054b4 scans the corpus for reserved symbols and refuses to run if it contains any.
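
For illustration, a check of this kind could look roughly like the following Python sketch. It is not the code from the commit; the names `RESERVED_SYMBOLS`, `find_reserved_symbols`, and `check_corpus` are hypothetical, and the only thing taken from this issue is the set of three reserved symbols.

```python
# Hypothetical sketch, not the toolkit's actual implementation.
# The three symbols and their roles are the ones listed in this issue.
RESERVED_SYMBOLS = {
    "_": "absolute skip",
    "%": "continuation skip",
    "/": "token-POS separator",
}


def find_reserved_symbols(lines):
    """Return a list of (line_number, column, symbol), all 1-based."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ch in RESERVED_SYMBOLS:
                hits.append((line_no, col, ch))
    return hits


def check_corpus(lines):
    """Raise ValueError on the first reserved symbol found in the corpus."""
    hits = find_reserved_symbols(lines)
    if hits:
        line_no, col, ch = hits[0]
        raise ValueError(
            f"Reserved symbol {ch!r} ({RESERVED_SYMBOLS[ch]}) "
            f"at line {line_no}, column {col}"
        )
```

Calling `check_corpus` on the lines of a training or querying file before any counting starts would give the user an error that points at the offending position instead of a failure later in the pipeline.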

However, in the long run I would like to have some form of input escaping that makes this transparent for the user.
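
One possible shape for such escaping, as a rough Python sketch: the backslash escape character and the letter codes `u`, `p`, `s` are assumptions made up for this example, not a format the toolkit defines. The idea is that every reserved symbol is replaced on input by a two-character sequence that contains no reserved symbol, and the replacement is reversed on output.

```python
# Hypothetical escaping scheme; the escape character and codes are
# assumptions for illustration only.
ESCAPE = "\\"
ENCODE = {"_": "u", "%": "p", "/": "s", ESCAPE: ESCAPE}
DECODE = {code: symbol for symbol, code in ENCODE.items()}


def escape(text):
    """Replace each reserved symbol (and the escape character) by ESCAPE + code."""
    return "".join(ESCAPE + ENCODE[ch] if ch in ENCODE else ch for ch in text)


def unescape(text):
    """Invert escape(); raise ValueError on a malformed escape sequence."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == ESCAPE:
            code = next(chars, None)  # None if the text ends after ESCAPE
            if code not in DECODE:
                raise ValueError(f"Invalid escape sequence: {ESCAPE}{code}")
            out.append(DECODE[code])
        else:
            out.append(ch)
    return "".join(out)
```

With this scheme, `escape` would be applied to every token read from training and querying files, and `unescape` to every token written back out, so the user never sees the escaped form. The escape character is itself escaped, which keeps the mapping reversible for arbitrary input.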