tdozat / Parser-v2

An updated version of the Parser-v1 repo, used for Stanford's submission in the CoNLL17 shared task.

Training set sometimes required for parsing #1

Open tdozat opened 7 years ago

tdozat commented 7 years ago

The model saves a list of all the tokens in the vocabulary in save_dir/words.txt. If there's a case mismatch between the character model and the token model (that is, if the character model is cased and the word vocabulary is caseless), the saved word list can't supply the character vocabulary, so the code reads through the training set to build it. This is a problem when you only want to parse and the training set isn't available.
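
To illustrate the mismatch, here is a minimal sketch (not the repo's actual code) of why a caseless word list is not enough to rebuild a cased character vocabulary. The helper `char_vocab` and the toy token list are hypothetical.

```python
def char_vocab(tokens):
    """Collect every character that appears in an iterable of tokens."""
    return {ch for tok in tokens for ch in tok}

# Toy stand-in for the training set (hypothetical data).
training_tokens = ["The", "Parser", "reads", "the", "data"]

# What a caseless word vocabulary would contain.
caseless_words = sorted({tok.lower() for tok in training_tokens})

cased_chars = char_vocab(training_tokens)    # what a cased character model needs
recovered_chars = char_vocab(caseless_words)  # what the caseless list can provide

# Characters that only the original (cased) training data can supply.
missing = sorted(cased_chars - recovered_chars)
print(missing)  # ['P', 'T']
```

The uppercase characters are lost once the tokens are lowercased, which is why the training set has to be read again.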

Solution: modify the code to save cased and caseless vocabularies in save_dir/words-cased.txt and save_dir/words-caseless.txt, and at parse time load whichever one is dictated by the cased configuration setting.
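
A rough sketch of what that could look like, assuming a plain one-token-per-line file format (the real words.txt layout may differ) and a boolean `cased` setting. The function names `save_vocabs` and `load_vocab` are hypothetical and not part of the repo.

```python
import os
import tempfile

def save_vocabs(tokens, save_dir):
    """Write both a cased and a caseless vocabulary, one token per line."""
    cased = sorted(set(tokens))
    caseless = sorted({tok.lower() for tok in tokens})
    for name, vocab in (("words-cased.txt", cased),
                        ("words-caseless.txt", caseless)):
        with open(os.path.join(save_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(vocab) + "\n")

def load_vocab(save_dir, cased):
    """Load whichever vocabulary file the `cased` setting selects."""
    name = "words-cased.txt" if cased else "words-caseless.txt"
    with open(os.path.join(save_dir, name), encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

# Train time: both files are written. Parse time: only save_dir is needed.
with tempfile.TemporaryDirectory() as save_dir:
    save_vocabs(["The", "the", "Parser"], save_dir)
    cased_vocab = load_vocab(save_dir, cased=True)
    caseless_vocab = load_vocab(save_dir, cased=False)

print(cased_vocab)     # ['Parser', 'The', 'the']
print(caseless_vocab)  # ['parser', 'the']
```

With both files in save_dir, the character vocabulary can be derived from the cased list at parse time without touching the training set.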