ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License
657 stars 195 forks

How to use punctuator2 tools to process/train/test Chinese data? #43

Open shenxuhui opened 5 years ago

shenxuhui commented 5 years ago

I want to use punctuator2 to punctuate Chinese text data, but I have no idea how to preprocess it.

Chinese text data differs from English text data in the following ways:

  1. Chinese text has no spaces between characters.
  2. Chinese text is encoded as UTF-8.
  3. Chinese text uses a different punctuation system from English, BUT its marks can be mapped to the English COMMA and PERIOD.
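The differences listed above suggest a simple preprocessing pass: put a space between characters and replace each Chinese punctuation mark with a punctuator2-style token. The following is only a minimal sketch under assumptions. The `PUNCT_MAP` table and the `to_training_line` helper are hypothetical names, not part of punctuator2, and the token spellings (`,COMMA`, `.PERIOD`, and so on) should be checked against the punctuation vocabulary the project actually expects.

```python
# -*- coding: utf-8 -*-
# Hypothetical mapping from Chinese punctuation marks to punctuator2-style
# tokens. Verify the token names against the project's own vocabulary.
PUNCT_MAP = {
    u"，": u",COMMA",
    u"、": u",COMMA",
    u"。": u".PERIOD",
    u"？": u"?QUESTIONMARK",
    u"！": u"!EXCLAMATIONMARK",
    u"：": u":COLON",
    u"；": u";SEMICOLON",
}


def to_training_line(text):
    """Return `text` as space-separated tokens, one token per character,
    with known punctuation marks replaced by their mapped tokens."""
    tokens = []
    for ch in text:
        if ch.isspace():
            continue  # drop existing whitespace; tokens are re-joined below
        tokens.append(PUNCT_MAP.get(ch, ch))
    return u" ".join(tokens)


print(to_training_line(u"你好，世界。"))
# prints: 你 好 ,COMMA 世 界 .PERIOD
```

Note that this sketch also splits embedded Latin words and numbers into single characters, which may or may not be what you want.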

I processed my data the same way as English data (adding a space between Chinese characters, substituting ',' with ',COMMA', and so on), but I get an error:

256 0.02 Model_./models/hello.mdl_h256_lr0.02.pcl
Building model...
Number of parameters is 2049032
Training...
WARNING: Not enough samples in '../data/train'. Reduce mini-batch size to 0 or use a dataset with at least 6400 words.
Total number of training labels: 0
WARNING: Not enough samples in '../data/dev'. Reduce mini-batch size to 0 or use a dataset with at least 6400 words.
Total number of validation labels: 0
Traceback (most recent call last):
  File "main.py", line 202, in <module>
    ppl = np.exp(total_neg_log_likelihood / total_num_output_samples)
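Since the log reports zero training and validation labels, it may help to check the preprocessed files independently before training. The sketch below counts whitespace-separated word and punctuation tokens in an iterable of lines. It does not reproduce punctuator2's own data loading. The `count_tokens` helper and the `PUNCT_TOKENS` set are hypothetical, and `MIN_WORDS` is simply the 6400 figure quoted in the warning above.

```python
# -*- coding: utf-8 -*-
# Assumed punctuation tokens; adjust to match the tokens in your data.
PUNCT_TOKENS = {u",COMMA", u".PERIOD", u"?QUESTIONMARK", u"!EXCLAMATIONMARK",
                u":COLON", u";SEMICOLON"}

MIN_WORDS = 6400  # threshold taken from the warning message in the log


def count_tokens(lines):
    """Count word tokens and punctuation tokens in an iterable of lines."""
    words = 0
    puncts = 0
    for line in lines:
        for tok in line.split():
            if tok in PUNCT_TOKENS:
                puncts += 1
            else:
                words += 1
    return words, puncts


# In-memory demo; for a real file, pass an open file object decoded as UTF-8.
sample = [u"你 好 ,COMMA 世 界 .PERIOD"]
words, puncts = count_tokens(sample)
print(words, puncts, words >= MIN_WORDS)
```

If the word count is far below the threshold, or the punctuation count is zero, the file is probably not tokenized the way the training script expects.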

Thank you for reading this issue. I sincerely look forward to your reply.