glample / tagger

Named Entity Recognition Tool
Apache License 2.0
1.16k stars, 426 forks

Can my Chinese data be used in this program? (character-level) #75

PCR11 closed 6 years ago

PCR11 commented 6 years ago

Thanks for sharing this program, it is very useful. I ran it on the English corpus that you shared. Because of hardware limits, I changed --char_dim from 25 to 5 and --word_dim from 100 to 10, and got this result:

    44424/46435 (95.66922%)
    Score on dev: 88.03000
    Score on test: 81.13000
    13950, cost average: 0.043258
    14000, cost average: 0.104217
    Epoch 99 done. Average cost: 0.044536

Is this a normal result?

I also read your paper, which says the word representations are generated from the characters they are composed of. Does that mean each English input word is split into characters, and a bi-LSTM then generates a new word embedding from them?

Now I want to use it with my Chinese data. My data looks like this:

    给 O
    予 O
    局 B-T
    部 I-T
    抗 I-T
    炎 I-T

It follows the same format as your English corpus, except that the English corpus has one word per line and my data has one Chinese character per line. Can data in this format be used in this program?

I am looking forward to your help. Thank you very much.
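To make the format question concrete, here is a minimal Python sketch of reading two-column "token tag" data like the sample above. It does not use the tagger's own loader. The function names `read_two_column` and `chars_per_token` are hypothetical, and the convention that blank lines separate sentences is an assumption, since the post does not show a sentence boundary.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]


def read_two_column(text: str) -> List[Sentence]:
    """Parse 'token tag' lines into sentences.

    Assumption: a blank line ends a sentence. The first field on a
    line is taken as the token and the last field as the tag.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def chars_per_token(sentence: Sentence) -> List[List[str]]:
    """Split each token into the characters a character-level model would see."""
    return [list(token) for token, _ in sentence]


# The sample from the post: one Chinese character per line.
sample = "给 O\n予 O\n局 B-T\n部 I-T\n抗 I-T\n炎 I-T\n"
sentences = read_two_column(sample)
print(sentences[0])
print(chars_per_token(sentences[0]))
```

With one character per line, every token splits into a single character, so a character-level component would see sequences of length 1 for each token. With one English word per line, a token such as "EU" would split into two characters. Whether the tagger itself accepts such a file depends on its own loading code, which this sketch does not reproduce.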

PCR11 commented 6 years ago

@glample

PCR11 commented 6 years ago

@pvcastro

pvcastro commented 6 years ago

Sorry @PCR11, I don't know Chinese well enough to say how you should adapt your corpus to run the network, but I think it's definitely possible :thinking: