jiesutd / LatticeLSTM

Chinese NER using Lattice LSTM. Code for ACL 2018 paper.

What is the raw data? #11

Closed MrRace closed 6 years ago

MrRace commented 6 years ago

(1) main.py accepts three values for the status parameter: "train", "test" and "decode". In the "train" step you use the "dev" set to choose the best model and save it. It seems that you also use the "test" data to print the model's performance at each iteration. Am I right? When status="test", you again use the "dev" and "test" data to show the model's performance, but you have already used them during the training stage. Is that OK?

(2) In main.py you mention "raw" data when the status argument is "decode". Where can I get the "raw" data?

jiesutd commented 6 years ago

1) The "dev" and "test" sets are not involved in the training process. They are only used to evaluate the model's performance, which does not affect training. Alternatively, you can save your model after each iteration and use status=test to evaluate the saved models one by one; it will give the same result. I just wrote the two steps together for convenience.

2) Raw data means any data you want to decode. The model is trained to work in the real world, not only on the standard train/dev/test datasets. You can use your trained model to decode any data you want (other domains, Weibo, etc.).

MrRace commented 6 years ago

I just put the Chinese string "科学家袁隆平" into the raw data and got the following error:

  File "main.py", line 445, in <module>
    data.generate_instance_with_gaz(raw_file,'raw')
  File "/root/LatticeLSTM/utils/data.py", line 268, in generate_instance_with_gaz
    self.raw_texts, self.raw_Ids = read_instance_with_gaz(input_file, self.gaz, self.word_alphabet,self.biword_alphabet, self.char_alphabet, self.gaz_alphabet,  self.label_alphabet, self.number_normalized, self.MAX_SENTENCE_LENGTH)
  File "/root/LatticeLSTM/utils/functions.py", line 160, in read_instance_with_gaz
    label_Ids.append(label_alphabet.get_index(label))
  File "/root/LatticeLSTM/utils/alphabet.py", line 54, in get_index
    return self.instance2index[self.UNKNOWN]
KeyError: '</unk>'

At first I thought some characters might be missing from the dictionary. But when I use the string "吴重阳,中国国籍", which exists in the dev data, I get the same error. From reading the code, it seems that the raw data needs a "label". So should I provide labels, and in what format? An example would help me get started quickly. Thanks so much.

jiesutd commented 6 years ago

Hi @MrRace, yes! I forgot to mention that the raw data needs to follow the same format as the training/dev/test datasets. That is:

Char1 label1
Char2 label2 
...

For raw data without labels, just use O as the label for every character, i.e.:

科 O
学 O
家 O
袁 O
隆 O
平 O

Note that the evaluation scores reported on raw data without gold labels are not meaningful, because there is no gold-standard label to compare against.
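
For anyone who wants to generate such a file from plain text, here is a minimal sketch in Python. It is not part of the LatticeLSTM repository: the helper name `to_raw_format` is hypothetical, and the blank line written after each sentence is an assumption about how the reader separates sentences, so check it against your train/dev/test files before relying on it.

```python
def to_raw_format(sentences, placeholder_label="O"):
    """Convert plain strings into 'char label' lines, one character per line.

    Hypothetical helper, not part of LatticeLSTM. Every character gets the
    same placeholder label because raw data has no gold annotation.
    """
    blocks = []
    for sentence in sentences:
        # Drop whitespace so that no line starts with an empty "character".
        chars = [ch for ch in sentence if not ch.isspace()]
        if not chars:
            continue
        lines = [f"{ch} {placeholder_label}" for ch in chars]
        # Assumption: a blank line closes each sentence, as in CoNLL-style files.
        blocks.append("\n".join(lines) + "\n\n")
    return "".join(blocks)


if __name__ == "__main__":
    # Prints the six "char O" lines for the example string, then a blank line.
    print(to_raw_format(["科学家袁隆平"]), end="")
```

Write the returned string to a UTF-8 file and pass that file as the raw input when running with status "decode".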