lixin4ever / TNet

Transformation Networks for Target-Oriented Sentiment Classification (ACL 2018)
https://arxiv.org/abs/1805.01086

Maybe I found a bug in the data preprocessing stage #4

Closed Echoyyyy closed 5 years ago

Echoyyyy commented 5 years ago

Hello, lixin. I'm interested in your code and have been trying to read and reimplement it. However, I found an issue. In the read() function in utils.py,

words.append(t.strip(end))
target_words.append(t.strip(end))

these lines strip the characters '/', 'n', 'p', and '0' from the beginning and end of the words, rather than removing the tag as a suffix. For example:

s = 'sheen/n'
print(s.strip('/n'))

shee

Hence, this corrupts the data, and the input to TNet may be wrong as a result. May I ask what this is about? Why is it done this way?
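
To make the behaviour concrete, here is a minimal sketch contrasting str.strip's character-set semantics with explicit tag removal (the token below is made up, not taken from the dataset):

token = 'phone/p'  # made-up target token carrying a positive sentiment tag

# str.strip treats its argument as a set of characters to remove from both ends
print(token.strip('/p'))  # prints 'hone'

# removing only the two-character tag keeps the word intact
print(token[:-2] if token.endswith('/p') else token)  # prints 'phone'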

lixin4ever commented 5 years ago

Yes, thanks for pointing out this problem.

Here, the statements .strip("/n"), .strip("/p") and .strip("/0") are meant to remove the sentiment tag from the target word (please check the data files).

As you point out, these statements additionally remove the leading and trailing 'n' and 'p' characters of the target word, which we overlooked in our experiments and which may introduce some OOV words. You can fix this in your own experiments.

Echoyyyy commented 5 years ago

That's not quite it. This problem does not only remove the leading and trailing 'n' and 'p' characters from the target word; every word in the sentence loses its 'n' and 'p' characters.

Also, I fixed this error in your code and, to my astonishment, TNet no longer works well, whereas TNet with the erroneous input data works very well.

I suspect the reason TNet performs so well is the mistake in the data processing.

lixin4ever commented 5 years ago

In the original code, only the target words (i.e., the words containing the strings "/p", "/n" or "/0") are stripped with the variable "end". When a word is not a target word, nothing is done to it.

This preprocessing error was accidental; the cause is that I used the .strip() function incorrectly.

Directly removing the last two characters of the target words should be correct.
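
For reference, here is a minimal sketch of what that fix could look like (the function below is illustrative and not the exact read() code in utils.py):

SENT_TAGS = ('/p', '/n', '/0')

def clean_token(t):
    # only tokens ending with a sentiment tag are target words;
    # slice off the two-character tag instead of stripping characters
    for tag in SENT_TAGS:
        if t.endswith(tag):
            return t[:-2]
    return t  # non-target words are left untouched

print([clean_token(t) for t in 'the phone/p screen is great'.split()])
# ['the', 'phone', 'screen', 'is', 'great']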

I have also fixed this in our code, and our model still outperforms the state-of-the-art methods on Twitter (~73.0 acc) and Laptop14 (~75.5 acc).

I guess the reason you obtain poor performance after fixing this preprocessing error is that you are still using the previously built word embedding files (i.e., xxx.pkl). Since the vocabulary changes, the word embedding files should be rebuilt from the pre-trained word embeddings. Also, fine-tuning the word embeddings during training may improve performance (not tested yet).
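
As a generic sketch, rebuilding an embedding matrix for the new vocabulary from a GloVe-style text file could look like this (the function name, file path, and dimension below are placeholders, not the actual ones used in this repo):

import numpy as np

def build_embedding_matrix(vocab, glove_path, dim=300):
    # vocab: dict mapping word -> integer id, built from the re-processed corpus
    table = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype('float32')
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in vocab and len(vec) == dim:
                table[vocab[word]] = np.asarray(vec, dtype='float32')
    return table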

If you have any problem, feel free to add comments here or contact me via email.

Echoyyyy commented 5 years ago

Oh, sorry, my mistake. It is true that this error only occurs when the word is a target.

I replaced .strip(end) with .split(end)[0], and the model achieves 72.4 (acc) on Twitter and 72.16 (acc) on Laptop.
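
As a quick sanity check (illustrative token only), .split(end)[0] agrees with slicing off the two-character tag when the tag appears only at the end of the token:

token = 'screen/n'  # made-up target token carrying a negative sentiment tag
print(token.split('/n')[0])  # prints 'screen'
print(token[:-2])            # prints 'screen'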

Could you tell me how you fixed this error, please?

lixin4ever commented 5 years ago

The result on the Laptop dataset is abnormally low. You could try aligning your settings (e.g., the versions of Theano, pygpu, CUDA, and cuDNN) with those in the README.