teapot123 / JASen

Code and Data for our EMNLP-2020 paper Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding.

JASen for German data #5

Open JoannaSimm opened 2 years ago

JoannaSimm commented 2 years ago

Hello, first of all, thank you for the very interesting paper! I would like to perform ABSA on a non-English (German) dataset. I was able to reproduce your code on the English laptop dataset. I then tried to run the code on German data (essentially the translated laptop dataset), for which I downloaded German GloVe embeddings from https://www.deepset.ai/german-word-embeddings. I replaced the word2vec_100 txt file with the GloVe embeddings and adjusted the embedding_length argument in evaluate.py, since the vectors have length 300. But I get the following errors:

joanna@LAPTOP-RJC2NHSG:/mnt/c/users/joann/onedrive/code$ bash run_jasen.sh
make: 'joint' is up to date.
Starting training using file ./datasets/laptop/train.txt
Training with specificity; Specificity values output to file ./datasets/laptop/emb_mix_spec.txt
Reading topics from file ./datasets/laptop/senti_w_kw.txt, ./datasets/laptop/aspect_w_kw.txt
Document embedding output to: ./datasets/laptop/emb_mix_d.txt
Vocab size: 103098
Words in train file: 4312331
Loading embedding from file word2vec_100.txt
**[ERROR] Embedding dimension incompatible with pretrained file!**
make: 'margin' is up to date.
Starting training using file ./datasets/laptop/train.txt
Training with specificity; Specificity values output to file ./datasets/laptop/emb_senti_w_kw_spec.txt
Reading topics from file ./datasets/laptop/senti_w_kw.txt
Context embedding output to: ./datasets/laptop/emb_senti_w_kw_v.txt
Vocab size: 103098
Words in train file: 4312331
Corpus size: 7500
[ERROR] Topic name hochwerig not found in vocabulary!
Starting training using file ./datasets/laptop/train.txt
Training with specificity; Specificity values output to file ./datasets/laptop/emb_aspect_w_kw_spec.txt
Reading topics from file ./datasets/laptop/aspect_w_kw.txt
Context embedding output to: ./datasets/laptop/emb_aspect_w_kw_v.txt
Vocab size: 103098
Words in train file: 4312331
Corpus size: 7500
Read 8 topics
Support Service Garantie        Abdeckung       Ersatz
os      windows ios     mac     system
Display Bildschirm      Monitor LED     Auflösung
Batterie        Lebensdauer     Akkulaufzeit    Leistung        aufladen
unternehmen     produkt hp      toshiba dell    apple   lenovo
Maus    Touch   Track   Taste   pad
Software        Programme       Anwendungen     iTunes  Foto
Tastatur        Taste   Leertaste       Typ     Tastenfeld
Pre-training for 2 epochs, in total 2 + 5 = 7 epochs
Alpha: 0.014290  Progress: 42.88%  Words/thread/sec: 34.92k
Category (Support):     Support Service Garantie Abdeckung Ersatz Batterien.
Category (os):  os windows ios mac system lion
Category (Display):     Display Bildschirm Monitor LED Auflösung tft
Category (Batterie):    Batterie Lebensdauer Akkulaufzeit Leistung aufladen Batterielebensdauer
Category (unternehmen):         unternehmen produkt hp toshiba dell apple lenovo zenbook,
Category (Maus):        Maus Touch Track Taste pad Desktop-Maus
Category (Software):    Software Programme Anwendungen iTunes Foto Illustrator,
Category (Tastatur):    Tastatur Taste Leertaste Typ Tastenfeld umgebenden
Alpha: 0.010722  Progress: 57.20%  Words/thread/sec: 34.87k
Category (Support):     Support Service Garantie Abdeckung Ersatz Batterien. hardware_failure)
Category (os):  os windows ios mac system lion mac_os
Category (Display):     Display Bildschirm Monitor LED Auflösung tft hell
Category (Batterie):    Batterie Lebensdauer Akkulaufzeit Leistung aufladen Batterielebensdauer "11
Category (unternehmen):         unternehmen produkt hp toshiba dell apple lenovo zenbook, 6973
Category (Maus):        Maus Touch Track Taste pad Desktop-Maus m505
Category (Software):    Software Programme Anwendungen iTunes Foto Illustrator, CS4,
Category (Tastatur):    Tastatur Taste Leertaste Typ Tastenfeld umgebenden steuertaste
Alpha: 0.007163  Progress: 71.50%  Words/thread/sec: 34.99k
Category (Support):     Support Service Garantie Abdeckung Ersatz hardware_failure) Batterien. "physisch"
Category (os):  os windows ios mac system lion mac_os betriebssystem
Category (Display):     Display Bildschirm Monitor LED Auflösung tft hell Brightview
Category (Batterie):    Batterie Lebensdauer Akkulaufzeit Leistung aufladen "11 Batterielebensdauer 4-15
Category (unternehmen):         unternehmen produkt hp toshiba dell apple lenovo 6973 zenbook, m17x
Category (Maus):        Maus Touch Track Taste pad m505 Desktop-Maus mk320
Category (Software):    Software Programme Anwendungen iTunes Foto CS4, Illustrator, CPU-lastige
Category (Tastatur):    Tastatur Taste Leertaste Typ Tastenfeld steuertaste umgebenden drücken),
Alpha: 0.003561  Progress: 85.79%  Words/thread/sec: 34.88k
Category (Support):     Support Service Garantie Abdeckung Ersatz hardware_failure) "physisch" Batterien. quit_working.
Category (os):  os windows ios mac system lion mac_os betriebssystem osx
Category (Display):     Display Bildschirm Monitor LED Auflösung tft Brightview hell 16-10-Verhältnis
Category (Batterie):    Batterie Lebensdauer Akkulaufzeit Leistung aufladen "11 4-15 Batterielebensdauer geringer_Helligkeit
Category (unternehmen):         unternehmen produkt hp toshiba dell apple lenovo 6973 m17x zenbook, q1,
Category (Maus):        Maus Touch Track Taste pad m505 Desktop-Maus mk320 (logitec)
Category (Software):    Software Programme Anwendungen iTunes Foto CPU-lastige CS4, Illustrator, Draw
Category (Tastatur):    Tastatur Taste Leertaste Typ Tastenfeld steuertaste drücken), umgebenden 'unwirksam'
Alpha: 0.000002  Progress: 100.12%  Words/thread/sec: 34.89k
Category (Support):     Support Service Garantie Abdeckung Ersatz hardware_failure) "physisch" Batterien. quit_working. bootfähig,
Category (os):  os windows ios mac system lion mac_os betriebssystem osx nächstes.
Category (Display):     Display Bildschirm Monitor LED Auflösung tft 16-10-Verhältnis Brightview hell 1280x760
Category (Batterie):    Batterie Lebensdauer Akkulaufzeit Leistung aufladen "11 4-15 geringer_Helligkeit Batterielebensdauer Akku
Category (unternehmen):         unternehmen produkt hp toshiba dell apple lenovo 6973 m17x q1, zenbook, acers,
Category (Maus):        Maus Touch Track Taste pad m505 Desktop-Maus mk320 (logitec) keyboardLogitech
Category (Software):    Software Programme Anwendungen iTunes Foto CPU-lastige CS4, Illustrator, Draw Indesign,
Category (Tastatur):    Tastatur Taste Leertaste Typ Tastenfeld steuertaste drücken), umgebenden 'unwirksam' linke_klick
Topic mining results written to file ./datasets/laptop/res_aspect_w_kw.txt
Traceback (most recent call last):
  File "evaluate.py", line 317, in <module>
    joint_word_emb, vocabulary, vocabulary_inv = get_emb(os.path.join(args.dataset, w_emb_file))
  File "evaluate.py", line 44, in get_emb
    f = open(vec_file, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/laptop/emb_mix_w.txt'
teapot123 commented 2 years ago

Hi Joanna,

Thanks for your interest in the paper! From your error message, it seems there are several errors in embedding training that need to be handled:

1) Loading embedding from file word2vec_100.txt [ERROR] Embedding dimension incompatible with pretrained file!: The loaded word2vec_100.txt file has a different dimension from the "-size" argument passed to margin.c/joint.c (not evaluate.py). I suggest checking run_jasen.sh to see whether "-size 100" has been changed to "-size 300" on lines 35, 53, and 67.

2) [ERROR] Topic name hochwerig not found in vocabulary!: The word "hochwerig" was not collected into the vocabulary because its frequency in your corpus is lower than "-min-count 2" (also set on lines 35, 53, and 67). I suggest using similar topic names that appear more frequently in the corpus. If you really want to keep this word, check how many times it appears in the corpus and lower the "-min-count" setting accordingly.
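To see in advance which seed words would fall below the "-min-count" threshold, one can count their occurrences in the training corpus. The following is a minimal sketch and not part of JASen: it assumes whitespace tokenisation and case-sensitive matching (which may not mirror the C code exactly), `rare_seed_words` is a hypothetical helper name, and the two-sentence corpus is toy data for illustration only.

```python
from collections import Counter


def rare_seed_words(corpus_lines, seed_words, min_count=2):
    """Return {seed word: count} for seed words seen fewer than min_count times."""
    wanted = set(seed_words)
    counts = Counter()
    for line in corpus_lines:
        for token in line.split():  # assumption: whitespace-separated tokens
            if token in wanted:
                counts[token] += 1
    # Counter returns 0 for words that never occur in the corpus.
    return {w: counts[w] for w in wanted if counts[w] < min_count}


if __name__ == "__main__":
    # Toy corpus, not taken from the real dataset.
    corpus = [
        "Der Bildschirm ist hochwertig",
        "hochwertig verarbeitet , guter Bildschirm",
    ]
    seeds = ["hochwertig", "hochwerig", "Bildschirm"]
    print(rare_seed_words(corpus, seeds))
```

A seed word reported with a count of 0 never occurs in the corpus at all, which can also point to a misspelling in the keyword file rather than a genuinely rare word.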

JoannaSimm commented 2 years ago

Hi Jiaxin,

thanks for the quick reply and your help!

I was able to solve the problem after editing run_jasen.sh as you said. I also made the following changes to the code in order to use embeddings of a different length (300 instead of 100):

  1. I manually added a line to the German pre-trained word embedding txt file giving the dimensions of the embedding matrix (as in the original word2vec_100 file). After adding this line, the embedding dimension error no longer appeared.
  2. I made the following changes to the evaluate.py file:

line 350: s_rep = np.sum([joint_word_emb[w] if w in vocabulary else np.zeros((300)) for w in s.split(' ')], axis=0)/len(text)

line 366: s_rep = np.sum([marginal_w_emb[aspect][w] if w in vocabulary1 else np.zeros((300)) for w in s.split(' ')], axis=0)/len(text)

lines 442-444:

joint_embedding = torch.zeros((len(vocabulary)+1, 300)) 
aspect_embedding = torch.zeros((len(vocabulary)+1, 300))
senti_embedding = torch.zeros((len(vocabulary)+1, 300))

lines 452-454:

joint_model = CNN(batch_size, output_size, 1, 20, [2,3,4], 1, 0, 0.0, len(vocabulary)+1, 300, joint_embedding)
aspect_model = CNN(batch_size, len(aspect_topic), 1, 20, [2,3,4], 1, 0, 0.0, len(vocabulary)+1, 300, aspect_embedding)
senti_model = CNN(batch_size, len(senti_topic), 1, 20, [2,3,4], 1, 0, 0.0, len(vocabulary)+1, 300, senti_embedding)
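Instead of hardcoding 300 in each of these places, the vector length could be read from the loaded embeddings. The sketch below is illustrative only: it assumes the embeddings are held in a dict mapping each word to a NumPy vector, `embedding_dim` and `sentence_rep` are hypothetical helper names, and the average is taken over the number of tokens, which may differ from the `len(text)` divisor used in the original lines.

```python
import numpy as np


def embedding_dim(word_emb):
    """Return the vector length of the first entry in a word -> vector mapping."""
    return len(next(iter(word_emb.values())))


def sentence_rep(sentence, word_emb):
    """Average the word vectors of a sentence; unknown words count as zero vectors."""
    dim = embedding_dim(word_emb)
    tokens = sentence.split(' ')
    vectors = [word_emb[w] if w in word_emb else np.zeros(dim) for w in tokens]
    return np.sum(vectors, axis=0) / len(tokens)


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, for illustration only.
    toy_emb = {"Akku": np.array([1.0, 3.0]), "gut": np.array([3.0, 1.0])}
    print(embedding_dim(toy_emb))
    print(sentence_rep("Akku gut unbekannt", toy_emb))
```

The value returned by `embedding_dim` could then be passed wherever 300 appears in the `torch.zeros` and `CNN` calls, so that switching to embeddings of another length needs no further code edits.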

I also had some problems with the German test file, since the fields in this file were separated by spaces instead of tabs. After correcting this, the code also worked for the German dataset.
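Step 1 above (adding the header line to the GloVe file) can also be scripted. This is a hedged sketch: it assumes the expected header follows the usual word2vec text convention of `<number of words> <vector length>`, that each GloVe line is a word followed by space-separated numbers, and that the file is UTF-8 encoded. `add_word2vec_header` is a hypothetical name, not a function from the JASen repository.

```python
def add_word2vec_header(src_path, dst_path):
    """Copy a headerless embedding file to dst_path with a 'rows dim' first line."""
    n_words = 0
    dim = None
    # First pass: count vectors and read the vector length from the first row.
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if dim is None:
                dim = len(fields) - 1  # first field is the word itself
            n_words += 1
    if dim is None:
        raise ValueError("no vectors found in " + src_path)
    # Second pass: write the header, then copy the non-empty rows unchanged.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{n_words} {dim}\n")
        for line in src:
            if line.strip():
                dst.write(line)
    return n_words, dim
```

The function writes to a new file rather than editing in place, so the original download stays untouched. It does not detect an existing header, so running it on a file that already has one would add a second header line.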

JoannaSimm commented 2 years ago

Hi @teapot123,

I have one more question regarding your ABSA model. I'm trying to perform ABSA on a German dataset with 27 different aspects. My dataset contains 300,000 reviews. However, after the 4th round of topic expansion, the following error appears:

run_jasen.sh: line 36: 7632 Segmentation fault (core dumped) ./src/joint -train ./datasets/${dataset}/${text_file} -topic1-name ./datasets/${dataset}/${topic_file1} -topic2-name ./datasets/${dataset}/${topic_file2} -load-emb ${pretrain_emb} -spec ./datasets/${dataset}/emb_${topic}_spec.txt -res ./datasets/${dataset}/res_${topic}.txt -k 10 -expand 1 -word-emb ./datasets/${dataset}/emb_${topic}_w.txt -doc ./datasets/${dataset}/emb_${topic}_d.txt -topic-emb ./datasets/${dataset}/emb_${topic}_t.txt -size 300 -window 5 -negative 5 -sample 1e-3 -min-count 1 -threads 10 -binary 0 -iter 10 -pretrain 4 -global_lambda 2.5

The margin parts seem to work fine; it is the joint part that causes the error. The mixed files are not produced, which results in the following error:

Topic mining results written to file ./datasets/dat/res_aspect_w_kw.txt
datasets/dat/emb_mix_w.txt
Traceback (most recent call last):
  File "evaluate.py", line 318, in <module>
    joint_word_emb, vocabulary, vocabulary_inv = get_emb(os.path.join(args.dataset, w_emb_file))
  File "evaluate.py", line 45, in get_emb
    f = open(vec_file, 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'datasets/dat/emb_mix_w.txt'

Do you happen to know how to fix this? Could joint.c contain a bug, or is it a compilation problem? Thank you!

JoannaSimm commented 2 years ago

After debugging joint.c, I found that the segfault occurs at line 1121: else g = (label - expTable[(int) ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha; because f is equal to -nan at that point. Do you have any idea why this happens and how to fix it?
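One plausible reading of this crash, offered as an assumption rather than a diagnosis: if the `else` branch at line 1121 follows the usual word2vec.c pattern of two range checks (`f > MAX_EXP` and `f < -MAX_EXP`), a NaN fails both comparisons and reaches the table lookup, where converting NaN to an integer index is undefined behaviour in C and can read outside `expTable`. The Python sketch below only mimics that control flow. The constants are the word2vec.c defaults and may differ in joint.c, and `reaches_table_lookup` and `safe_table_index` are hypothetical names. It does not explain why `f` became NaN in the first place.

```python
import math

MAX_EXP = 6            # assumed value (word2vec.c default); joint.c may differ
EXP_TABLE_SIZE = 1000  # assumed value (word2vec.c default); joint.c may differ


def reaches_table_lookup(f):
    """Mimic the two range checks: True means the expTable branch would run."""
    if f > MAX_EXP:
        return False
    if f < -MAX_EXP:
        return False
    return True


def safe_table_index(f):
    """Return a valid expTable index, or None when no lookup should happen."""
    if not math.isfinite(f) or f >= MAX_EXP or f <= -MAX_EXP:
        return None
    # Integer division mirrors the C expression EXP_TABLE_SIZE / MAX_EXP / 2.
    return int((f + MAX_EXP) * (EXP_TABLE_SIZE // MAX_EXP // 2))


if __name__ == "__main__":
    nan = float("nan")
    print(reaches_table_lookup(nan))  # NaN slips past both comparisons
    print(safe_table_index(nan))      # the finiteness check rejects it
    print(safe_table_index(0.0))
```

A guard of this kind in the C code would only stop the out-of-bounds read. The NaN itself would still need to be traced back to wherever the embeddings first become non-finite.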

hoangnv735 commented 1 year ago

Sorry for digging this up. If you are still interested in fixing the "Core dumped" error, have you tried decreasing the number of threads (e.g., -threads 1) in the scripts? The program crashes with a "core dumped" error when I run it with 200 seed words per aspect. This tip helped me deal with it.