hawisdom / EDEE


Questions about reproducing the results #1

Open · Spico197 opened this issue 1 year ago

Spico197 commented 1 year ago

Hi there. Thanks for the excellent work. We were surprised by such a huge performance improvement. While reproducing the results, I encountered the following problems, and I would really appreciate your help with them:

  1. What is your bert-serving setting? Is it just `bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4`? (A minimal client/server sketch follows this list.)
  2. Could you please provide the repeat.json file in https://github.com/hawisdom/EDEE/blob/8cfd4f2e8128e13db2dab5e99876571d634651bb/datasets.py#L98 ?
  3. I found that you didn't use the development set and instead evaluated the model directly on the test set to select the best checkpoint. Doesn't this raise a fairness problem for the comparison?
  4. You created a company name list company.txt and added those names to the user dictionary to avoid unexpected segmentation. However, this may introduce boundary leakage, since gold entity boundaries are exposed to the tokenizer.
  5. It seems you evaluate the results directly on the adjacency matrix (https://github.com/hawisdom/EDEE/blob/8cfd4f2e8128e13db2dab5e99876571d634651bb/trainer.py#L164), which is inconsistent with other baselines such as Doc2EDAG. Do you report these metrics in your paper?
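For concreteness, here is a minimal sketch of the bert-as-service setup asked about in item 1, assuming the standard bert-serving-server / bert-serving-client packages; the model directory, port, and example sentence are placeholders rather than the authors' confirmed configuration.

```python
# Server side (run in a shell): start bert-as-service over an English BERT-base checkpoint.
#   bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4

# Client side: request fixed-length sentence embeddings from the running server.
from bert_serving.client import BertClient

bc = BertClient(ip="localhost")  # default ports 5555 (push) / 5556 (pull)
vecs = bc.encode(["A listed company announced an equity pledge."])
print(vecs.shape)  # (1, 768) for a BERT-base model
```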

Thanks for your kindness; I'm looking forward to your reply.

hawisdom commented 1 year ago

Thank you for your interest in our work. Your understanding of the bert-serving setting is correct, and we have updated datasets.py to include the code that generates repeat.json. We did also run experiments with the development set; the corresponding average score (Avg) is 94.24%. Since the experimental hyperparameters were set by manual experience rather than tuned, we followed the source code of the paper "Relational Graph Attention Network for Aspect-based Sentiment Analysis" and used the test set to evaluate the performance of the model. Regarding the metrics, we employ token-level classification metrics because tokens are the prediction objects, which is what we report in the paper.
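(For readers following along: below is a rough sketch of what token-level micro-F1 over predicted labels could look like, assuming integer label arrays with 0 as the null label. This is only an illustration of the metric family being discussed, not the actual code in trainer.py.)

```python
import numpy as np

def token_micro_f1(pred, gold, null_label=0):
    """Micro precision/recall/F1 over all token positions, ignoring the null label.
    `pred` and `gold` are integer label arrays of the same shape."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    tp = int(np.sum((pred == gold) & (gold != null_label)))
    pred_pos = int(np.sum(pred != null_label))
    gold_pos = int(np.sum(gold != null_label))
    p = tp / pred_pos if pred_pos else 0.0
    r = tp / gold_pos if gold_pos else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Two of three predicted non-null tokens are correct; both gold non-null tokens are found.
print(token_micro_f1([1, 2, 0, 3], [1, 0, 0, 3]))  # (0.667, 1.0, 0.8)
```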

Spico197 commented 1 year ago

Thank you very much for the response and the updated code.

I'm still confused about the metrics. I understand the token-classification metrics now. However, the other baseline systems used for comparison report event-instance-based metrics, so comparing results computed with different metrics and claiming a new SOTA may not be fair. Have you ever tried evaluating the model with the F1-score metric introduced in Doc2EDAG?
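For comparison, here is a rough sketch of the event-record-level matching that a Doc2EDAG-style evaluation performs, so the gap between the two metric families is concrete. It is a simplification (brute-force record alignment over toy dicts), not a reimplementation of the official Doc2EDAG scorer, and the role names and values are made up.

```python
from itertools import permutations

def record_level_f1(pred_records, gold_records):
    """Simplified Doc2EDAG-style scoring: align predicted and gold event
    records, then count matched (role, argument) pairs. Each record is a
    dict mapping role -> argument string (or None if the role is unfilled)."""
    def pair_set(rec):
        return {(role, arg) for role, arg in rec.items() if arg is not None}

    pred_sets = [pair_set(r) for r in pred_records]
    gold_sets = [pair_set(r) for r in gold_records]

    # Brute-force alignment that maximizes matched pairs (fine for tiny examples;
    # the official scorer uses a greedy alignment instead).
    shorter, longer = sorted([pred_sets, gold_sets], key=len)
    best_tp = 0
    for perm in permutations(longer, len(shorter)):
        best_tp = max(best_tp, sum(len(a & b) for a, b in zip(shorter, perm)))

    pred_total = sum(len(s) for s in pred_sets)
    gold_total = sum(len(s) for s in gold_sets)
    p = best_tp / pred_total if pred_total else 0.0
    r = best_tp / gold_total if gold_total else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0

# One gold record; the prediction recovers 2 of its 3 arguments.
gold = [{"Pledger": "CompanyA", "PledgedShares": "1,000,000", "StartDate": "2019-01-01"}]
pred = [{"Pledger": "CompanyA", "PledgedShares": "1,000,000", "StartDate": None}]
print(record_level_f1(pred, gold))  # 0.8
```

A document scored perfectly under a token-level metric can still lose credit here whenever the predicted tokens do not assemble into complete, correctly-typed event records, which is why the two numbers are not directly comparable.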

edzq commented 9 months ago

It is unfair to compare with other works using different metrics and claim that you are the new SOTA. Please @hawisdom 😂