lixin4ever / BERT-E2E-ABSA

[EMNLP 2019 Workshop] Exploiting BERT for End-to-End Aspect-based Sentiment Analysis
https://arxiv.org/abs/1910.00883
Apache License 2.0

Question about dataset preprocessing #45

Open torivor opened 1 year ago

torivor commented 1 year ago

From what I understand of the official paper, the approach used in this repository predicts the following sequence of tags from the input sentence: [image]

The train.txt files in the data folder are used to train the model to predict such a sequence. I also noticed that each line in the file consists of (1) the sentence and (2) the tag sequence, separated by "####". Regarding this, I have several questions:

  1. How did you convert the original XML dataset into the current BIEOS/BIO/OT tagging scheme? Is there an open-source tool for applying this tagging scheme to an unlabeled dataset?
  2. How should the tag sequence from train.txt be preprocessed into the appropriate format for model training?

lixin4ever commented 1 year ago

Sorry for the late reply.

For question #1: I wrote a script myself to convert the original XML files into the current dataset format.
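
For readers wondering how the tagging schemes mentioned in the question relate to each other, here is a minimal sketch of converting OT tags into BIEOS tags. It is not the conversion script referred to above. The function name `ot_to_bieos` is hypothetical, and the sketch assumes that aspect tags are written as `T-<SENTIMENT>` (for example `T-POS`) and everything else as `O`.

```python
def ot_to_bieos(tags):
    """Convert OT tags (e.g. 'O', 'T-POS') into BIEOS tags (e.g. 'S-POS').

    Assumption: every non-'O' tag has the form 'T-<SENTIMENT>'.
    """
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append("O")
            continue
        sentiment = tag.split("-", 1)[1]
        # A span is a run of identical consecutive aspect tags.
        prev_same = i > 0 and tags[i - 1] == tag
        next_same = i + 1 < len(tags) and tags[i + 1] == tag
        if prev_same and next_same:
            prefix = "I"  # inside a multi-token aspect
        elif prev_same:
            prefix = "E"  # last token of a multi-token aspect
        elif next_same:
            prefix = "B"  # first token of a multi-token aspect
        else:
            prefix = "S"  # single-token aspect
        converted.append(f"{prefix}-{sentiment}")
    return converted


if __name__ == "__main__":
    print(ot_to_bieos(["O", "T-POS", "T-POS", "O", "T-NEG"]))
```

One limitation of this sketch: OT tags alone cannot separate two adjacent aspects that share a sentiment, so such a pair would be merged into a single span.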

For question #2: No further preprocessing is needed. The files are already in the format expected for model training.
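
To inspect the data yourself, a line can be split on "####" as described in the question. The sketch below assumes that the part after the separator is a space-separated list of `word=TAG` items; that layout, the helper name `parse_line`, and the example sentence are illustrative assumptions, so please check them against the actual files in the data folder.

```python
def parse_line(line, sep="####"):
    """Split one dataset line into the raw sentence and (word, tag) pairs.

    Assumption: the text after `sep` looks like 'word=TAG word=TAG ...'.
    """
    sentence, tag_part = line.rstrip("\n").split(sep, 1)
    pairs = []
    for item in tag_part.split():
        # rsplit keeps tokens that themselves contain '=' intact.
        word, tag = item.rsplit("=", 1)
        pairs.append((word, tag))
    return sentence, pairs


if __name__ == "__main__":
    # Made-up example line, not copied from the dataset.
    example = "The staff was rude .####The=O staff=T-NEG was=O rude=O .=O"
    sentence, pairs = parse_line(example)
    print(sentence)
    print(pairs)
```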

Nikhith10 commented 1 year ago

Could you please provide the tagging notebook used to annotate the data, so that we can annotate and train on our own custom datasets? Thank you.