Question about dataset preprocessing

torivor commented 1 year ago

From what I understand based on the official paper, the approach used in this repository is trying to predict the following sequence of tags based on the input sentence:

The train.txt files in the data folder are used to train the model to predict such a sequence. I also noticed that each line in the file consists of both (1) the sentence and (2) the tag sequence, separated by "####". Regarding this, I have several questions:

1. How did you convert the original XML dataset into the current BIEOS/BIO/OT tagging scheme? Is there an open-source tool that can easily apply this tagging scheme to an unlabeled dataset?
2. How should the tag sequence from train.txt be preprocessed into the appropriate format for model training?

lixin4ever commented 1 year ago

Sorry for the late reply.

For question #1: I wrote a script myself to convert the original XML files into the current dataset format.
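The conversion script itself is not shown in this thread, so the following is only a rough sketch of how such a conversion could work, not the script actually used. It assumes the aspect annotations are available as character spans with a polarity label (as is common in XML-annotated review datasets), that tokens are split on whitespace, and that polarity labels look like POS/NEG. The function name `spans_to_bieos` is hypothetical.

```python
def spans_to_bieos(text, aspects):
    """Convert character-level aspect spans into BIEOS tags.

    text:    the raw sentence (assumed to be whitespace-tokenizable)
    aspects: list of (start, end, polarity) tuples, end exclusive
             (an assumed annotation layout, not taken from the repository)
    """
    # Tokenize on whitespace while recording each token's character offsets.
    offsets = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        offsets.append((tok, start, end))
        pos = end

    tags = ["O"] * len(offsets)
    for a_start, a_end, polarity in aspects:
        # Indices of tokens that overlap the aspect span.
        covered = [i for i, (_, s, e) in enumerate(offsets)
                   if s < a_end and e > a_start]
        if not covered:
            continue
        if len(covered) == 1:
            tags[covered[0]] = f"S-{polarity}"      # single-token aspect
        else:
            tags[covered[0]] = f"B-{polarity}"      # beginning
            for i in covered[1:-1]:
                tags[i] = f"I-{polarity}"           # inside
            tags[covered[-1]] = f"E-{polarity}"     # end
    return [tok for tok, _, _ in offsets], tags


text = "The battery life is great but the screen is dim"
screen_start = text.index("screen")
tokens, tags = spans_to_bieos(
    text,
    [(4, 16, "POS"), (screen_start, screen_start + len("screen"), "NEG")],
)
```

Dropping the `E-`/`S-` cases (using `I-` and `B-` instead) would give a BIO variant of the same idea.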

For question #2: No further preprocessing is needed. The data is already in the appropriate format for model training.
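For anyone who wants to inspect the files directly, here is a minimal sketch of reading one line. It relies on the "####" separator described above, and additionally assumes the tag part is a space-separated list of `token=TAG` pairs; please check this against your copy of train.txt. The helper name `parse_line` and the sample line are made up for illustration.

```python
def parse_line(line):
    """Split one dataset line into the sentence, its tokens, and their tags."""
    # Everything before "####" is the raw sentence; the rest is the tag part.
    sentence, tag_part = line.rstrip("\n").split("####", 1)
    tokens, tags = [], []
    for pair in tag_part.split():
        # Split on the LAST "=" so tokens that contain "=" stay intact.
        token, tag = pair.rsplit("=", 1)
        tokens.append(token)
        tags.append(tag)
    return sentence, tokens, tags


# Illustrative line, not copied from the repository's data.
sample = "The staff was horrible .####The=O staff=T-NEG was=O horrible=O .=O\n"
sentence, tokens, tags = parse_line(sample)
```

Keeping tokens and tags as two parallel lists makes it easy to check that both have the same length before passing them on.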

Nikhith10 commented 1 year ago

Could you please provide the tagging notebook used to annotate the dataset, so that we can flexibly train on our own custom datasets? Thank you.

lixin4ever / BERT-E2E-ABSA

Question about dataset preprocessing #45