masakhane-io / masakhane-ner


Faulty full stop character in the Amharic dataset #25

Closed MichaelRoeder closed 1 year ago

MichaelRoeder commented 1 year ago

Thank you for providing such a nice dataset. We are currently working on integrating them into GERBIL to enable other researchers to use them more easily. However, while working with the Amharic dataset, we encountered a severe issue.

Problem description

The Amharic language uses punctuation characters that are not common in other languages. The two important characters for this issue are the word separator ፡ (U+1361) and the full stop ። (U+1362).

This is an excerpt of the dataset (dev.txt):

አምቦ B-LOC
ከዚህ O
በኋላ O
የቱሪዝም O
የባህል O
እና O
የፖለቲካ O
ማዕከል O
ትሆናለች O
፡፡ O

The last character should be a full stop, i.e., ።. However, in this example and in other sentences in the dataset, the last line comprises two word separators (2x ፡). I think that this is a mistake and should be fixed within the dataset.

Proposed fix

Replace ፡፡ with ። in all three files of the Amharic dataset.
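As a rough illustration of this fix, the sketch below rewrites token lines in the space-separated "token tag" layout shown in the excerpt. The function name `fix_full_stops` is hypothetical and not part of the repository, and the sketch assumes that the faulty token always appears alone in the token column.

```python
WORD_SEPARATOR = "\u1361"          # ፡ ETHIOPIC WORDSPACE
FULL_STOP = "\u1362"               # ። ETHIOPIC FULL STOP
FAULTY_STOP = WORD_SEPARATOR * 2   # ፡፡ two word separators used as a full stop


def fix_full_stops(conll_text: str) -> str:
    """Replace a token consisting of two word separators with a real full stop.

    Hypothetical helper: assumes one "token tag" pair per line, separated by
    a single space, as in the excerpt from dev.txt.
    """
    fixed_lines = []
    for line in conll_text.split("\n"):
        parts = line.split(" ")
        # Only touch lines whose token column is exactly the faulty full stop.
        if len(parts) == 2 and parts[0] == FAULTY_STOP:
            parts[0] = FULL_STOP
        fixed_lines.append(" ".join(parts))
    return "\n".join(fixed_lines)


# Small in-memory example; real use would read and rewrite each dataset file.
sample = "ትሆናለች O\n" + FAULTY_STOP + " O\n"
fixed = fix_full_stops(sample)
```

Matching on the whole token rather than doing a blind string replace leaves any ፡፡ that occurs inside a longer token untouched, so such cases can be reviewed by hand.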

dadelani commented 1 year ago

@IsraelAbebe, what do you think?

IsraelAbebe commented 1 year ago

People use :: instead of ። because it is easily available on keyboards. We usually check for both cases when splitting sentences, or replace them.

So I agree with the updates.

Regarding sentence separators, they are heavily used in traditional documents, but currently they are being replaced by spaces.
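A minimal sketch of the "check for both cases" approach when splitting sentences could look like the following. The name `split_sentences` and the set of accepted endings are assumptions made for illustration; other variants, such as two ASCII colons, could be added to the set in the same way.

```python
FULL_STOP = "\u1362"                     # ። ETHIOPIC FULL STOP
DOUBLE_WORD_SEPARATOR = "\u1361\u1361"   # ፡፡ commonly typed in place of ።
# Assumption: these two tokens are the only sentence endings of interest.
SENTENCE_ENDINGS = {FULL_STOP, DOUBLE_WORD_SEPARATOR}


def split_sentences(tokens):
    """Group tokens into sentences, ending a sentence at either variant."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_ENDINGS:
            sentences.append(current)
            current = []
    # Keep trailing tokens that are not followed by a sentence ending.
    if current:
        sentences.append(current)
    return sentences


tokens = ["ሀ", DOUBLE_WORD_SEPARATOR, "ለ", FULL_STOP, "ሐ"]
sentences = split_sentences(tokens)
```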

dadelani commented 1 year ago

Thank you Israel. I will accept the pull request.