Closed MichaelRoeder closed 1 year ago
@IsraelAbebe, what do you think?
People use :: instead of ። because it's easy available in the keyboards We usually check for both cases to split sentences or replace them.
So I agree with the updates .
Regarding sentence separators they are havily used in traditional documents but currently they are being replaced by space.
Thank you Israel. I will accept the pull request.
Thank you for providing such a nice dataset. We are currently working on integrating them into GERBIL to enable other researchers to use them more easily. However, while working with the Amharic dataset, we encountered a severe issue.
Problem description
The Amharic language uses punctuation characters that are not common in other languages. The two important characters for this issue are the word separator
፡
and the full stop።
.This is an excerpt of the dataset (dev.txt):
The last character should be a full stop, i.e.,
።
. However, in this example and in other sentences in the dataset, the last line comprises two word separators (2x፡
). I think that this is a mistake and should be fixed within the dataset.Proposed fix
Replace
፡፡
with።
in all three files of the Amharic dataset.