sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Preserve Spanish « opening and closing » double quote marks in the training data. #297

Open davidbaines opened 8 months ago

davidbaines commented 8 months ago

It seems that these Spanish « opening and closing » double quote marks are removed from the training data. A lot of pre- and post-processing is required to support them.

I did a find-and-replace to change them to English “opening and closing” quotes prior to training, then replaced them in the source text with English opening and closing double quote marks so that they match those seen in training.

Then the draft has to be edited, which is complicated by the fact that it contains straight English double quotes, so there is no distinction between the opening and closing marks. More care had to be taken changing those into opening and closing Spanish quotation marks.

davidbaines commented 8 months ago

There were reports from some teams that common double quotes were not appearing in the drafts where they should. We should check that quotation marks of all kinds travel through the pipeline correctly.

isaac091 commented 8 months ago

It looks like the MosesPunctNormalizer is replacing all types of quotes with the straight single and double quotes ' and " during tokenization, regardless of project/language.

Since the data NLLB was trained on was all normalized this way, we'll want to keep this preprocessing step, but we'll have to come up with a post-processing solution to add the proper punctuation back in.
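For reference, the quote-flattening behavior described above can be sketched in pure Python. This is an assumption-laden sketch that mirrors only the quote-related substitutions of Moses-style punctuation normalization, not the full MosesPunctNormalizer rule set; the helper name `normalize_quotes` is hypothetical.

```python
# Sketch of the quote substitutions applied by Moses-style punctuation
# normalization (assumption: only the quote rules are reproduced here).
QUOTE_SUBS = [
    ("\u00AB", '"'),  # « left-pointing guillemet
    ("\u00BB", '"'),  # » right-pointing guillemet
    ("\u201C", '"'),  # “ left double quotation mark
    ("\u201D", '"'),  # ” right double quotation mark
    ("\u201E", '"'),  # „ double low-9 quotation mark
    ("\u2018", "'"),  # ‘ left single quotation mark
    ("\u2019", "'"),  # ’ right single quotation mark
]

def normalize_quotes(text: str) -> str:
    """Flatten typographic quotes to straight ASCII quotes."""
    for src, tgt in QUOTE_SUBS:
        text = text.replace(src, tgt)
    return text

print(normalize_quotes("«Hola», dijo."))  # -> "Hola", dijo.
```

This illustrates why every variety of quote comes out of the pipeline as ' or ": the distinction between opening and closing marks is destroyed before the model ever sees the text.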

mmartin9684-sil commented 7 months ago

Another version of this problem (from a different project) relates to their use of left/right single and double quotes in their translation. Since quotes are normalized to the single/double straight quotes, and only these quotes are present in the drafts, the translation teams need to convert the straight quotes back to their preferred left/right single/double quotes as they do their editing.
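A post-processing step could restore directional marks by alternating on the straight quotes in the draft. The sketch below is a naive heuristic, not a proposed implementation: it assumes quotes are balanced and never nested, and real text (apostrophes, unbalanced quotes, quotations spanning verses) would need something smarter, such as the project's Paratext quotation rules. The function name and defaults are hypothetical.

```python
def restore_double_quotes(text: str,
                          open_q: str = "\u00AB",
                          close_q: str = "\u00BB") -> str:
    """Replace straight double quotes with alternating open/close marks.

    Naive sketch: assumes balanced, non-nested quotes. The open/close
    characters default to Spanish guillemets but could be the project's
    preferred left/right curly quotes instead.
    """
    out = []
    opening = True
    for ch in text:
        if ch == '"':
            out.append(open_q if opening else close_q)
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)

print(restore_double_quotes('"Hola", dijo.'))  # -> «Hola», dijo.
```

A per-project setting for the preferred quote characters would let the same pass serve both the guillemet and left/right curly-quote projects mentioned in this thread.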

mmartin9684-sil commented 6 months ago

If a project is using decomposed characters (e.g., "B" followed by a combining line below) rather than composed characters (e.g., "Ḇ") in their text, the preprocessing flow may normalize the decomposed characters into their NFKC form, and train the model to produce this form of the characters in the drafts. This means that the draft text will not follow the project's practices for normalization, and the translation team will need to change these characters back to the representation that they use. We should look at using the normalization setting configured in the Paratext project to postprocess the draft.
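The composed/decomposed round trip can be demonstrated with the standard library's unicodedata module. This is a minimal illustration, assuming the postprocessing step would simply renormalize the draft to whichever form the Paratext project setting specifies:

```python
import unicodedata

# "B" + U+0331 COMBINING MACRON BELOW: two code points, as a project
# using decomposed characters would store it.
decomposed = "B\u0331"

# Composing (here via NFC) yields U+1E06, the single precomposed
# character Ḇ that the model would learn to produce.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u1E06")   # True: one code point

# Renormalizing the draft back to NFD would restore the project's form.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Reading the project's normalization setting and applying the matching `unicodedata.normalize` call to the draft would spare the team this manual cleanup.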

mmartin9684-sil commented 6 months ago

Spaces in the project text are another potential normalization issue. Apparently there are as many as 14 different space characters (simple, em, en, etc.), and some projects need to use more than one in their text. These various spaces are likely being normalized to a single space during preprocessing, and drafts likely contain only this single standardized space character, giving the translation team another editing task as they work with the drafts.
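The collapsing behavior described above can be sketched with the standard library: every code point in Unicode category Zs (space separators, which include the no-break, en, em, and thin spaces) gets mapped to a plain U+0020. This is an assumption about what the preprocessing does, not its actual implementation, and the helper name is hypothetical:

```python
import unicodedata

def collapse_spaces(text: str) -> str:
    """Replace every Unicode space separator (category Zs) with U+0020.

    A sketch of the normalization that preprocessing likely applies;
    it is lossy, which is exactly the problem for projects that rely
    on distinct space characters.
    """
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )

# U+00A0 no-break, U+2003 em, U+2009 thin space: all become plain spaces.
print(collapse_spaces("a\u00A0b\u2003c\u2009d"))  # -> a b c d
```

Unlike the quote and composed-character cases, this mapping is many-to-one with no positional cue left in the draft, so restoring the original spaces automatically would likely require aligning against the source text.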