neulab / contextual-mt

A repository with the code related to experiments around context-aware machine translation

Malformed highlight tags #16

Open gsarti opened 3 years ago

gsarti commented 3 years ago

Dear authors,

Just wanted to point out that a good number of highlights in SCAT sentences appear to be malformed, most likely because the hon/hoff tags were inserted sequentially without accounting for the character offset introduced by previously inserted tags. Here are just a few examples, but there are many more:

highlighted.train.en

highlighted.test.fr

I don't think the amount of corrupted data is enough to significantly affect your results, but it could still be an issue. Would you consider implementing a well-formedness check for the tags and correcting the malformed examples? Thank you in advance!
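For reference, a check of this kind could look roughly like the sketch below. It assumes the highlights are marked with literal `<hon>` and `<hoff>` tokens, and that angle brackets do not otherwise occur in the sentences; both are assumptions about the data, and `is_well_formed` is a hypothetical helper, not something from the repository.

```python
import re

# Assumption: highlights are delimited by the literal tokens "<hon>" and "<hoff>".
TAG_RE = re.compile(r"<hon>|<hoff>")
# Assumption: once valid tags are removed, any remaining "<" or ">" is a tag fragment.
FRAGMENT_RE = re.compile(r"[<>]")


def is_well_formed(line: str) -> bool:
    """Return True if every <hon> is closed by a <hoff>, with no nesting or tag fragments."""
    inside = False
    for match in TAG_RE.finditer(line):
        if match.group() == "<hon>":
            if inside:  # a second <hon> before the previous one was closed
                return False
            inside = True
        else:
            if not inside:  # a <hoff> with no open highlight
                return False
            inside = False
    if inside:  # the last highlight was never closed
        return False
    # A tag inserted inside another tag (e.g. "<ho<hon>n>") leaves stray brackets behind.
    return FRAGMENT_RE.search(TAG_RE.sub("", line)) is None
```

Lines for which this returns `False` would be the candidates for manual correction.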

gsarti commented 2 years ago

Hi @kayoyin @CoderPat,

Could you please let me know if you intend to fix the issue with the highlights? Thank you in advance!

CoderPat commented 2 years ago

I didn't work on this part, but do you have statistics on the prevalence of the malformed tags? If it's very small and it doesn't break the code, it probably won't change the results much. Does it break your use case? If so, I would just recommend dropping those samples.
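Both suggestions, measuring prevalence and dropping the affected samples, could be sketched together as below. This is only an illustration under assumptions: the tags are taken to be literal `<hon>` and `<hoff>` tokens, the source and target files are taken to be line-aligned, and `tags_balanced` and `drop_malformed` are hypothetical names.

```python
import re
from typing import Iterable, List, Tuple

TAG_RE = re.compile(r"<hon>|<hoff>")  # assumed tag spelling


def tags_balanced(line: str) -> bool:
    """True if tags strictly alternate <hon>, <hoff>, ... and no stray brackets remain."""
    expected = "<hon>"
    for tag in TAG_RE.findall(line):
        if tag != expected:
            return False
        expected = "<hoff>" if tag == "<hon>" else "<hon>"
    # The line must end outside a highlight, with no leftover angle brackets.
    return expected == "<hon>" and not re.search(r"[<>]", TAG_RE.sub("", line))


def drop_malformed(
    src: Iterable[str], tgt: Iterable[str]
) -> Tuple[List[Tuple[str, str]], float]:
    """Keep pairs where both sides are well formed; also return the malformed fraction."""
    kept: List[Tuple[str, str]] = []
    total = 0
    for s, t in zip(src, tgt):
        total += 1
        if tags_balanced(s) and tags_balanced(t):
            kept.append((s, t))
    malformed_rate = (total - len(kept)) / total if total else 0.0
    return kept, malformed_rate
```

Feeding it the lines of an aligned file pair (for example `highlighted.train.en` and its French counterpart, if one exists under that naming) would give both the filtered pairs and the fraction of pairs that were dropped.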