Malformed highlight tags

gsarti commented 3 years ago

Dear authors,

Just wanted to point out that a good number of highlights in SCAT sentences appear to be malformed, most likely because the hon/hoff tags were inserted sequentially without accounting for the character offset introduced by the tags inserted before them. Here are just some examples, but there are many more:

highlighted.train.en

row 383: We get up and then <hon>it < <hon>hoff><hoff> ends up snowing a foot on us.
row 9907: <hon>They<hoff> didn't have enough mo <hon>ney to su < <hon>hoff<hoff> > pport themselves... so they go and have nine kids.
row 9916: Everybody reads the paper since you <hon <hon>> m<hoff> ade<hoff> it a daily.
row 10497: But that doesn't stop a young <hon>Platecarpus < <hon>hof<hoff> f> ... when it wants a snack.
row 10549: Just imagine how far away from us you'd have to move <hon>the Sun < <hon>hof<hoff> f> to make it appear as small and faint as a star.
row 10967: In cuba, with <hon>people < <hon>hoff><hoff> like me, they always found a reason to hit us.
row 11196: -You mean, when <hon <hon>> i<hoff> t < <hon>ho<hoff> ff> hardens, it-- -lt turns into plastic.
row 11211: And then if <hon>the boys<hoff> do want to farm, or if Laurie marries someone that would like <hon>to farm. <hoff <hon>> ..<hoff> and the boys don't want to... at least they have that college education to fall back on.
row 11225: If you ever get to be astronauts, you're going to thank us for making you wear <hon>these jumpsuits <hon><hoff<hoff> > because they provide ease of movement and additional storage space in orbit.
row 11213: <hon>The weather < <hon>ho<hoff> ff> does as it pleases.
row 11232: She was wearing a <hon>muumu <hon>u <<hoff> hoff <hon>><hoff> , but it had to be slit so she could fit into it.

highlighted.test.fr

row 750: J'ai fait un calcul rapide et il <hon>éta <hon>it <hoff<hoff> > peu probable qu'il dise un truc important ou qu'il fasse une interview télévisée, donc je ne pensais pas priver ma chaine de grand chose.
row 901: Ravi l'a attrapé, e <hon>t <hoff> noté l'adresse référencée sur le document que vous nous avez donné, et nous pensions qu'il pourrait <hon <hon>> être <hof<hoff> f> intéressant de le souligner.
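
To make the suspected cause concrete, here is a minimal sketch of how stale offsets could produce this kind of output. This is only my guess at the mechanism, not the actual SCAT preprocessing code: `insert_tags_naive`, `insert_tags_safe`, and the character spans are hypothetical names and values made up for illustration.

```python
HON, HOFF = "<hon>", "<hoff>"


def insert_tags_naive(text, spans):
    """Hypothetical buggy version: spans index the ORIGINAL text, but each
    insertion lengthens the string, so later spans land in the wrong place."""
    for start, end in spans:
        text = text[:start] + HON + text[start:end] + HOFF + text[end:]
    return text


def insert_tags_safe(text, spans):
    """Insert from right to left so that offsets of earlier spans stay valid.
    Assumes the spans do not overlap."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + HON + text[start:end] + HOFF + text[end:]
    return text


sentence = "We get up and then it ends up snowing."
spans = [(19, 21), (22, 26)]  # character spans of "it" and "ends"

print(insert_tags_naive(sentence, spans))
# We get up and then <ho<hon>n>it<hoff><hoff> ends up snowing.
print(insert_tags_safe(sentence, spans))
# We get up and then <hon>it<hoff> <hon>ends<hoff> up snowing.
```

The naive output is not identical to the rows above (those also show stray spaces, which suggests tokenization happened somewhere in between), but the second tag landing inside the first one is the same pattern.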

I don't think the amount of corrupted data is enough to significantly affect your results, but it could still be an issue. Would you consider implementing a well-formedness check for the tags and correcting the malformed examples? Thank you in advance!
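
For reference, a well-formedness check along these lines might look like the sketch below. It assumes that highlights are never nested and that the underlying sentences contain no literal `<` or `>` characters; both are my assumptions, not documented properties of SCAT, and `is_well_formed` is a hypothetical helper name.

```python
import re

TAG_RE = re.compile(r"<hon>|<hoff>")


def is_well_formed(line: str) -> bool:
    """Return True if every <hon> is closed by a <hoff> (no nesting) and no
    stray tag fragments remain once the complete tags are removed."""
    expect_open = True
    for match in TAG_RE.finditer(line):
        is_open = match.group() == "<hon>"
        if is_open != expect_open:
            return False  # nested <hon> or <hoff> without an opening tag
        expect_open = not expect_open
    if not expect_open:
        return False  # last <hon> was never closed
    # Assumption: the text itself has no angle brackets, so any bracket left
    # over after removing complete tags belongs to a broken tag.
    residue = TAG_RE.sub("", line)
    return "<" not in residue and ">" not in residue


print(is_well_formed("<hon>They<hoff> didn't have enough money."))
print(is_well_formed("We get up and then <hon>it < <hon>hoff><hoff> ends up snowing a foot on us."))
```

The residue step matters: in row 901 of highlighted.test.fr the complete tags happen to alternate correctly, and only the leftover `<hon ` and `<hof` fragments reveal the corruption.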

gsarti commented 2 years ago

Hi @kayoyin @CoderPat,

Could you please let me know if you intend to fix the issue with the highlights? Thank you in advance!

CoderPat commented 2 years ago

I didn't work on this part, but do you have statistics for the prevalence of the malformed tags? If it's very small and it doesn't break the code, it probably won't change the results much. Does it break your use case for them? I would just recommend dropping the samples in that case.
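
A rough sketch of how the prevalence could be measured and the affected samples dropped, assuming the source and target files are line-aligned and that the raw text contains no literal angle brackets (both assumptions; `is_ok` and `drop_malformed_pairs` are hypothetical names, not part of this repo):

```python
import re

# A line is accepted if it consists only of bracket-free text and complete,
# non-nested <hon>...<hoff> spans.
WELL_FORMED = re.compile(r"(?:[^<>]|<hon>[^<>]*<hoff>)*")


def is_ok(line: str) -> bool:
    return WELL_FORMED.fullmatch(line) is not None


def drop_malformed_pairs(src_lines, tgt_lines):
    """Keep only pairs where both sides are well-formed, so the two sides
    stay aligned. Returns (kept_pairs, number_dropped, fraction_dropped)."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines) if is_ok(s) and is_ok(t)]
    dropped = len(src_lines) - len(kept)
    rate = dropped / len(src_lines) if src_lines else 0.0
    return kept, dropped, rate


src = [
    "<hon>They<hoff> left.",
    "We get up and then <hon>it < <hon>hoff><hoff> ends up snowing.",
    "No tags here.",
]
tgt = [
    "<hon>Ils<hoff> sont partis.",
    "Ok.",
    "il <hon>éta <hon>it <hoff<hoff> > peu probable",
]
kept, dropped, rate = drop_malformed_pairs(src, tgt)
print(f"dropped {dropped} of {len(src)} pairs ({rate:.1%})")
```

Dropping a pair when either side is malformed keeps the two files aligned, at the cost of discarding a few sentences whose other side was fine.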

neulab / contextual-mt

Malformed highlight tags #16