Open gsarti opened 3 years ago
Hi @kayoyin @CoderPat,
Could you please let me know if you intend to fix the issue with the highlights? Thank you in advance!
I didn't work on this part, but do you have statistics on how prevalent the malformed tags are? If the share is very small and it doesn't break the code, it probably won't change the results much. Does it break your use case? If so, I would just recommend dropping those samples.
Dear authors,
Just wanted to point out that a good number of highlights in SCAT sentences appear to be malformed, most likely due to sequential insertion of <hon>/<hoff> tags without accounting for the overhead of previous ones. Here are just some examples, but there are many more:
highlighted.train.en
We get up and then <hon><p>it</p> < <hon>hoff><hoff> ends up snowing a foot on us.
<hon>They<hoff> didn't have enough mo <hon>ney to su < <hon>hoff<hoff> > pport themselves... so <p>they</p> go and have nine kids.
Everybody reads the paper since you <hon <hon>> m<hoff> ade<hoff> <p>it</p> a daily.
But that doesn't stop a young <hon>Platecarpus < <hon>hof<hoff> f> ... when <p>it</p> wants a snack.
Just imagine how far away from us you'd have to move <hon>the Sun < <hon>hof<hoff> f> to make <p>it</p> appear as small and faint as a star.
In cuba, with <hon>people < <hon>hoff><hoff> like me, <p>they</p> always found a reason to hit us.
-You mean, when <hon <hon>> i<hoff> t < <hon>ho<hoff> ff> hardens, <p>it</p>-- -lt turns into plastic.
And then if <hon>the boys<hoff> do want to farm, or if Laurie marries someone that would like <hon>to farm. <hoff <hon>> ..<hoff> and the boys don't want to... at least <p>they</p> have that college education to fall back on.
If you ever get to be astronauts, you're going to thank us for making you wear <hon>these jumpsuits <hon><hoff<hoff> > because <p>they</p> provide ease of movement and additional storage space in orbit.
<hon>The weather < <hon>ho<hoff> ff> does as <p>it</p> pleases.
She was wearing a <hon>muumu <hon>u <<hoff> hoff <hon>><hoff> , but <p>it</p> had to be sl<p>it</p> so she could f<p>it</p> into <p>it</p>.
highlighted.test.fr
J'ai fait un calcul rapide et <p>il</p> <hon>éta <hon>it <hoff<hoff> > peu probable qu'<p>il</p> dise un truc important ou qu'<p>il</p> fasse une interview télévisée, donc je ne pensais pas priver ma chaine de grand chose.
Ravi l'a attrapé, e <hon>t <hoff> noté l'adresse référencée sur le document que vous nous avez donné, et nous pensions qu'<p>il</p> pourrait <hon <hon>> être <hof<hoff> f> intéressant de le souligner.
I don't think the amount of corrupted data is enough to significantly affect your results, but it may still be an issue for downstream users. Would you consider implementing a well-formedness check for tags and correcting the malformed examples? Thank you in advance!
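For reference, here is a minimal sketch of what such a well-formedness check could look like. It is not taken from the SCAT codebase: the function names `is_well_formed` and `malformed_rate` are hypothetical, and it assumes that <hon>/<hoff> spans are never nested and that clean sentences contain no literal angle brackets outside the four known tags. It does not verify that the <p> tags themselves are placed correctly.

```python
import re

# Assumption: these four are the only tags used in the highlighted files.
HIGHLIGHT_RE = re.compile(r"<hon>|<hoff>")
KNOWN_TAG_RE = re.compile(r"<hon>|<hoff>|</?p>")


def is_well_formed(line: str) -> bool:
    """Return True if <hon>/<hoff> alternate correctly and no tag debris remains."""
    inside = False
    for match in HIGHLIGHT_RE.finditer(line):
        if match.group() == "<hon>":
            if inside:  # a second <hon> before the previous span was closed
                return False
            inside = True
        else:
            if not inside:  # <hoff> without a matching <hon>
                return False
            inside = False
    if inside:  # span left open at end of line
        return False
    # Assumption: any angle bracket left after removing known tags is debris
    # from a broken tag (e.g. "<hon" or "f>").
    residue = KNOWN_TAG_RE.sub("", line)
    return "<" not in residue and ">" not in residue


def malformed_rate(lines) -> float:
    """Fraction of lines failing the check (0.0 for empty input)."""
    lines = list(lines)
    if not lines:
        return 0.0
    return sum(not is_well_formed(line) for line in lines) / len(lines)
```

Running `malformed_rate` over each highlighted file would give the prevalence statistics asked about earlier in the thread, and filtering with `is_well_formed` would drop the affected samples. Note that the second French example above has correctly alternating <hon>/<hoff> tags and is only caught by the leftover-bracket test, so both parts of the check seem necessary.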