Open entalent opened 6 years ago
Hi, I ran into the same problem. Have you solved it?
What's more, there are many typos in the text annotations. How do you use this imperfect data?
I simply replaced the replacement characters with spaces and used the sentences for training, without correcting the typos. Maybe you can use a spell-check tool to correct them automatically.
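For anyone who wants to do the same, here is a minimal sketch of that replacement step. It assumes the captions are read as UTF-8 text; the function name `clean_caption` is only illustrative and is not part of the dataset's tooling. It also collapses the resulting runs of whitespace into a single space, which goes slightly beyond a plain character-for-character replacement.

```python
import re

# U+FFFD, the Unicode replacement character (0xEF 0xBF 0xBD in UTF-8).
REPLACEMENT_CHAR = "\ufffd"


def clean_caption(line: str) -> str:
    """Turn replacement characters into spaces and tidy the whitespace.

    Hypothetical helper: it leaves typos in the caption untouched.
    """
    # Swap every replacement character for a plain space.
    line = line.replace(REPLACEMENT_CHAR, " ")
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", line).strip()


if __name__ == "__main__":
    sample = "this\ufffd\ufffdbird\ufffd\ufffdhas\ufffd\ufffda\ufffd\ufffdwhite\ufffd\ufffdbelly"
    print(clean_caption(sample))
```

Applying this to each line of a caption file before tokenization should be enough for training, though it does nothing about the spelling errors.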
Yeah, I replaced the replacement characters with spaces too. At present, I am trying a spell-check tool called symspellpy for the correction. However, its results are not perfect either.
It seems that some txt files in the released cvpr2016_cub.tar.gz contain unusual characters. I downloaded the file from Google Drive twice, and the extracted data always has the same problem. For example, when I open text_c10/007.Parakeet_Auklet/Parakeet_Auklet_0065_795969.txt, line 5 is shown as
this��bird��has��a��white��belly,��black��body��and��wings,��a��slender��white��eye��patch,��and��a��short,��stubby��orange��bill.
As shown in a hex editor, each of these unusual characters is the byte sequence 0xEF 0xBF 0xBD (the replacement character, U+FFFD, in UTF-8 encoding). Although these characters can be replaced with spaces, after which the line looks OK, could you please look into this problem and release new txt files?
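To check how widespread the problem is without a hex editor, something like the following sketch could be used. It assumes the archive has been extracted to a local directory; the helper names `count_replacement_chars` and `find_affected_files` are made up for this example, and the root path in the usage note is a guess at the extracted layout.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD, the replacement character.
REPLACEMENT_BYTES = b"\xef\xbf\xbd"


def count_replacement_chars(data: bytes) -> int:
    """Count occurrences of the 0xEF 0xBF 0xBD sequence in raw bytes."""
    return data.count(REPLACEMENT_BYTES)


def find_affected_files(root: str) -> dict:
    """Map each .txt file under root that contains the sequence to its count."""
    affected = {}
    for path in sorted(Path(root).rglob("*.txt")):
        n = count_replacement_chars(path.read_bytes())
        if n:
            affected[str(path)] = n
    return affected
```

Calling `find_affected_files("text_c10")` (assuming that is where the caption folders were extracted) would list the affected files and how many replacement characters each one contains.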