seemingly corrupted text data

entalent commented 6 years ago

It seems that some txt files in the released cvpr2016_cub.tar.gz contain unusual characters. I downloaded the file from Google Drive twice, and the extracted data always has the same problem. For example, when I open text_c10/007.Parakeet_Auklet/Parakeet_Auklet_0065_795969.txt, line 5 is shown as: this��bird��has��a��white��belly,��black��body��and��wings,��a��slender��white��eye��patch,��and��a��short,��stubby�▒▒ ▒▒orange��bill. A hex editor shows that these unusual characters are the bytes 0xEF 0xBF 0xBD, which is the UTF-8 encoding of the replacement character U+FFFD. Replacing these characters with spaces makes the line look fine, but could you please look into this problem and release new txt files?
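To check how widespread the problem is, one could scan the extracted files for that byte sequence. This is a minimal sketch, not something from the repository: the function names `find_corrupted_lines` and `scan_tree` are hypothetical, and it assumes the captions sit in `.txt` files under a directory such as `text_c10`.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD, the bytes reported in the hex editor.
REPLACEMENT_BYTES = b"\xef\xbf\xbd"


def find_corrupted_lines(path):
    """Return (line_number, count) pairs for lines containing the U+FFFD bytes."""
    hits = []
    # Read in binary mode so nothing is decoded or altered before counting.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            count = raw.count(REPLACEMENT_BYTES)
            if count:
                hits.append((lineno, count))
    return hits


def scan_tree(root):
    """Map each affected .txt file under root to its list of corrupted lines."""
    report = {}
    for path in sorted(Path(root).rglob("*.txt")):
        hits = find_corrupted_lines(path)
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree("text_c10")` would then list every affected file together with the line numbers involved.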

ayumiymk commented 5 years ago

Hi, I see the same problem. Have you solved it?

ayumiymk commented 5 years ago

What's more, there are many typos in the text annotations. How do you use these imperfect data?

entalent commented 5 years ago

I simply replaced the replacement characters with spaces and used the sentences for training, without correcting the typos. Maybe you can use a spellcheck tool to correct them automatically.
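The replacement step described above can be sketched roughly as follows. The helper names `clean_caption` and `clean_file` are hypothetical, and collapsing each run of replacement characters and whitespace into a single space is an assumption, made because the corrupted lines show two replacement characters between most words.

```python
import re

# One or more U+FFFD replacement characters and/or whitespace characters.
_REPLACEMENT_RUN = re.compile("[\ufffd\\s]+")


def clean_caption(line):
    """Replace runs of U+FFFD (and surrounding whitespace) with one space."""
    return _REPLACEMENT_RUN.sub(" ", line).strip()


def clean_file(src, dst):
    """Write a cleaned copy of a caption file, one caption per line."""
    # errors="replace" turns any other undecodable bytes into U+FFFD as well,
    # so they are handled by the same substitution.
    with open(src, encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(clean_caption(line) + "\n")
```

Writing to a separate destination file keeps the original extraction intact in case a corrected release appears later.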

ayumiymk commented 5 years ago

Yeah, I replaced the replacement characters with spaces too. At present, I am trying to use a spellcheck tool named symspellpy for correction. However, its results are not perfect either.

reedscot / cvpr2016

seemingly corrupted text data #9