kavgan / nlp-in-practice

Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
http://kavita-ganesan.com/kavitas-tutorials/#.WvIizNMvyog
1.14k stars, 785 forks

binary data inside text lines of reviews_data.txt.gz of word2vec sample #2

Closed (lakeofsoft closed this issue 4 years ago)

lakeofsoft commented 5 years ago

There is a binary RAR archive embedded among the text lines of "reviews_data.txt.gz":

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

036572F0  6E 20 65 6C 20 65 71 75 69 70 61 6A 65 20 65 6E  n el equipaje en
03657300  20 65 6C 20 68 6F 74 65 6C 09 09 0D 0A 46 65 62   el hotel....Feb
03657310  20 32 31 20 32 30 30 39 20 09 74 72 E8 73 20 62   21 2009 .très b
03657320  6F 6E 20 72 61 70 70 6F 72 74 20 71 75 61 6C 69  on rapport quali
03657330  74 E9 20 70 72 69 78 09 09 0D 0A 4A 61 6E 20 34  té prix....Jan 4
03657340  20 32 30 30 39 20 09 43 61 72 69 6E 6F 20 6D 61   2009 .Carino ma
03657350  20 76 65 63 63 68 69 6F 2E 09 09 0D 0A 44 65 63   vecchio.....Dec
03657360  20 32 36 20 32 30 30 38 20 09 3F 3F 3F 3F 3F 3F   26 2008 .??????
03657370  3F 3F 3F 3F 09 09 0D 0A 4F 63 74 20 32 35 20 32  ????....Oct 25 2
03657380  30 30 38 20 09 74 72 E8 73 20 62 6F 6E 20 68 F4  008 .très bon hô
03657390  74 65 6C 09 09 0D 0A 53 65 70 20 32 33 20 32 30  tel....Sep 23 20
036573A0  30 38 20 09 65 78 63 65 6C 6C 65 6E 74 65 20 65  08 .excellente e
036573B0  78 70 E9 72 69 65 6E 63 65 09 09 0D 0A 52 61 72  xpérience....Rar
036573C0  21 1A 07 00 CF 90 73 00 00 0D 00 00 00 00 00 00  !...Ï.s.........
036573D0  00 07 2A 74 80 90 4E 00 EF 72 03 00 B2 E0 0A 00  ..*t€.N.ïr..²à..
036573E0  02 6A 2C 9E 26 17 52 83 3B 1D 33 29 00 20 00 00  .j,ž&.Rƒ;.3). ..
036573F0  00 75 73 61 5F 6E 65 76 61 64 61 5F 6C 61 73 2D  .usa_nevada_las-
03657400  76 65 67 61 73 5F 72 69 76 69 65 72 61 5F 68 6F  vegas_riviera_ho
03657410  74 65 6C 5F 63 61 73 69 6E 6F 00 B0 72 6F 91 14  tel_casino.°ro‘.
03657420  1D 51 0C CC D1 51 90 19 D9 7E CF 35 AC E8 72 AF  .Q.ÌÑQ..Ù~Ï5¬èr¯
03657430  4F 31 A5 96 49 A6 93 6E AA 9D 79 F0 E6 3F 4D 55  O1¥–I¦“nª.yðæ?MU
03657440  2B E3 A6 DD B6 A9 BF 2B E8 0A 20 A4 38 89 00 D8  +ã¦Ý¶©¿+è. ¤8‰.Ø
03657450  00 A3 46 BA 32 F3 3A 0F 3B 47 35 6D D7 A1 A4 4A  .£Fº2ó:.;G5mס¤J
03657460  8D EE 22 26 64 12 9F 2F CE 80 BB E7 38 DB 24 09  .î"&d.Ÿ/΀»ç8Û$.
03657470  13 E8 89 8F 6C C0 3D 21 1F C2 6B F7 ED CE F7 20  .è‰.lÀ=!.Âk÷íÎ÷ 

That RAR archive contains a single file named "usa_nevada_las-vegas_riviera_hotel_casino", which holds some lines duplicated from the .gz file.

This causes a UnicodeDecodeError exception under Windows.

As a workaround, you can use open(...., errors='replace'), which substitutes the replacement character (U+FFFD, usually displayed as a question mark) for every byte that cannot be decoded.
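A minimal sketch of that workaround, assuming the file is read with the "cp1252" encoding. The sample bytes are a hypothetical stand-in built from the hex dump above (one review line followed by the start of the RAR header), and the temporary file name and the `read_text` helper are made up for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for the damaged file: one review line, then the first
# bytes of the RAR header from the hex dump (byte 0x90 is undefined in cp1252).
sample = b"excellente exp\xe9rience\t\t\r\nRar!\x1a\x07\x00\xcf\x90s\x00\x00\r\n"


def read_text(path, errors="strict"):
    """Read the whole file as cp1252 text using the given error handler."""
    with open(path, encoding="cp1252", errors=errors, newline="") as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as fh:
        fh.write(sample)

    # Strict decoding (the default) stops at the first undecodable byte.
    try:
        read_text(path)
        strict_failed = False
    except UnicodeDecodeError as exc:
        strict_failed = True
        print("strict decoding failed:", exc)

    # errors="replace" keeps reading and marks undecodable bytes with U+FFFD.
    replaced = read_text(path, errors="replace")
    print(repr(replaced))
```

Note that only the bytes the codec cannot map are replaced. The rest of the archive still decodes to meaningless characters, so the embedded data stays in the text unless it is filtered out separately.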

kavgan commented 5 years ago

Thanks @lakeofsoft, I'll try to look into it in the next few days.

kavgan commented 5 years ago

I re-uploaded the data files. If that doesn't work, you can now download the unzipped text file as well: https://github.com/kavgan/nlp-text-mining-working-examples/tree/master/word2vec

lakeofsoft commented 5 years ago

Sorry, but I can't see any difference. Both text files (the one inside the .gz and the unpacked "reviews_data.txt") still contain binary data which appears to be a RAR archive.

Simply open the file in any text viewer/editor and search for the string "Rar!", or go directly to line #53936 (byte offset 0x036573B0).
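The same search can be scripted. The sketch below looks for the RAR marker bytes that appear in the hex dump and reports the byte offset and line number of the first match. The `find_signature` helper and the sample bytes are hypothetical; the sample only imitates a few lines of the real file.

```python
# Marker bytes as they appear in the hex dump: 52 61 72 21 1A 07 00
RAR_SIGNATURE = b"Rar!\x1a\x07\x00"


def find_signature(data, signature=RAR_SIGNATURE):
    """Return (byte_offset, line_number) of the first match, or None."""
    offset = data.find(signature)
    if offset == -1:
        return None
    # Line numbers are 1-based: count the newlines that precede the match.
    line_number = data.count(b"\n", 0, offset) + 1
    return offset, line_number


# Hypothetical stand-in for the file contents: two review lines, then the marker.
sample = (
    b"Oct 25 2008 \ttr\xe8s bon h\xf4tel\t\t\r\n"
    b"Sep 23 2008 \texcellente exp\xe9rience\t\t\r\n"
    b"Rar!\x1a\x07\x00\xcf\x90s"
)

result = find_signature(sample)
print(result)
```

To check the actual download, read it in binary mode (for example `open("reviews_data.txt", "rb").read()`, assuming the file is in the working directory) and pass the bytes to `find_signature`. Binary mode avoids the decoding step entirely, so the scan works regardless of the platform's default encoding.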

As a side note, for anyone who (like me) is struggling to get the actual "reviews_data.txt" content (and not just the Git LFS pointer file), these commands might help:

> git lfs install
> git lfs pull

kavgan commented 5 years ago

🤔 I wonder why it doesn't complain on my Mac. Are you using a Windows-based machine?

kavgan commented 5 years ago

Ahh, just saw that it's Windows.

lakeofsoft commented 5 years ago

Yes, open(...) uses locale.getpreferredencoding(False) when the encoding is not specified, and that default is platform dependent. You can try it with:

open(...., encoding="mac_latin2") # will not crash (default for Mac)

or

open(...., encoding="cp1252") # will crash (default for Windows)

On Android, where the default encoding is "utf_8", it will read nothing.
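To see how the outcome depends on the encoding, the sketch below prints the default that open() would pick on the local machine and then tries a few encodings on bytes taken from the hex dump. The sample bytes and the `decodes_cleanly` helper are hypothetical, "latin_1" is added only for comparison and is not mentioned in this thread, and the platform defaults named above are the commenter's observations rather than something this snippet verifies.

```python
import locale

# Hypothetical sample built from the hex dump: a Latin-1 style review line
# followed by the first bytes of the embedded RAR header.
sample = b"Oct 25 2008 \ttr\xe8s bon h\xf4tel\t\t\r\nRar!\x1a\x07\x00\xcf\x90s"


def decodes_cleanly(data, encoding):
    """Return True if data decodes in strict mode with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The encoding open() falls back to when no encoding argument is given.
print("default on this machine:", locale.getpreferredencoding(False))

results = {
    enc: decodes_cleanly(sample, enc)
    for enc in ("mac_latin2", "cp1252", "utf_8", "latin_1")
}
for enc, ok in results.items():
    print(f"{enc}: {'decodes' if ok else 'raises UnicodeDecodeError'}")
```

Passing an explicit encoding (and, where appropriate, an explicit errors handler) to open() makes the behaviour the same on every platform instead of depending on the locale.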