kavgan / nlp-in-practice

Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
http://kavita-ganesan.com/kavitas-tutorials/#.WvIizNMvyog

Dataset file is not a gzip file #1

Closed oersoy1 closed 4 years ago

oersoy1 commented 6 years ago
$ tar -zxvf reviews_data.txt.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Inspect file:

$ file reviews_data.txt.gz 
reviews_data.txt.gz: HTML document, UTF-8 Unicode text, with very long lines

$ head reviews_data.txt.gz

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">

Python gives a gzip error as well:

/usr/lib/python3.5/gzip.py in _read_gzip_header(self)
    407 
    408         if magic != b'\037\213':
--> 409             raise OSError('Not a gzipped file (%r)' % magic)
    410 
    411         (method, flag,

OSError: Not a gzipped file (b'\n\n')

lakeofsoft commented 5 years ago

What you got is not the original .gz file but the GitHub HTML page that describes it, which is why the file command reports an HTML document.
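
A quick way to tell the two cases apart before unpacking is to look at the first two bytes of the download. The sketch below is only an illustration: `looks_like_gzip` is a hypothetical helper name, and the two demo files are throwaway stand-ins created on the spot, not the real dataset.

```python
import gzip
import os
import tempfile

# The gzip magic number; b"\x1f\x8b" is the same value as b'\037\213' in gzip.py.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


# Demo on two throwaway files: a real gzip file and a saved HTML page
# that merely carries a .gz extension (the situation in this issue).
tmp_dir = tempfile.mkdtemp()
real_gz = os.path.join(tmp_dir, "real.txt.gz")
fake_gz = os.path.join(tmp_dir, "fake.txt.gz")

with gzip.open(real_gz, "wb") as fh:
    fh.write(b"a review line\n")
with open(fake_gz, "wb") as fh:
    fh.write(b"\n\n<!DOCTYPE html>\n<html lang=\"en\">\n")

print(looks_like_gzip(real_gz))
print(looks_like_gzip(fake_gz))
```

If the check fails on the downloaded file, the download itself is the problem, and no tar or gzip option will fix it.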

kavgan commented 5 years ago

Also, you can use gunzip reviews_data.txt.gz to decompress it.
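
For anyone without gunzip at hand, roughly the same step can be done from Python's standard library. This is a minimal sketch, assuming the file is a plain gzip stream and not a tar archive; `gunzip_file` is a hypothetical helper name, and the demo file is created locally for illustration only. Unlike gunzip, it leaves the compressed file in place.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(src_path, dst_path):
    """Decompress a gzip file at `src_path` into `dst_path`.

    Raises OSError if `src_path` is not a gzip file.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Demo on a throwaway gzip file (not the real dataset).
tmp_dir = tempfile.mkdtemp()
demo_gz = os.path.join(tmp_dir, "demo.txt.gz")
demo_txt = os.path.join(tmp_dir, "demo.txt")

with gzip.open(demo_gz, "wb") as fh:
    fh.write(b"first review\nsecond review\n")

gunzip_file(demo_gz, demo_txt)

with open(demo_txt, "rb") as fh:
    restored = fh.read()
print(restored)
```

Run against an HTML page saved with a .gz extension, `gunzip_file` would raise the same OSError as in the traceback above.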

kavgan commented 5 years ago

Duplicate of #2