jcrobak / parquet-python

python implementation of the parquet columnar file format.
Apache License 2.0

Two different errors when reading two different files #54

Open · Khris777 opened this issue 7 years ago

Khris777 commented 7 years ago

I'm using parquet on Windows 10, and I have two different parquet files for testing: one is snappy-compressed, the other is uncompressed.

Simple test code for reading:

import parquet

with open(filename, 'r') as f:
    for row in parquet.reader(f):
        print row

The uncompressed file throws this error:

  File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
    for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
    dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 275, in read_data_page
    raw_bytes = _read_page(fo, page_header, column_metadata)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 244, in _read_page
    page_header.uncompressed_page_size)

AssertionError: found 87 raw bytes (expected 367)

Reading the compressed file like that gives:

  File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
    for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
    footer = _read_footer(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 71, in _read_footer
    footer_size = _get_footer_size(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 64, in _get_footer_size
    tup = struct.unpack("<i", fo.read(4))

error: unpack requires a string argument of length 4

I can open both files with fastparquet 0.0.5 just fine, so there's nothing wrong with the files themselves.

What am I doing wrong? Do I have to explicitly uncompress the data with snappy, or does parquet do that by itself? Could you provide some more documentation on basic usage in general?
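For reference, a minimal sanity check that python-snappy works on its own (a hedged sketch; parquet-python is expected to call snappy.decompress() on compressed pages internally, so manual decompression shouldn't be needed):

import snappy

# Round-trip through python-snappy; if this fails, the problem is the
# snappy install rather than the parquet files.
payload = b"snappy roundtrip test " * 20
assert snappy.decompress(snappy.compress(payload)) == payload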

jcrobak commented 7 years ago

Hi @Khris777 - what version of Python are you on?

Khris777 commented 7 years ago

I'm using Python 2.7.

jcrobak commented 7 years ago

@Khris777 can you try opening the files in binary mode, i.e. with open(filename, 'rb')?
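For background, a minimal sketch of why text mode can break this on Windows, assuming the standard Parquet footer layout (a 4-byte little-endian footer length followed by the magic bytes PAR1 at the very end of the file):

import struct

with open(filename, 'rb') as f:           # binary mode preserves raw bytes
    f.seek(-8, 2)                         # footer length + b'PAR1' magic
    footer_size, = struct.unpack("<i", f.read(4))

# In text mode ('r') on Windows, '\r\n' pairs are collapsed to '\n' and a
# stray 0x1a byte acts as EOF, so the same read can return fewer than 4
# bytes and struct.unpack fails with "unpack requires a string argument
# of length 4".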

Khris777 commented 7 years ago

Using binary mode leads to the script not finishing at all.

It does not lock up, it just runs on and on. The two files are both under 1 MB, so this is odd.

When I kill the process after several minutes, it raises the usual KeyboardInterrupt and reports the line it was on; the exact traceback varies between runs. Some examples:

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
    dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
    dict_values_io_obj, bit_width, len(dict_values_bytes))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 213, in read_rle_bit_packed_hybrid
    debug_logging = logger.isEnabledFor(logging.DEBUG)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\logging\__init__.py", line 1366, in isEnabledFor
    return level >= self.getEffectiveLevel()

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\logging\__init__.py", line 1355, in getEffectiveLevel
    if logger.level:

===============================

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
    dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 343, in read_data_page
    values = encoding.read_rle_bit_packed_hybrid(

===============================

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
    dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
    dict_values_io_obj, bit_width, len(dict_values_bytes))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 222, in read_rle_bit_packed_hybrid
    while io_obj.tell() < length:
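For what it's worth, the loop in that last frame can spin forever once the stream is exhausted, because read() then returns an empty string and tell() stops advancing. A hedged illustration with hypothetical values:

import io

io_obj, length = io.BytesIO(b"abc"), 10   # length computed from bad bytes
while io_obj.tell() < length:
    if not io_obj.read(1):                # b'' at EOF; tell() is stuck at 3
        break                             # without this guard, an endless loop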

jcrobak commented 7 years ago

Any chance you could try to reproduce this on the 1.2 release that was just published?

Khris777 commented 7 years ago

I will once I figure out why the latest python-snappy version fails to install.

Khris777 commented 7 years ago

Okay, here is where things stand now.

I installed parquet and snappy into my Python 3.6 environment, and there parquet works flawlessly; I can read everything just like I can using fastparquet. It was a fresh install: I fetched a precompiled snappy wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and got the latest parquet with pip.

On Python 2.7, however, it still doesn't work. I updated the parquet package normally using pip, after also installing the precompiled snappy wheel for 2.7.

I have the same data in three different formats, uncompressed, snappy-compressed, and gzip-compressed. All three always throw the same error so it doesn't seem to be a compression problem.

My testing code:

import parquet

r1 = []
filename = "E:\\Temp\\uncompressedParquetFile.parquet"
with open(filename, 'rb') as f:
    for row in parquet.reader(f):
        r1.append(row)

throws this error:

Traceback (most recent call last):

  File "<ipython-input-9-bb9230901f59>", line 1, in <module>
    runfile('E:/PythonDir/Diverses/parquetTest.py', wdir='E:/PythonDir/Diverses')

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "E:/PythonDir/Diverses/parquetTest.py", line 22, in <module>
    for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
    dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
    dict_values_io_obj, bit_width, len(dict_values_bytes))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 227, in read_rle_bit_packed_hybrid
    res += read_bitpacked(io_obj, header, width, debug_logging)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 146, in read_bitpacked
    b = raw_bytes[current_byte]

IndexError: list index out of range
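This looks consistent with the decoder walking past the bytes it actually has: read_bitpacked indexes into the page's raw bytes, so if fewer bytes were read than the bit width implies, the index runs off the end. A hedged illustration with hypothetical values:

raw_bytes = list(b"\x8b\x0a")    # only 2 bytes were actually read
current_byte = 2                 # the decoder expects a 3rd byte
b = raw_bytes[current_byte]      # IndexError: list index out of range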

Without binary mode (with open(filename,'r') as f:) it's this error:

Traceback (most recent call last):

  File "<ipython-input-10-bb9230901f59>", line 1, in <module>
    runfile('E:/PythonDir/Diverses/parquetTest.py', wdir='E:/PythonDir/Diverses')

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "E:/PythonDir/Diverses/parquetTest.py", line 22, in <module>
    for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
    footer = _read_footer(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 78, in _read_footer
    fmd.read(pin)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\thrift.py", line 112, in read
    iprot.read_struct(self)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 299, in read_val
    result.append(self.read_val(v_type, v_spec))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 299, in read_val
    result.append(self.read_val(v_type, v_spec))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 250, in read_struct
    fname, ftype, fid = self.read_field_begin()

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 181, in read_field_begin
    return None, self._get_ttype(type), fid

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 134, in _get_ttype
    return TTYPES[byte & 0x0f]

KeyError: 14
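For reference, thrift's compact protocol stores a field's type id in the low nibble of each field-header byte, and the defined ids stop well below 14, so a byte mangled by text-mode newline translation can easily produce an id that thriftpy's TTYPES table lacks. A hedged sketch with a hypothetical byte value:

mangled_byte = 0x2e                 # hypothetical value after text-mode mangling
assert mangled_byte & 0x0f == 14    # not a defined compact-protocol type id -> KeyError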

jcrobak commented 7 years ago

Oh interesting. I'd love to try to recreate this issue. How are you generating the parquet file that it fails on?

Khris777 commented 7 years ago

The files are generated in Java on a Cloudera Hadoop cluster (version 5.4.4) by a colleague. I asked him for some code, and he gave me the parts that write the parquet file; it's part of a larger file, though:

import java.io.File;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.reflect.ReflectData;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroSchemaConverter;
import parquet.avro.AvroWriteSupport;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.schema.MessageType;

// WriterVersion comes from an import in the full file
public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;

Schema avroSchema = new Schema.Parser().parse(avroSchemaFile);

MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);

AvroWriteSupport writeSupport = new AvroWriteSupport(parquetSchema, avroSchema);

File parquetFile = new File("parquetFile.parquet");

Path parquetFilePath = new Path(parquetFile.toURI());

try (ParquetWriter<IndexedRecord> parquetFileWriter =
        new ParquetWriter<IndexedRecord>(parquetFilePath, writeSupport, CompressionCodecName.SNAPPY, ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE))
{
    for (UploadedXmlDTO uploadedXML : uploadedXMLs) 
    {
        GenericRecord record = new GenericData.Record(avroSchema);

        record.put("date", uploadedXML.getDate());
        record.put("xml", ByteBuffer.wrap(uploadedXML.getXml()));

        parquetFileWriter.write(record);
    }
}

Maybe this helps a little; I can't provide the files themselves because of company policy.