ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

PyArrow returns empty data frame #75

Open fabiotakaki opened 5 years ago

fabiotakaki commented 5 years ago

I don't know why, but a file generated by parquetjs reads fine with pyspark, while reading the same file with pyarrow returns an empty DataFrame.

In this example, Spark returns the table normally:

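from pyspark.sql import SparkSession

parquet_file = "./file.parquet"  # Should be some file on your system
spark = SparkSession.builder.appName("TestingParquet").getOrCreate()
parquetFile = spark.read.parquet(parquet_file)

# Register the DataFrame as a temporary view so it can be queried with SQL
parquetFile.createOrReplaceTempView("parquetFile")
rows = spark.sql("SELECT * FROM parquetFile")  # renamed to avoid shadowing the built-in list
rows.show()

spark.stop()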

When I try with pyarrow (through pandas):

import pandas as pd

if __name__ == "__main__":
    df = pd.read_parquet('file.parquet', engine='pyarrow')
    print(df)
    print(df.dtypes)

It just returns an empty DataFrame containing only the column headers. Has anyone else run into this problem?

szdominik commented 5 years ago

I think you've already found the solution, but for everyone else it is worth noting that the problem is probably related to the DATA_PAGE/DATA_PAGE_V2 specifications (see https://github.com/ZJONSSON/parquetjs/issues/24#issuecomment-416009322), and the solution is described here: https://github.com/ZJONSSON/parquetjs#notes
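
For anyone trying to confirm that they are hitting the same problem, one rough check is to compare the row count recorded in the file footer with the number of rows pyarrow actually decodes. This is only a sketch: the function names `summarize_row_counts` and `check_parquet_file` are made up for illustration, and it assumes the footer written by parquetjs carries the correct row count. A mismatch would suggest the reader is skipping the data pages, not that the file is empty.

```python
def summarize_row_counts(footer_rows, decoded_rows):
    """Describe whether a reader decoded every row the footer promises."""
    if decoded_rows == footer_rows:
        return "ok: decoded all {} rows".format(footer_rows)
    return "mismatch: footer says {} rows, reader decoded {}".format(
        footer_rows, decoded_rows)


def check_parquet_file(path):
    """Compare the footer row count with the rows pyarrow actually decodes."""
    # Imported lazily so the helper above can be used without pyarrow installed.
    import pyarrow.parquet as pq

    # Row count as declared in the file metadata (footer).
    footer_rows = pq.ParquetFile(path).metadata.num_rows
    # Row count of the table pyarrow manages to materialize.
    decoded_rows = pq.read_table(path).num_rows
    return summarize_row_counts(footer_rows, decoded_rows)
```

Calling `print(check_parquet_file("file.parquet"))` on the file from the report above would show whether the two counts agree on your pyarrow version.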