ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

invalid parquet version error for parquet files generated via python script #144

Open saritvakrat opened 10 months ago

saritvakrat commented 10 months ago

Hi, I am trying to read Parquet files that are stored in S3 and were generated by a Python script. I get the following error: `Error: thrown: "invalid parquet version"`. When I read a similar file that was generated by Spark, the reader parses it without problems.

I am also able to parse the Python-generated file and open it in a Parquet viewer.

Any idea why? The file is Parquet version 2. File metadata:

file written by pyarrow 11.0.0
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 6
num_rows: 42
num_row_groups: 1
format_version: 2.6
serialized_size: 3975

Full error:

(node:41711) V8: /Users/saritvakrat/Documents/automation/be_automation/node_modules/brotli/build/encode.js:34 Linking failure in asm.js: Unexpected stdlib member
(Use `node --trace-warnings ...` to show where the warning was created)
console.error
Error parsing Parquet file: invalid parquet version

  39 |         return records;
  40 |     } catch (error) {
> 41 |         console.error('Error parsing Parquet file:', error);
     |                 ^
  42 |         throw error; // Rethrow the error to be handled by the caller
  43 |     }
  44 | }

Packages:
"parquetjs": "^0.11.2",
"@types/parquetjs": "^0.10.6",

My function:

`export async function parseParquetFile(filePath: string): Promise<any[]> {
    try {
        // create new ParquetReader
        const reader = await ParquetReader.openFile(filePath) as any;
        // create a new cursor
        const cursor = reader.getCursor();
        const records = [];
        // read all records from the file and print them
        let record = await cursor.next();
        while (record !== null) {
            records.push(record);
            record = await cursor.next();
        }
        await reader.close();
        return records;
    } catch (error) {
        console.error('Error parsing Parquet file:', error);
        throw error; // Rethrow the error to be handled by the caller
    }
}`
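
The wording `thrown: "invalid parquet version"` in the report reads like what a test runner prints when a plain string, rather than an `Error` object, is thrown. That is an assumption based on the message format, not something confirmed in this thread. If it holds, the `catch` block above receives a string with no stack trace. A minimal sketch of normalizing such values, where `toError` is a hypothetical helper and not part of parquetjs:

```typescript
// Hypothetical helper (not part of parquetjs): wraps any thrown value in an Error
// so that callers always get a message and a stack trace.
function toError(thrown: unknown): Error {
  if (thrown instanceof Error) {
    // Already an Error: return it unchanged to keep its original stack.
    return thrown;
  }
  // Strings, numbers, and other values are converted to text and wrapped.
  return new Error(`Non-Error value thrown: ${String(thrown)}`);
}
```

Inside the `catch (error)` block of a function like `parseParquetFile`, rethrowing `toError(error)` instead of `error` would give the caller a real `Error` regardless of what the library threw.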

`async parseSingleParquetFromS3(bucketName: string, key: string | null | undefined): Promise<any[]> {
    if (!bucketName || !key) {
        throw new Error('Bucket name or key is not provided');
    }

    const getObjectCommand = new GetObjectCommand({
        Bucket: bucketName,
        Key: key
    });

    let objectResponse;
    try {
        objectResponse = await this.s3Client.send(getObjectCommand);
    } catch (error) {
        console.error(`Error fetching object from S3: ${error}`);
        throw error;
    }

    const objectData = objectResponse.Body;
    if (!(objectData instanceof Readable)) {
        throw new Error('Object data is not a readable stream');
    }

    const fileName = key.split('/').pop() || 'temp.parquet';
    const tempFilePath = join(tmpdir(), fileName);

    try {
        await pipeline(objectData, createWriteStream(tempFilePath));
        return await parseParquetFile(tempFilePath);
    } catch (error) {
        console.error(`Error in streaming data to file: ${error}`);
        throw error;
    }
}`
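
Because the object is streamed from S3 into a temp file before parsing, one thing worth ruling out is a truncated or altered download. A Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The sketch below checks only that envelope on bytes already loaded into memory; it does not decode the footer or the version field. `inspectParquetEnvelope` and `EnvelopeInfo` are hypothetical names, not part of parquetjs.

```typescript
// Diagnostic sketch (hypothetical, not part of parquetjs): inspects the
// Parquet envelope, i.e. the leading/trailing "PAR1" magic and footer length.
interface EnvelopeInfo {
  hasLeadingMagic: boolean;
  hasTrailingMagic: boolean;
  footerLength: number | null; // null when the trailing magic is missing
  footerFits: boolean; // true when the footer length is plausible for the file size
}

const PARQUET_MAGIC = [0x50, 0x41, 0x52, 0x31]; // ASCII "PAR1"

function matchesMagic(bytes: Uint8Array, offset: number): boolean {
  if (offset < 0 || offset + PARQUET_MAGIC.length > bytes.length) {
    return false;
  }
  return PARQUET_MAGIC.every((expected, i) => bytes[offset + i] === expected);
}

function inspectParquetEnvelope(bytes: Uint8Array): EnvelopeInfo {
  const hasLeadingMagic = matchesMagic(bytes, 0);
  const hasTrailingMagic = matchesMagic(bytes, bytes.length - 4);
  let footerLength: number | null = null;
  // Smallest possible layout: 4 (magic) + 4 (footer length) + 4 (magic) = 12 bytes.
  if (hasTrailingMagic && bytes.length >= 12) {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    footerLength = view.getUint32(bytes.length - 8, true); // little-endian
  }
  const footerFits = footerLength !== null && footerLength <= bytes.length - 12;
  return { hasLeadingMagic, hasTrailingMagic, footerLength, footerFits };
}
```

In Node, the input could come from `fs.readFileSync(tempFilePath)`, since a `Buffer` is a `Uint8Array`. If all checks pass for the Python-generated file, the download is probably intact and the problem lies in how the reader interprets the footer metadata.
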
WestenMichael commented 10 months ago

I have the same issue.

WestenMichael commented 10 months ago

As a workaround, I used https://www.npmjs.com/package/@dsnp/parquetjs

saritvakrat commented 10 months ago

@WestenMichael I tried this package as well, but it fails with a different error: "invalid encoding: RLE_DICTIONARY"