Hi, I am trying to read Parquet files that are stored in S3 and were generated by a Python script.
I get the following error:
Error: thrown: "invalid parquet version"
When I try to read a similar file that was generated by Spark, parquetjs reads it without any problem.
I am also able to open the Python-generated file in a Parquet viewer.
Any idea why? The file uses Parquet format version 2.
File metadata:
file written by pyarrow 11.0.0
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 6
num_rows: 42
num_row_groups: 1
format_version: 2.6
serialized_size: 3975
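Since the error mentions the version, one thing worth comparing between the pyarrow file and the Spark file is the `version` field stored in the file footer. Below is a rough sketch of such a check, not something taken from parquetjs. The names `inspectParquetEnvelope` and `ParquetEnvelopeInfo` are made up for illustration. It assumes the footer is unencrypted, Thrift compact-protocol encoded, and that `version` (field 1 of `FileMetaData`) is written first as a single-byte varint; if any of that does not hold, it reports `null` instead of guessing.

```typescript
// Hypothetical diagnostic helper; not part of parquetjs.
export interface ParquetEnvelopeInfo {
  headerMagic: string;          // first 4 bytes, expected "PAR1"
  footerMagic: string;          // last 4 bytes, expected "PAR1"
  footerLength: number;         // size of the serialized FileMetaData
  footerVersion: number | null; // FileMetaData.version, or null if not decodable
}

export function inspectParquetEnvelope(data: Buffer): ParquetEnvelopeInfo {
  // Smallest possible layout: magic (4) + footer length (4) + magic (4).
  if (data.length < 12) {
    throw new Error(`Too small to be a Parquet file: ${data.length} bytes`);
  }

  const headerMagic = data.toString('latin1', 0, 4);
  const footerMagic = data.toString('latin1', data.length - 4);
  // The 4 bytes before the trailing magic hold the footer length (little-endian).
  const footerLength = data.readUInt32LE(data.length - 8);
  const footerStart = data.length - 8 - footerLength;

  let footerVersion: number | null = null;
  if (footerLength >= 2 && footerStart >= 4) {
    const fieldHeader = data.readUInt8(footerStart);
    const firstVarintByte = data.readUInt8(footerStart + 1);
    // Assumption: 0x15 means "field id 1, type i32" in the Thrift compact
    // protocol, followed by a zigzag varint that fits in one byte.
    if (fieldHeader === 0x15 && firstVarintByte < 0x80) {
      footerVersion = (firstVarintByte >>> 1) ^ -(firstVarintByte & 1);
    }
  }

  return { headerMagic, footerMagic, footerLength, footerVersion };
}
```

Calling it as `inspectParquetEnvelope(readFileSync(tempFilePath))` (with `readFileSync` from `fs`) on both the pyarrow file and the Spark file should show whether the two differ in `footerVersion`.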
Full error:
(node:41711) V8: /Users/saritvakrat/Documents/automation/be_automation/node_modules/brotli/build/encode.js:34 Linking failure in asm.js: Unexpected stdlib member (Use `node --trace-warnings ...` to show where the warning was created)
console.error
Error parsing Parquet file: invalid parquet version
39 | return records;
40 | } catch (error) {
> 41 | console.error('Error parsing Parquet file:', error);
| ^
42 | throw error; // Rethrow the error to be handled by the caller
43 | }
44 | }
Packages:
"parquetjs": "^0.11.2",
"@types/parquetjs": "^0.10.6",
My function:
`export async function parseParquetFile(filePath: string): Promise<any[]> {
  try {
    // create new ParquetReader
    const reader = await ParquetReader.openFile(filePath) as any;
    // create a new cursor
    const cursor = reader.getCursor();
    const records = [];
    // read all records from the file and collect them
    let record = await cursor.next();
    while (record !== null) {
      records.push(record);
      record = await cursor.next();
    }
    await reader.close();
    return records;
  } catch (error) {
    console.error('Error parsing Parquet file:', error);
    throw error; // Rethrow the error to be handled by the caller
  }
}`
`async parseSingleParquetFromS3(bucketName: string, key: string | null | undefined): Promise<any[]> {
  if (!bucketName || !key) {
    throw new Error('Bucket name or object key is not provided');
  }
  const getObjectCommand = new GetObjectCommand({
    Bucket: bucketName,
    Key: key
  });
  let objectResponse;
  try {
    objectResponse = await this.s3Client.send(getObjectCommand);
  } catch (error) {
    console.error(`Error fetching object from S3: ${error}`);
    throw error;
  }
  const objectData = objectResponse.Body;
  if (!(objectData instanceof Readable)) {
    throw new Error('Object data is not a readable stream');
  }
  // Download the object to a temp file, then parse it with parquetjs
  const fileName = key.split('/').pop() || 'temp.parquet';
  const tempFilePath = join(tmpdir(), fileName);
  try {
    await pipeline(objectData, createWriteStream(tempFilePath));
    return await parseParquetFile(tempFilePath);
  } catch (error) {
    console.error(`Error in streaming data to file: ${error}`);
    throw error;
  }
}`