Closed sadikovi closed 7 years ago
Example code. I am running it with `cargo run --bin example` (I placed it in `src/bin/example.rs` in the repository). I will convert it into a usage example, if that is okay.
```rust
extern crate parquet;

use std::fs::File;
use std::path::Path;

use parquet::basic::*;
use parquet::data_type::*;
use parquet::column::reader::{ColumnReaderImpl, get_typed_column_reader};
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::printer::print_parquet_metadata;

fn print_values<'a, T: DataType>(
  column: usize,
  typed_column_reader: &mut ColumnReaderImpl<'a, T>
) where T: 'static {
  // For now just hard-code the batch size.
  let batch_size = 8;
  let mut actual_values = vec![T::T::default(); batch_size];
  let mut actual_def_levels = vec![i16::default(); batch_size];
  let mut actual_rep_levels = vec![i16::default(); batch_size];
  let mut curr_values_read = 0;
  let mut curr_levels_read = 0;
  loop {
    let (values_read, levels_read) = typed_column_reader.read_batch(
      batch_size,
      Some(&mut actual_def_levels[curr_levels_read..]),
      Some(&mut actual_rep_levels[curr_levels_read..]),
      &mut actual_values[curr_values_read..]
    ).unwrap();
    curr_values_read += values_read;
    curr_levels_read += levels_read;
    if values_read == 0 {
      break;
    }
  }
  println!("Read column {}: {} values ({} levels): {:?}",
    column, curr_values_read, curr_levels_read, actual_values);
}

fn main() {
  println!("Reading Parquet file");
  let file_path = "data/alltypes_plain.snappy.parquet";
  let path = Path::new(file_path);
  let file = File::open(&path).unwrap();
  let parquet_reader = SerializedFileReader::new(file).unwrap();
  let metadata = parquet_reader.metadata();
  let num_row_groups = metadata.num_row_groups();
  print_parquet_metadata(&mut std::io::stdout(), metadata);
  for i in 0..num_row_groups {
    println!("Row group: {}", i);
    let row_group_reader = parquet_reader.get_row_group(i).unwrap();
    let num_columns = row_group_reader.num_columns();
    let row_group_metadata = metadata.row_group(i);
    for j in 0..num_columns {
      // let mut page_reader = row_group_reader.get_column_page_reader(j).unwrap();
      // while let Some(page) = page_reader.get_next_page().unwrap() {
      //   println!("Row group: {}, column: {}, read new page (num values: {}, encoding: {}, data: {:?})",
      //     i, j, page.num_values(), page.encoding(), page.buffer().data());
      // }
      let column_chunk_metadata = row_group_metadata.column(j);
      let column_reader = row_group_reader.get_column_reader(j).unwrap();
      let typed_column_reader = match column_chunk_metadata.column_type() {
        Type::INT32 => {
          Some(get_typed_column_reader::<Int32Type>(column_reader))
        },
        other => {
          // Just testing INT32 columns for now;
          // I will implement this for other columns later.
          println!("! Skip column {}, unknown type: {}", j, other);
          None
        },
      };
      if let Some(mut typed_column_reader) = typed_column_reader {
        print_values(j, &mut typed_column_reader);
      }
    }
  }
}
```
@sunchao I will convert it to 2-space indent when ready :)
@sunchao Could you review this problem? It is also possible that my reading code is wrong. I could submit a pull request if it turns out to be a bug.
@sadikovi Yes, this does seem like a bug... Please file a PR. Thanks!
@sunchao I am happy to do that. Thanks.
Apart from this issue: do you think it is a good idea to add usage examples? I am also very interested in what you think about adding a higher-level API for reading (whether it is a good idea overall) and what ideas you have for such an API, so we would have to write less code to read records (it could also be useful for testing), e.g. Parquet -> JSON.
@sadikovi Yes, I think it's good to add a few examples. Currently the ColumnReaderImpl is not easy to use. We could consider adding a "scanner" type that functions as an iterator over the values in a column. Something like this.
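To make the "scanner" idea concrete, here is a minimal, self-contained sketch of the shape such a type could take. None of this is the crate's actual API: `BatchRead`, `Scanner`, and `VecReader` are all hypothetical names, and the real version would wrap `ColumnReaderImpl::read_batch` (with definition/repetition levels) instead of a toy in-memory reader.

```rust
/// Hypothetical trait standing in for a column reader's batch API.
trait BatchRead {
    type Item: Copy + Default;
    /// Fill `out` with values, returning how many were written (0 = exhausted).
    fn read_batch(&mut self, out: &mut [Self::Item]) -> usize;
}

/// Adapts batch reads into a value-at-a-time iterator.
struct Scanner<R: BatchRead> {
    reader: R,
    buf: Vec<R::Item>,
    pos: usize,
    len: usize,
}

impl<R: BatchRead> Scanner<R> {
    fn new(reader: R, batch_size: usize) -> Self {
        Scanner { reader, buf: vec![R::Item::default(); batch_size], pos: 0, len: 0 }
    }
}

impl<R: BatchRead> Iterator for Scanner<R> {
    type Item = R::Item;
    fn next(&mut self) -> Option<R::Item> {
        if self.pos == self.len {
            // Buffer drained: pull the next batch from the underlying reader.
            self.len = self.reader.read_batch(&mut self.buf);
            self.pos = 0;
            if self.len == 0 {
                return None;
            }
        }
        let v = self.buf[self.pos];
        self.pos += 1;
        Some(v)
    }
}

/// Toy reader over an in-memory vector, just to demonstrate the adapter.
struct VecReader { data: Vec<i32>, offset: usize }

impl BatchRead for VecReader {
    type Item = i32;
    fn read_batch(&mut self, out: &mut [i32]) -> usize {
        let n = out.len().min(self.data.len() - self.offset);
        out[..n].copy_from_slice(&self.data[self.offset..self.offset + n]);
        self.offset += n;
        n
    }
}

fn main() {
    let reader = VecReader { data: vec![6, 7, 8, 9, 10], offset: 0 };
    let values: Vec<i32> = Scanner::new(reader, 2).collect();
    assert_eq!(values, vec![6, 7, 8, 9, 10]);
    println!("{:?}", values);
}
```

With an adapter like this, callers get `for v in scanner { ... }` and all the batch/slice bookkeeping from the example above disappears into one place.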
In the future I'm thinking about adding conversion to the Apache Arrow format, so the read API can support higher-level types such as lists and maps.
@sunchao This sounds great! I am looking forward to the conversion into Apache Arrow format! :)
I am currently working on a Usage/Examples section in the README and wanted to add a couple of code snippets on how to use the library (I hope you do not object to this). While working on it, I found a potential issue when reading columns with PLAIN_DICTIONARY encoding.

I am trying to read the `data/alltypes_plain.snappy.parquet` file that is in the repository with the code attached below. I discovered that this returns empty buffers when I read values. For example, the first column in the Parquet file is "id", which has the values 6 and 7. When I run my code it prints the following:

All returned buffers are empty.

It turns out there are several issues with reading values: when reading a dictionary page we should not increment the number of seen values, because those values belong to the dictionary and we have not read any actual values yet; there is also a problem with the encoding assignment for the data page.
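The first problem (dictionary pages inflating the seen-values count) can be illustrated in isolation. This is only a sketch, not the crate's internals: `Page` and `count_seen_values` are hypothetical names standing in for the reader's page loop, showing that only data pages should contribute to the count.

```rust
// Simplified page model: a dictionary page holds lookup entries,
// a data page holds actual column values (possibly as dictionary indices).
enum Page {
    Dictionary { num_values: usize },
    Data { num_values: usize },
}

/// Count the column values seen so far across a sequence of pages.
fn count_seen_values(pages: &[Page]) -> usize {
    pages
        .iter()
        .map(|p| match p {
            // Dictionary entries are not column values; do not count them.
            Page::Dictionary { .. } => 0,
            Page::Data { num_values } => *num_values,
        })
        .sum()
}

fn main() {
    // A typical PLAIN_DICTIONARY chunk: one dictionary page, then data pages.
    let pages = vec![
        Page::Dictionary { num_values: 2 }, // e.g. dictionary entries [6, 7]
        Page::Data { num_values: 2 },       // the actual "id" values
    ];
    // Counting the dictionary page too would report 4 values and make the
    // reader think it finished before any real values were returned.
    assert_eq!(count_seen_values(&pages), 2);
    println!("seen values: {}", count_seen_values(&pages));
}
```

If the dictionary page is counted, the reader believes it has already consumed the column and `read_batch` returns immediately, which matches the empty buffers described above.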
The diff contains an updated `test_file_reader` test. With these changes I get the following output, which looks correct: