sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Update Thrift IO structs, add file sink #129

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR makes the following changes:

sadikovi commented 6 years ago

@sunchao could you review this PR, please? I am open to suggestions for new names of structs! Couple of questions that I would like to know your opinion on:

Thanks!

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 547


Files with Coverage Reduction New Missed Lines %
util/io.rs 1 98.92%
encodings/encoding.rs 1 94.64%
schema/types.rs 6 96.68%
schema/parser.rs 10 96.04%
file/reader.rs 15 96.57%
<!-- Total: 33 -->
Totals Coverage Status
Change from base Build 533: 0.04%
Covered Lines: 10861
Relevant Lines: 11399

💛 - Coveralls
sunchao commented 6 years ago

Thanks @sadikovi . I think you are right: Cursor<T> seems only useful when T is Vec<u8> or &[u8], so not quite useful for us. I'm inclined for Seek though since we'd eventually need that, and BufReader<File> already implements that. I'm not sure whether we need a trait for getting position - maybe just add a method to FileSink?

sadikovi commented 6 years ago

@sunchao I updated the PR. We do not use Thrift wrappers, and I replaced Position trait with built-in Seek implementation.

But we only support SeekFrom::Current(0) for FileSink, this means returning the current position in the sink. This is because we would have to flush all data to the file, in order to update position. Right now we just return an Err, if the offset requires cursor movements. (Technically we could support SeekFrom::Start(x), when 0 + x == current position, similar to SeekFrom::End(x), when end - x == current position).

Let me know if this is required, I will update the code.

sunchao commented 6 years ago

@sadikovi : could you explain the motivation for tracking cursor position without flushing the internal buffer to file? is this about sharing file handle among multiple sinks?

Also, could we also implement Seek for FileSource? this can be done via a separate PR though.

sadikovi commented 6 years ago

@sunchao My main idea was minimising number of flushes to a file. We need to extract position quite a few times in the writer, so I thought we should make sure that it does not result in any unnecessary IO. Note that we only need to know the current position, we do not seek in the writer.

Let me know what you think.

For FileSource, do we need to implement all SeekFrom::Start, SeekFrom::Current, SeekFrom::End?

sunchao commented 6 years ago

@sadikovi : I see. Sorry for the confusion. In that case, maybe we don't need Seek after all, since we now just use Seek to get the position - perhaps we just need to add a method pos() for the FileSink?

I can work on implementing Seek for FileSource: seems we just need to implement seek(&mut self, pos: SeekFrom).

sadikovi commented 6 years ago

Yes, you are right. I was thinking if we should instead add Position trait, so both FileSource and FileSeek could implement it. They both have a pointer to the current position. Let me know what you think.

The reason for me to have a trait was testing - I could use Cursor<Vec<u8>> for file sink in tests.

sunchao commented 6 years ago

Sounds good @sadikovi . I agree with you.

sadikovi commented 6 years ago

Okay, thanks. Will make the changes and let you know.

sadikovi commented 6 years ago

I updated the code:

sadikovi commented 6 years ago

@sunchao Let me know if we need to change/update anything!

sunchao commented 6 years ago

Looks good @sadikovi . Merged. Thanks!

sadikovi commented 6 years ago

@sunchao Thanks for merging!