Closed sadikovi closed 6 years ago
@sunchao could you review this PR, please? I am open to suggestions for new names of structs! Couple of questions that I would like to know your opinion on:
std::io::Cursor
instead of FileSink
, e.g. Cursor<BufReader<File>>
?std::io::Seek
trait instead of Position
trait?Thanks!
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
util/io.rs | 1 | 98.92% | ||
encodings/encoding.rs | 1 | 94.64% | ||
schema/types.rs | 6 | 96.68% | ||
schema/parser.rs | 10 | 96.04% | ||
file/reader.rs | 15 | 96.57% | ||
<!-- | Total: | 33 | --> |
Totals | |
---|---|
Change from base Build 533: | 0.04% |
Covered Lines: | 10861 |
Relevant Lines: | 11399 |
Thanks @sadikovi . I think you are right: Cursor<T>
seems only useful when T
is Vec<u8>
or &[u8]
, so not quite useful for us. I'm inclined for Seek
though since we'd eventually need that, and BufReader<File>
already implements that. I'm not sure whether we need a trait for getting position - maybe just add a method to FileSink
?
@sunchao I updated the PR. We do not use Thrift wrappers, and I replaced Position
trait with built-in Seek
implementation.
But we only support SeekFrom::Current(0)
for FileSink
, this means returning the current position in the sink. This is because we would have to flush all data to the file, in order to update position. Right now we just return an Err
, if the offset requires cursor movements. (Technically we could support SeekFrom::Start(x)
, when 0 + x == current position, similar to SeekFrom::End(x)
, when end - x == current position).
Let me know if this is required, I will update the code.
@sadikovi : could you explain the motivation for tracking cursor position without flushing the internal buffer to file? is this about sharing file handle among multiple sinks?
Also, could we also implement Seek
for FileSource
? this can be done via a separate PR though.
@sunchao My main idea was minimising number of flushes to a file. We need to extract position quite a few times in the writer, so I thought we should make sure that it does not result in any unnecessary IO. Note that we only need to know the current position, we do not seek in the writer.
Let me know what you think.
For FileSource, do we need to implement all SeekFrom::Start
, SeekFrom::Current
, SeekFrom::End
?
@sadikovi : I see. Sorry for the confusion. In that case, maybe we don't need Seek
after all, since we now just use Seek
to get the position - perhaps we just need to add a method pos()
for the FileSink
?
I can work on implementing Seek
for FileSource
: seems we just need to implement seek(&mut self, pos: SeekFrom)
.
Yes, you are right. I was thinking if we should instead add Position trait, so both FileSource and FileSeek could implement it. They both have a pointer to the current position. Let me know what you think.
The reason for me to have a trait was testing - I could use Cursor<Vec<u8>>
for file sink in tests.
Sounds good @sadikovi . I agree with you.
Okay, thanks. Will make the changes and let you know.
I updated the code:
Position
trait that both FileSource
and FileSink
implement.start
and end
in FileSource to be u64
, because one could potentially start with large offset within the file. Note that end - start <= usize
, so that it is safe to compare with buf.len()
.@sunchao Let me know if we need to change/update anything!
Looks good @sadikovi . Merged. Thanks!
@sunchao Thanks for merging!
This PR makes the following changes:
FileSink
, which is a buffered wrapper on astd::fs::File
and also tracks position without requiring the flush data into a file.FileChunk
intoFileSource
TMemoryBuffer
, as we do not really need it.