jorgecarleitao / parquet2

Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Write bloom filters #213

Open ozgrakkurt opened 1 year ago

ozgrakkurt commented 1 year ago

This PR mentions that writing bloom filters requires big changes: https://github.com/jorgecarleitao/parquet2/pull/99. But this seems like an important feature to implement for performance. How doable is it in the current state of the library? I would like to work on it if possible.

ozgrakkurt commented 1 year ago

Hey! @jorgecarleitao can you give guidance on this? I started working on it. What I came up with is something like this:

/// Creates a bloom filter from the bitset and writes it into the `writer`.
pub fn write<W: Write + Seek>(
    column_metadata: &mut ColumnChunkMetaData,
    writer: &mut W,
    bitset: &[u8],
) -> Result<(), Error> {
    // create bloom filter header
    // create a TCompactOutputProtocol to serialize the bloom filter header
    // write the offset to column_metadata
    // write the bloom filter to the writer
    todo!()
}

Does it look correct?

Edit: actually, I found that it should be something like this:

/// Creates a bloom filter from the bitset and writes it into the `protocol`.
pub fn write(
    protocol: &mut TCompactOutputProtocol,
    bitset: &[u8],
) -> Result<(), Error> {
    // create bloom filter header
    // serialize the bloom filter header with the TCompactOutputProtocol
    // write the offset to column_metadata
    // write the bloom filter to the protocol
    todo!()
}
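
For reference, here is a minimal sketch of the steps listed in the comments, using only the standard library. It assumes the header has already been Thrift-compact-serialized into bytes elsewhere, so `header_bytes` and the function name `write_bloom_filter` are hypothetical and not part of parquet2's API. The point it illustrates is the ordering: capture the current position first (that is the offset the column metadata would need to record), then write the header, then the raw bitset.

```rust
use std::io::{Result, Seek, Write};

/// Hypothetical helper, not parquet2's API: writes an already-serialized
/// bloom filter header followed by the raw bitset, and returns the offset
/// at which the header starts.
pub fn write_bloom_filter<W: Write + Seek>(
    writer: &mut W,
    header_bytes: &[u8], // assumption: Thrift-compact bytes produced elsewhere
    bitset: &[u8],
) -> Result<u64> {
    // Capture the position before writing anything; this is the value the
    // caller would store as the bloom filter offset in the column metadata.
    let offset = writer.stream_position()?;
    // The header goes first ...
    writer.write_all(header_bytes)?;
    // ... immediately followed by the bitset.
    writer.write_all(bitset)?;
    Ok(offset)
}
```

Returning the offset instead of mutating the column metadata inside the function keeps the sketch independent of any metadata type; whether that split fits the library's design is an open question for the maintainers.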