mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

C++ API documentation #30

Open marekkokot opened 9 months ago

marekkokot commented 9 months ago

Hi, this is a great project, congratulations! Is there C++ API documentation somewhere? I was able to use the code from C++ by following your test code, but proper documentation would be excellent.

Best,
Marek

mxmlnkn commented 9 months ago

No, there is no API documentation yet. There are some Doxygen comments, but not for everything, and I guess they don't help much if you don't know where to look. The many inline implementations are probably also distracting. I have also never run Doxygen on this project, so the comments might not even compile to something readable. I should probably at least add an example to the C++ library section, which is currently rather sparse.

The important class is ParallelGzipReader, but because its constructor only accepts a UniqueFileReader, I would also have to document the FileReader interface and all its derived classes.

/**
 * @note Calls to this class are not thread-safe! Even though they use threads to evaluate them in parallel.
 */
template<typename T_ChunkData = ChunkData,
         bool ENABLE_STATISTICS = false>
class ParallelGzipReader final :
    public FileReader
{
public:
    using ChunkData = T_ChunkData;
    using WriteFunctor = std::function<void ( const std::shared_ptr<ChunkData>&, size_t, size_t )>;
    using Window = WindowMap::Window;

public:
    explicit
    ParallelGzipReader( UniqueFileReader fileReader,
                        size_t           parallelization = 0,
                        uint64_t         chunkSizeInBytes = 4_Mi );

    ~ParallelGzipReader();

    /**
     * @note Only will work if ENABLE_STATISTICS is true.
     */
    void
    setShowProfileOnDestruction( bool showProfileOnDestruction );

    /* FileReader overrides */

    [[nodiscard]] UniqueFileReader
    clone() const override;

    [[nodiscard]] int
    fileno() const override;

    [[nodiscard]] bool
    seekable() const override;

    void
    close() override;

    [[nodiscard]] bool
    closed() const override;

    [[nodiscard]] bool
    eof() const override;

    [[nodiscard]] bool
    fail() const override;

    [[nodiscard]] size_t
    tell() const override;

    [[nodiscard]] std::optional<size_t>
    size() const override;

    void
    clearerr() override;

    [[nodiscard]] size_t
    read( char*  outputBuffer,
          size_t nBytesToRead ) override;

    /* Simpler file reader interface for Python-interfacing */

    size_t
    read( const int    outputFileDescriptor = -1,
          char* const  outputBuffer         = nullptr,
          const size_t nBytesToRead         = std::numeric_limits<size_t>::max() );

    size_t
    read( const WriteFunctor& writeFunctor,
          const size_t        nBytesToRead = std::numeric_limits<size_t>::max() );

    size_t
    seek( long long int offset,
          int           origin = SEEK_SET ) override;

    /* Block compression specific methods */

    [[nodiscard]] bool
    blockOffsetsComplete() const;

    /**
     * @return vectors of block data: offset in file, offset in decoded data
     *         (cumulative size of all prior decoded blocks).
     */
    [[nodiscard]] std::map<size_t, size_t>
    blockOffsets();

    /**
     * This is the first instance for me where returning a const value makes sense because it contains
     * a shared pointer to the WindowMap, which is not to be modified. Making GzipIndex const forces
     * the caller to deep clone the index and WindowMap for, e.g., the setBlockOffsets API, which
     * destructively moves from the WindowMap.
     */
    [[nodiscard]] const GzipIndex
    gzipIndex();

    /**
     * Same as @ref blockOffsets but it won't force calculation of all blocks and simply returns
     * what is available at call time.
     * @return vectors of block data: offset in file, offset in decoded data
     *         (cumulative size of all prior decoded blocks).
     */
    [[nodiscard]] std::map<size_t, size_t>
    availableBlockOffsets() const;

    [[nodiscard]] auto
    statistics() const;

    void
    setCRC32Enabled( bool enabled );

    void
    setMaxDecompressedChunkSize( uint64_t maxDecompressedChunkSize );

    [[nodiscard]] uint64_t
    maxDecompressedChunkSize() const noexcept;
};

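
To illustrate how a map shaped like the return value of blockOffsets() could be used for random access, here is a minimal sketch. The helper `findBlockContaining` is hypothetical and not part of rapidgzip. It assumes only what the doc comment above states (offset in file mapped to offset in decoded data) plus the assumption that decoded offsets grow together with file offsets. It does not check whether the requested offset lies past the end of the data.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <utility>

// Hypothetical helper, not part of rapidgzip: given a map shaped like the
// result of blockOffsets() (offset in file -> offset in decoded data), return
// the entry of the last block that starts at or before `decodedOffset`.
// Assumption: decoded offsets increase together with file offsets.
std::optional<std::pair<std::size_t, std::size_t> >
findBlockContaining( const std::map<std::size_t, std::size_t>& blockOffsets,
                     std::size_t                               decodedOffset )
{
    std::optional<std::pair<std::size_t, std::size_t> > result;
    for ( const auto& [fileOffset, blockDecodedOffset] : blockOffsets ) {
        if ( blockDecodedOffset > decodedOffset ) {
            break;  // This block starts after the requested position.
        }
        result = std::make_pair( fileOffset, blockDecodedOffset );
    }
    return result;  // Empty if no block starts at or before decodedOffset.
}
```

With a real reader, the map would come from reader.blockOffsets() or reader.availableBlockOffsets(). For actually jumping to a position, reader.seek( offset ) followed by reader.read( ... ) is the interface shown in the class above.
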
marekkokot commented 9 months ago

Hi,

Thanks! This is what I figured out. Here is my example code; maybe it will be useful for someone, and maybe you can check whether it's correct:

#include <cstdlib>   // std::atoi
#include <iostream>
#include <memory>    // std::make_unique
#include <vector>

#include <ParallelGzipReader.hpp>

// Compile command:
// g++ -std=c++20 -O3 -I rapidgzip  -I core -fconstexpr-ops-limit=99000100 main.cpp -lz
int main(int argc, char** argv) {
    if (argc < 4) {
        std::cerr << "Usage: " << argv[0] << " <input_file> <n_threads> <reader_chunk_size_in_MB>\n";
        return 1;
    }

    // The reader takes ownership of the underlying file reader.
    UniqueFileReader file_reader = std::make_unique<StandardFileReader>(argv[1]);
    const size_t n_threads = std::atoi(argv[2]);
    const size_t reader_chunk_size = std::atoi(argv[3]) * (1ull << 20);

    rapidgzip::ParallelGzipReader<> reader(std::move(file_reader), n_threads, reader_chunk_size);

    //reader.setCRC32Enabled( true );

    // Output buffer for decompressed data (16 MiB), unrelated to the reader chunk size.
    const size_t chunk_size = 1ull << 24;
    std::vector<char> chunk(chunk_size);

    // Read until the reader returns 0 bytes and forward everything to stdout.
    while (true) {
        const auto n_bytes_read = reader.read(chunk.data(), chunk_size);
        if (n_bytes_read == 0) {
            break;
        }
        std::cout.write(chunk.data(), n_bytes_read);
    }

    std::cerr << "eof?: " << reader.eof() << "\n";
    return 0;
}

I was a little surprised by the necessity of using -fconstexpr-ops-limit. Also, I found the method setCRC32Enabled. I guess the CRC is checked when it is enabled and not checked otherwise? I have seen #5, so I am not sure whether the CRC check is implemented now or not. Is there any overhead from the CRC?

Do you know if this code compiles with Visual Studio on Windows?

mxmlnkn commented 9 months ago

> maybe you may check if it's correct:

Looks correct.

> I was a little surprised by the necessity of using -fconstexpr-ops-limit

Yeah, I guess this should be added to the documentation. Some heavy lookup tables are computed at compile time. That is why the flag is necessary and why the binary is rather large.
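
As a rough illustration of what compile-time lookup table computation looks like, here is a minimal sketch that builds a CRC-32 table in a constexpr function (C++17). It is not taken from rapidgzip, whose tables are different and far larger; the names `makeCrc32Table`, `CRC32_TABLE`, and `computeCrc32` are made up for this example. Every loop iteration is evaluated by the compiler and counts toward its constexpr evaluation budget, which is the limit that -fconstexpr-ops-limit raises in GCC.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Illustrative only: CRC-32 table for the reflected polynomial 0xEDB88320
// (the CRC used by gzip), computed entirely at compile time.
constexpr std::array<std::uint32_t, 256>
makeCrc32Table()
{
    std::array<std::uint32_t, 256> table{};
    for ( std::uint32_t i = 0; i < 256; ++i ) {
        std::uint32_t value = i;
        for ( int bit = 0; bit < 8; ++bit ) {
            value = ( value & 1U ) != 0 ? ( value >> 1U ) ^ 0xEDB88320U : value >> 1U;
        }
        table[i] = value;
    }
    return table;
}

// The compiler runs the 256 * 8 iterations above while compiling this line.
constexpr auto CRC32_TABLE = makeCrc32Table();
static_assert( CRC32_TABLE[1] == 0x77073096U );

// Table-driven CRC-32 over a byte sequence, one table lookup per byte.
constexpr std::uint32_t
computeCrc32( std::string_view data )
{
    std::uint32_t crc = 0xFFFFFFFFU;
    for ( const char c : data ) {
        crc = CRC32_TABLE[( crc ^ static_cast<unsigned char>( c ) ) & 0xFFU] ^ ( crc >> 8U );
    }
    return crc ^ 0xFFFFFFFFU;
}
```

A table this small stays well within the default limits. Tables with many more entries or more work per entry are what exceed the default budget and also end up embedded in the binary.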

> Also, I found this method setCRC32Enabled, I guess with this enabled CRC is checked, and without it not? I have seen #5 so I am not sure if CRC is now implemented or not? Is there any overhead of CRC?

#5 is done: "Added with 08b453f. It adds ~5-6% overhead."

The CRC32 computation has been turned on by default since 4567fe16, i.e., since 0.11.0.
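
For context on what such a check compares against: per the gzip format (RFC 1952), each gzip member ends with an 8-byte footer holding the CRC-32 of the uncompressed data followed by its size modulo 2^32, both little-endian. The following is a minimal sketch of reading that footer. `GzipFooter` and `parseGzipFooter` are hypothetical names and not rapidgzip API, and the sketch assumes the buffer ends exactly at the end of a single gzip member.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical type for this example, not part of rapidgzip.
struct GzipFooter
{
    std::uint32_t crc32;             // CRC-32 of the uncompressed data
    std::uint32_t uncompressedSize;  // ISIZE: uncompressed size modulo 2^32
};

// Assemble a 32-bit value from four little-endian bytes.
std::uint32_t
readLittleEndian32( const std::uint8_t* bytes )
{
    return static_cast<std::uint32_t>( bytes[0] )
           | ( static_cast<std::uint32_t>( bytes[1] ) << 8U )
           | ( static_cast<std::uint32_t>( bytes[2] ) << 16U )
           | ( static_cast<std::uint32_t>( bytes[3] ) << 24U );
}

// Assumption: `member` ends exactly at the end of one gzip member.
std::optional<GzipFooter>
parseGzipFooter( const std::vector<std::uint8_t>& member )
{
    constexpr std::size_t FOOTER_SIZE = 8;
    if ( member.size() < FOOTER_SIZE ) {
        return std::nullopt;  // Too short to contain a footer.
    }
    const auto* const footer = member.data() + member.size() - FOOTER_SIZE;
    return GzipFooter{ readLittleEndian32( footer ), readLittleEndian32( footer + 4 ) };
}
```

A decompressor that verifies the checksum computes the CRC-32 of the data it produced and compares it with the stored `crc32` value.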

> Do you know if this code compiles under Windows Visual Studio?

It should. The Python packages are compiled for Windows using GitHub Actions. Similar compile arguments are necessary there: https://github.com/mxmlnkn/rapidgzip/blob/8444017524230b7ba6836d06ae9ef6893a07a6a9/python/rapidgzip/setup.py#L267

karelbilek commented 3 months ago

@marekkokot thx for this code!