simdutf / simdutf

Unicode routines (UTF8, UTF16, UTF32) and Base64: billions of characters per second using SSE2, AVX2, NEON, AVX-512, RISC-V Vector Extension. Part of Node.js, WebKit/Safari, Ladybird, Cloudflare Workers and Bun.
https://simdutf.github.io/simdutf/
Apache License 2.0

create higher level base64 functions #377

Open lemire opened 7 months ago

lemire commented 7 months ago

We should also provide base64_to_binary(const char* input) -> std::vector, which calculates the safe size and allocates memory internally. Or maybe something like base64_to_binary(const char* input, cont: &Container), with a static_assert that the Container has a resize method. (credit: @WojciechMula)
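A minimal sketch of the container-based variant, to make the idea concrete. Everything here is an assumption for illustration: the names `has_resize`, `safe_binary_length`, `scalar_decode` and the `std::string_view` signature are hypothetical, and the scalar decoder is only a stand-in for the library's SIMD routine (standard alphabet, no whitespace handling).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

// Detects whether Container has a resize(size_t) member (hypothetical helper).
template <typename C, typename = void>
struct has_resize : std::false_type {};
template <typename C>
struct has_resize<C, std::void_t<decltype(std::declval<C&>().resize(std::size_t{}))>>
    : std::true_type {};

// Upper bound on decoded size: every 4 input characters yield at most 3 bytes.
inline std::size_t safe_binary_length(std::size_t base64_length) {
    return (base64_length + 3) / 4 * 3;
}

// Scalar stand-in for the real decoder. Returns false on an invalid character.
inline bool scalar_decode(std::string_view in, char* out, std::size_t& written) {
    std::uint32_t acc = 0;
    int bits = 0;
    written = 0;
    for (char c : in) {
        std::uint32_t v;
        if (c >= 'A' && c <= 'Z') v = static_cast<std::uint32_t>(c - 'A');
        else if (c >= 'a' && c <= 'z') v = static_cast<std::uint32_t>(c - 'a' + 26);
        else if (c >= '0' && c <= '9') v = static_cast<std::uint32_t>(c - '0' + 52);
        else if (c == '+') v = 62;
        else if (c == '/') v = 63;
        else if (c == '=') break;  // padding ends the payload
        else return false;
        acc = (acc << 6) | v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[written++] = static_cast<char>((acc >> bits) & 0xFF);
        }
    }
    return true;
}

// Sizes the container to the safe bound, decodes, then shrinks to the real size.
template <typename Container>
bool base64_to_binary(std::string_view input, Container& out) {
    static_assert(has_resize<Container>::value, "Container must provide resize()");
    static_assert(sizeof(typename Container::value_type) == 1,
                  "Container must hold byte-sized elements");
    out.resize(safe_binary_length(input.size()));
    std::size_t written = 0;
    const bool ok = scalar_decode(input, reinterpret_cast<char*>(out.data()), written);
    out.resize(ok ? written : 0);
    return ok;
}
```

With this shape, a caller can pass a `std::vector<char>`, a `std::vector<std::uint8_t>` or a `std::string` and never compute the buffer size by hand; a type without `resize` fails at compile time with a readable message.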

WojciechMula commented 7 months ago

I was thinking a little about the API. My generic proposal is to provide a convenient wrapper that works incrementally: the user provides partial data (such as an input buffer filled while reading from a file) and an output buffer of fixed size. Usage for decoding would be something like:

Base64Decoder decoder;  // proposed class, not an existing API

std::string input;
input.resize(32 * 1024);

std::string output;
output.resize(16 * 1024);
while (/* more input available */) {
    // read a few kilobytes of data into `input`

    const size_t bytes_stored = decoder.decode(input.data(), input.size(), output.data(), output.size());
    // bytes_stored will never be greater than output.size()

    write(output.data(), bytes_stored);

    if (/* input file reached EOF */) {
        // drain whatever the decoder still holds internally
        while (decoder.pending_output()) {
            const size_t bytes_flushed = decoder.flush(output.data(), output.size());
            write(output.data(), bytes_flushed);
        }
    }
}

Of course this flexibility comes at the cost of performance, but my gut feeling is that if somebody wants to process data in chunks, then the problem is likely I/O bound.
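To illustrate how such an incremental decoder could behave, here is a rough scalar sketch. It is not the library's implementation: the class name `IncrementalBase64Decoder` and its internals are assumptions, it only mirrors the `decode` / `flush` / `pending_output` calls from the proposal above, and it silently skips padding, whitespace and invalid characters rather than reporting errors.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

class IncrementalBase64Decoder {
public:
    // Decodes `in_len` characters into an internal queue, then copies at most
    // `out_cap` decoded bytes to `out`. Returns the number of bytes copied.
    std::size_t decode(const char* in, std::size_t in_len,
                       char* out, std::size_t out_cap) {
        for (std::size_t i = 0; i < in_len; ++i) {
            const int v = value_of(in[i]);
            if (v < 0) continue;  // sketch only: ignore '=', whitespace, invalid input
            acc_ = (acc_ << 6) | static_cast<std::uint32_t>(v);
            bits_ += 6;
            if (bits_ >= 8) {
                bits_ -= 8;
                pending_.push_back(static_cast<char>((acc_ >> bits_) & 0xFF));
            }
        }
        return flush(out, out_cap);
    }

    // Copies up to `out_cap` bytes of already-decoded data to `out`.
    std::size_t flush(char* out, std::size_t out_cap) {
        const std::size_t n = std::min(out_cap, pending_.size());
        std::memcpy(out, pending_.data(), n);
        pending_.erase(0, n);
        return n;
    }

    // True while decoded bytes are still waiting to be flushed.
    bool pending_output() const { return !pending_.empty(); }

private:
    // Maps a character of the standard alphabet to its 6-bit value, or -1.
    static int value_of(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
    }

    std::string pending_;    // decoded bytes not yet handed to the caller
    std::uint32_t acc_ = 0;  // bit accumulator carried across chunks
    int bits_ = 0;           // number of valid low bits in acc_
};
```

Because the bit accumulator persists between calls, input chunks may be split at any position, including in the middle of a four-character group. The internal queue is what costs performance here; a real implementation would likely decode directly into the caller's buffer whenever it is large enough.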

WojciechMula commented 7 months ago

Another thing for base64 encoding: it would be practical to allow wrapping the output, for instance:

const size_t max_line_length = 72;
const char* separator = "\n";
encode(input, output, max_line_length, separator);

Again, nobody would expect that this variant will be as fast as the plain encoding.
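As a sketch of what the wrapping variant might look like, the function below encodes with a scalar loop and inserts the separator whenever a line is full. The name `base64_encode_wrapped`, the `std::string` return type and the convention that `max_line_length == 0` disables wrapping are all assumptions made for this example, not an existing API.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Encodes `input` as standard base64, inserting `separator` after every
// `max_line_length` output characters (no trailing separator is added).
std::string base64_encode_wrapped(std::string_view input,
                                  std::size_t max_line_length,
                                  std::string_view separator) {
    static constexpr char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    std::size_t line_length = 0;

    // Appends one output character, starting a new line first if needed.
    auto emit = [&](char c) {
        if (max_line_length != 0 && line_length == max_line_length) {
            out.append(separator);
            line_length = 0;
        }
        out.push_back(c);
        ++line_length;
    };

    std::size_t i = 0;
    // Full 3-byte groups become 4 characters each.
    for (; i + 3 <= input.size(); i += 3) {
        const unsigned b0 = static_cast<unsigned char>(input[i]);
        const unsigned b1 = static_cast<unsigned char>(input[i + 1]);
        const unsigned b2 = static_cast<unsigned char>(input[i + 2]);
        emit(alphabet[b0 >> 2]);
        emit(alphabet[((b0 & 0x03) << 4) | (b1 >> 4)]);
        emit(alphabet[((b1 & 0x0F) << 2) | (b2 >> 6)]);
        emit(alphabet[b2 & 0x3F]);
    }

    // Trailing 1 or 2 bytes are padded with '='.
    const std::size_t rest = input.size() - i;
    if (rest == 1) {
        const unsigned b0 = static_cast<unsigned char>(input[i]);
        emit(alphabet[b0 >> 2]);
        emit(alphabet[(b0 & 0x03) << 4]);
        emit('=');
        emit('=');
    } else if (rest == 2) {
        const unsigned b0 = static_cast<unsigned char>(input[i]);
        const unsigned b1 = static_cast<unsigned char>(input[i + 1]);
        emit(alphabet[b0 >> 2]);
        emit(alphabet[((b0 & 0x03) << 4) | (b1 >> 4)]);
        emit(alphabet[(b1 & 0x0F) << 2]);
        emit('=');
    }
    return out;
}
```

The per-character line check is the obvious source of slowdown. A faster design would probably encode whole lines with the plain encoder and copy the separator between them, which only works cleanly when the line length is a multiple of four.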

lemire commented 7 months ago

@WojciechMula I'm pinging you later today as I have a major upgrade to the base64 support, with a slightly improved API.

lemire commented 7 months ago

Please see https://github.com/simdutf/simdutf/pull/382 where the base64 API was slightly extended (i.e., we have _safe functions).