powturbo / Turbo-Base64

Turbo Base64 - Fastest Base64 SIMD:SSE/AVX2/AVX512/Neon/Altivec - Faster than memcpy!
GNU General Public License v3.0
277 stars, 41 forks

invalid UTF-8 bytes after "=" "==" #23

Open Lemonononon opened 11 months ago

Lemonononon commented 11 months ago

Hi @powturbo, thanks for your great work! I recently found that when I use this library to encode image files, strange characters appear at the end of the output. Printed out, they show up as 'NULL' or as seemingly random bytes, for example "7mlbMjdKxLobZAOx6jFekoqMbHg==�#��+Z��8Z�s)��k_H���pd�?���Ծ " and "Px/wA7sn4uWWf/AAj/AA3/ALQooor0Yg==NULLNULLNULL".

My code:

std::ifstream ifs(file_path, std::ios::binary);
if (!ifs.is_open()) {
    std::cerr << "Unable to open file: " << file_path << std::endl;
}

ifs.seekg(0, std::ios::end);
auto size = ifs.tellg();
ifs.seekg(0, std::ios::beg);

// Read the file content into a char buffer
auto buf = new unsigned char[size];
ifs.read((char *) buf, size);

//use turbobase64
auto outsize = tb64enclen(size);
auto out = new uint8_t[outsize];

size_t num_enc = tb64enc(buf, size, out); //error handle

out[num_enc] = 0;

std::string str_encode(out, out + num_enc);

std::cout << str_encode << std::endl;

I'm confused. Shouldn't the size of a string converted to Base64 be fixed? Why are unknown characters appearing?

powturbo commented 11 months ago

The output size is fixed at ((input_size + 2)/3 * 4). Since you write a 0 at the end of the buffer with out[num_enc] = 0, you must allocate one extra byte: auto out = new uint8_t[outsize + 1];
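To make the sizing rule concrete, here is a minimal sketch of the allocation with room for the terminator. It is an illustration under stated assumptions, not the library's code: `scalar_b64enc` is a hypothetical plain scalar encoder standing in for `tb64enc` (same in, inlen, out argument order as the call in the question) so that the sketch compiles without Turbo-Base64, and `b64_encoded_len` and `encode_to_string` are likewise illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encoded length for n input bytes: ((n + 2) / 3) * 4, as quoted above.
inline std::size_t b64_encoded_len(std::size_t n) {
    return (n + 2) / 3 * 4;
}

// Hypothetical stand-in for tb64enc (NOT the library's implementation):
// plain scalar Base64 that returns the number of bytes written and
// does not write a terminator.
inline std::size_t scalar_b64enc(const unsigned char* in, std::size_t inlen,
                                 unsigned char* out) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::size_t o = 0, i = 0;
    for (; i + 3 <= inlen; i += 3) {            // full 3-byte groups
        std::uint32_t v = (std::uint32_t(in[i]) << 16) |
                          (std::uint32_t(in[i + 1]) << 8) | in[i + 2];
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = tbl[(v >> 6) & 63];
        out[o++] = tbl[v & 63];
    }
    const std::size_t rem = inlen - i;          // 0, 1 or 2 trailing bytes
    if (rem != 0) {
        std::uint32_t v = std::uint32_t(in[i]) << 16;
        if (rem == 2) v |= std::uint32_t(in[i + 1]) << 8;
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = (rem == 2) ? tbl[(v >> 6) & 63] : '=';
        out[o++] = '=';
    }
    return o;
}

// Encode into a buffer that has one spare byte for the '\0' terminator.
inline std::string encode_to_string(const std::vector<unsigned char>& data) {
    const std::size_t outsize = b64_encoded_len(data.size());
    std::vector<unsigned char> out(outsize + 1);        // +1 for the terminator
    const std::size_t n = scalar_b64enc(data.data(), data.size(), out.data());
    assert(n == outsize);                               // length is fixed by the formula
    out[n] = 0;                                         // in bounds thanks to the +1
    return std::string(out.begin(), out.begin() + n);   // exactly n characters
}
```

If the result is only ever used as a `std::string` built from an explicit length, the terminator (and the extra byte) is not needed at all; the +1 matters only when the raw buffer is treated as a C string.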

Lemonononon commented 11 months ago

@powturbo Thank you! I had already noticed this and tried [output_size+1], but I still couldn't get the expected result. What I meant is that the length of the entire string (including the non-UTF-8 characters after the ==) equals ((input_size + 2)/3 * 4).

Afterwards, I added an identical .cpp file directly to the library source tree and compiled it there, and the result was correct. I then found that with the locally installed static library (cmake .. && make install), only the results from tb64senc are correct, as shown in the following image. After I added 'set(BUILD_SHARED_LIBS ON)' to CMakeLists.txt to build a shared library, all the results were correct. (I reproduced this on two computers running Ubuntu.) My problem is resolved now, but I'm still confused. I'll do my best to provide whatever information I have.

[image: a45b371f8f59447da86f4c2e97168b6]

powturbo commented 11 months ago

There are no separate functions for static and dynamic linking, so I wonder why you're getting different sizes depending on the linking mode. In any case, the correct size is ((input_size + 2)/3 * 4). Base64 characters are all ASCII, and each one is encoded as the same single byte in UTF-8. You can decode the Base64 output and check it against the original input.
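The check suggested above can be sketched as follows. This is only an illustration: `ref_encode` and `ref_decode` are hypothetical scalar reference routines written for this example (they are not Turbo-Base64 functions), and `is_b64_text` and `round_trip_ok` are illustrative names. The idea is to verify three things about an encoded string: its length matches the formula, every byte is in the Base64 alphabet or is '=' padding, and decoding gives back the original bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

static const char kB64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// True if every character is in the Base64 alphabet or is '=' padding.
inline bool is_b64_text(const std::string& s) {
    for (unsigned char c : s) {
        const bool ok = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                        (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=';
        if (!ok) return false;
    }
    return true;
}

// Hypothetical reference encoder (plain scalar, for comparison only).
inline std::string ref_encode(const std::vector<unsigned char>& in) {
    std::string out;
    std::size_t i = 0;
    for (; i + 3 <= in.size(); i += 3) {
        std::uint32_t v = (std::uint32_t(in[i]) << 16) |
                          (std::uint32_t(in[i + 1]) << 8) | in[i + 2];
        out += kB64[(v >> 18) & 63];
        out += kB64[(v >> 12) & 63];
        out += kB64[(v >> 6) & 63];
        out += kB64[v & 63];
    }
    const std::size_t rem = in.size() - i;
    if (rem != 0) {
        std::uint32_t v = std::uint32_t(in[i]) << 16;
        if (rem == 2) v |= std::uint32_t(in[i + 1]) << 8;
        out += kB64[(v >> 18) & 63];
        out += kB64[(v >> 12) & 63];
        out += (rem == 2) ? kB64[(v >> 6) & 63] : '=';
        out += '=';
    }
    return out;
}

// Hypothetical reference decoder: stops at the first '=' and
// returns false if it meets a character outside the alphabet.
inline bool ref_decode(const std::string& s, std::vector<unsigned char>& out) {
    out.clear();
    std::uint32_t acc = 0;
    int bits = 0;
    for (unsigned char c : s) {
        if (c == '=') break;
        std::uint32_t val;
        if (c >= 'A' && c <= 'Z') val = c - 'A';
        else if (c >= 'a' && c <= 'z') val = c - 'a' + 26;
        else if (c >= '0' && c <= '9') val = c - '0' + 52;
        else if (c == '+') val = 62;
        else if (c == '/') val = 63;
        else return false;
        acc = (acc << 6) | val;
        bits += 6;
        if (bits >= 8) {                 // a full byte is available
            bits -= 8;
            out.push_back(static_cast<unsigned char>((acc >> bits) & 0xFF));
        }
    }
    return true;
}

// Length, alphabet and round-trip check for an encoded string.
inline bool round_trip_ok(const std::vector<unsigned char>& original,
                          const std::string& encoded) {
    if (encoded.size() != (original.size() + 2) / 3 * 4) return false;
    if (!is_b64_text(encoded)) return false;
    std::vector<unsigned char> decoded;
    return ref_decode(encoded, decoded) && decoded == original;
}
```

A string that has the right total length but carries stray bytes after the padding, as in the samples in the first comment, fails the alphabet check even though its size matches the formula.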