minio / minio-cpp

MinIO C++ Client SDK for Amazon S3 Compatible Cloud Storage
https://minio-cpp.min.io/
Apache License 2.0
135 stars 56 forks source link

Replace the filename type of DownloadObjectArgs and UploadObjectArgs #149

Open z-pc opened 4 months ago

z-pc commented 4 months ago

Hi. I have a problem with the filename attribute type of DownloadObjectArgs and UploadObjectArgs. The current type is std::string, this leads to a file with utf-8 characters in the path that cannot read. Example path: G:\\Tuấn Anh\tết.txt. So, I suggest replacing std::string with std::filesystem::path Thanks.

kobalicek commented 4 months ago

I think this is not fixing the root cause of the problem though. The C++ std::filesystem is kinda weird when it comes to UTF-8 encoding. In the past char* and std::string were considered to be using local encoding (which meant using code-pages on Windows), and this haven't changed, just Linux and others went with UTF-8 for all locales, but not Windows (Windows has unicode-aware API that uses wchar_t instead).

Changing std::string to std::filesystem::path may look as a solution, but I think it's not the right one. Maybe using std::u8string instead of std::string would be much better, but then we will hit more problems like poor support for std::u8string in the rest of the C++ standard library.

Not sure what would be the best thing to do in this case, I would like to just tell C++'s filesystem that we have indeed an UTF-8 encoded string to avoid going through mbtowc.

z-pc commented 4 months ago

@kobalicek I agree with you, u8string() is not yet widely used. It can cause more string-handling problems in this repository, I think. Above all, we will have big problems with std::fstream.open(...) if u8string() is used as standard.

Currently, I only use u8string() with the object attribute:

std::filesystem::path file = L"G:\\Tuấn Anh\\tết.txt";
  minio::s3::UploadObjectArgs args;
  args.bucket = "storage";
  args.object = file.filename().u8string();

I'm also not sure which is better. However, it seems we don't have much choice.

kobalicek commented 4 months ago

If you use L"G:\\Tuấn Anh\\tết.txt" it means you are using wchar_t* string and not char* string. wchar_t has a defined encoding by platform, but char doesn't always mean UTF-8 and the problem here is UTF-8 vs 8-bit locale used by Windows.

When it comes to Windows the library code would most likely use WideChar API under the hood, so using wchar_t string seems okay, but I think we should not expose wchar_t in public API as it's not a that common outside Windows (unless you use .NET/Java that exposes UTF-16 string as part of the runtime).

I'm still not sure about the best solution here, whether exposing std::filesystem in public API makes sense or not.

harshavardhana commented 4 months ago

what does AWS CPP library do in this scenario? @kobalicek

z-pc commented 4 months ago

I think we should not expose wchar_t in public API

I think so, too. It's a hard change for current users.

kobalicek commented 4 months ago

what does AWS CPP library do in this scenario? @kobalicek

I have no idea.

Usually when you want UTF-8 file naming on Windows you would convert your UTF-8 string to Windows WideChar (UTF-16) and use WinAPI unicode-aware functions to access the file. And here the problem is file names on Windows platform can even have invalid surrogate pairs, etc... as the API doesn't validate the names, it just wants a sequence of 16-bit chars (similarly Linux API just wants a sequence of bytes). This means that if you use a regular validating UTF-8 -> UTF-16 conversion it could refuse your "invalid" file names. And quite frankly, I have no idea how C++ std::filesystem even deals with that - could be implementation specific, etc...

BTW check this issue regarding path names on Windows:

This means that under windows we would want to use WTF-8 encoding instead of UTF-8:

That's why I don't really know the best way of solving this in minio-cpp. Ideally we should do the conversion ourselves to be sure, but honestly that just sounds like a lot of work (basically to write the code and have it tested). Depending on std::filesystem is another option, but like everything in C++ the standard could just say this is implementation specific, which would mean that it's not guaranteed to work.

balamurugana commented 4 months ago

Looks like there is no simple solution available. Why can't we skip supporting such special use case in {Upload,Download}Object methods? There is always a working APIs available like {Get, Put}Object methods. {Upload,Download}Object are just wrapper to these methods with additional file open/close. Let the special users use {Get,Put}Object methods.

harshavardhana commented 4 months ago

Looks like there is no simple solution available. Why can't we skip supporting such special use case in {Upload,Download}Object methods? There is always a working APIs available like {Get, Put}Object methods. {Upload,Download}Object are just wrapper to these methods with additional file open/close. Let the special users use {Get,Put}Object methods.

yeah since this is a language specific problem, we can make this assumption and also add a doc line on what to do when a user wishes to deal with UTF-8.

z-pc commented 4 months ago

Using Get|Put is fine, but have a lot of work to do for self-handle. At first, I just thought Upload|DownloadObject needs a parameter type to pass to std::fstream(...), and std::filesystem is well supported.

balamurugana commented 4 months ago

Using Get|Put is fine, but have a lot of work to do for self-handle. At first, I just thought Upload|DownloadObject needs a parameter type to pass to std::fstream(...), and std::filesystem is well supported.

  1. std::fstream(...) is already supported by {Get,Put}Object APIs. There is no point to support in {Upload,Download}Object APIs.
  2. I am unable to understand why usage of {Get,Put}Object is hard.
    std::ifstream file(filename);
    // construct PutObjectArgs
    PutObjectResponse resp = PutObject(args);
    file.close();
    // Use resp