Open z-pc opened 4 months ago
I think this is not fixing the root cause of the problem though. The C++ std::filesystem
is kinda weird when it comes to UTF-8 encoding. In the past char* and std::string were considered to be using local encoding (which meant using code-pages on Windows), and this haven't changed, just Linux and others went with UTF-8 for all locales, but not Windows (Windows has unicode-aware API that uses wchar_t instead).
Changing std::string
to std::filesystem::path
may look as a solution, but I think it's not the right one. Maybe using std::u8string
instead of std::string
would be much better, but then we will hit more problems like poor support for std::u8string
in the rest of the C++ standard library.
Not sure what would be the best thing to do in this case, I would like to just tell C++'s filesystem that we have indeed an UTF-8 encoded string to avoid going through mbtowc.
@kobalicek
I agree with you, u8string()
is not yet widely used. It can cause more string-handling problems in this repository, I think.
Above all, we will have big problems with std::fstream.open(...)
if u8string()
is used as standard.
Currently, I only use u8string()
with the object
attribute:
std::filesystem::path file = L"G:\\Tuấn Anh\\tết.txt";
minio::s3::UploadObjectArgs args;
args.bucket = "storage";
args.object = file.filename().u8string();
I'm also not sure which is better. However, it seems we don't have much choice.
If you use L"G:\\Tuấn Anh\\tết.txt"
it means you are using wchar_t*
string and not char*
string. wchar_t
has a defined encoding by platform, but char
doesn't always mean UTF-8 and the problem here is UTF-8 vs 8-bit locale used by Windows.
When it comes to Windows the library code would most likely use WideChar API under the hood, so using wchar_t
string seems okay, but I think we should not expose wchar_t
in public API as it's not a that common outside Windows (unless you use .NET/Java that exposes UTF-16 string as part of the runtime).
I'm still not sure about the best solution here, whether exposing std::filesystem in public API makes sense or not.
what does AWS CPP library do in this scenario? @kobalicek
I think we should not expose wchar_t in public API
I think so, too. It's a hard change for current users.
what does AWS CPP library do in this scenario? @kobalicek
I have no idea.
Usually when you want UTF-8 file naming on Windows you would convert your UTF-8 string to Windows WideChar (UTF-16) and use WinAPI unicode-aware functions to access the file. And here the problem is file names on Windows platform can even have invalid surrogate pairs, etc... as the API doesn't validate the names, it just wants a sequence of 16-bit chars (similarly Linux API just wants a sequence of bytes). This means that if you use a regular validating UTF-8 -> UTF-16 conversion it could refuse your "invalid" file names. And quite frankly, I have no idea how C++ std::filesystem
even deals with that - could be implementation specific, etc...
BTW check this issue regarding path names on Windows:
This means that under windows we would want to use WTF-8 encoding instead of UTF-8:
That's why I don't really know the best way of solving this in minio-cpp. Ideally we should do the conversion ourselves to be sure, but honestly that just sounds like a lot of work (basically to write the code and have it tested). Depending on std::filesystem
is another option, but like everything in C++ the standard could just say this is implementation specific, which would mean that it's not guaranteed to work.
Looks like there is no simple solution available. Why can't we skip supporting such special use case in {Upload,Download}Object
methods? There is always a working APIs available like {Get, Put}Object
methods. {Upload,Download}Object
are just wrapper to these methods with additional file open/close. Let the special users use {Get,Put}Object
methods.
Looks like there is no simple solution available. Why can't we skip supporting such special use case in
{Upload,Download}Object
methods? There is always a working APIs available like{Get, Put}Object
methods.{Upload,Download}Object
are just wrapper to these methods with additional file open/close. Let the special users use{Get,Put}Object
methods.
yeah since this is a language specific problem, we can make this assumption and also add a doc line on what to do when a user wishes to deal with UTF-8.
Using Get|Put
is fine, but have a lot of work to do for self-handle.
At first, I just thought Upload|DownloadObject
needs a parameter type to pass to std::fstream(...)
, and std::filesystem
is well supported.
Using
Get|Put
is fine, but have a lot of work to do for self-handle. At first, I just thoughtUpload|DownloadObject
needs a parameter type to pass tostd::fstream(...)
, andstd::filesystem
is well supported.
std::fstream(...)
is already supported by {Get,Put}Object
APIs. There is no point to support in {Upload,Download}Object
APIs. {Get,Put}Object
is hard.
std::ifstream file(filename);
// construct PutObjectArgs
PutObjectResponse resp = PutObject(args);
file.close();
// Use resp
Hi. I have a problem with the
filename
attribute type ofDownloadObjectArgs
andUploadObjectArgs
. The current type isstd::string
, this leads to a file with utf-8 characters in the path that cannot read. Example path:G:\\Tuấn Anh\tết.txt
. So, I suggest replacingstd::string
withstd::filesystem::path
Thanks.