processwire / processwire-requests

ProcessWire feature requests.

Retain Non-ASCII characters when uploading files #56

Open gideonso opened 7 years ago

gideonso commented 7 years ago

Short description of the enhancement

Make file names with non-ASCII characters possible.

Optional: Steps that explain the enhancement

  1. Upload a file with non-ASCII characters in its name.
  2. The file name is preserved as is.

Current vs. suggested behavior

Current: All non-ASCII characters are stripped. Suggested: Preserve all non-ASCII characters.

Why would the enhancement be useful to users?

Asian users commonly use non-ASCII characters in file names. It would be good not to have to rename files before uploading them.

Optional: Screenshots/Links that demonstrate the enhancement

[Screenshots attached: "1", "2", "2023-02-16 11-04-11 screenshot", "3"]

BitPoet commented 7 years ago

If this feature is added, it might be worth adding checks for Windows systems. On Windows, PHP versions < 7.1 do not support characters outside of the active locale. Starting with 7.1, everything should be fine.

ivangretsky commented 7 years ago

+1 for this. Very important for international community))

teppokoivula commented 7 years ago

Current behaviour has been very problematic for some (many) of our use cases. +1 from here too.

As a matter of fact I've got one project on my desk right now that needs uploaded files to remain as-is, since they are actually consumed by another tool that requires specific formatting. Currently I'm at a loss about how to implement this without "reinventing the wheel", i.e. creating my own file field.

gideonso commented 7 years ago

I desperately need this. If someone can build it, can we raise some funds for it?

matjazpotocnik commented 7 years ago

I have a working solution here, or let's say a proof of concept. Changes to the core files are needed.

gideonso commented 7 years ago

Oh, this is good news. Maybe you could share a git repository that we can download and try?

Gideon

matjazpotocnik commented 7 years ago

I need to do some more testing. Unfortunately, I can't create a module since a lot of methods are not hookable.

ivangretsky commented 7 years ago

@matjazpotocnik Seems to be a perfect case for a PR.

matjazpotocnik commented 7 years ago

I won't make a PR as it would not be accepted by Ryan, and I understand why. You are looking for trouble if you want to support non-ASCII in uploads. It's not the problem making the non-ascii filenames to get uploaded and displayed in the file list, but how would that file be stored and represented on the filesystem. I would rather leave the core files intact and make a module, but then again, some methods in the core would need to be hookable, and you see how the number of feature requests is growing here... As BitPoet said, PHP 7.1 supports UTF-8 filenames on Windows regardless of the OEM codepage; we will see what that brings to the picture.

ivangretsky commented 7 years ago

Maybe just make a PR to make needed methods hookable?

matjazpotocnik commented 7 years ago

Hmm, I will have to sleep on it (again) and maybe go in another direction that wouldn't require so much hooking.

teppokoivula commented 7 years ago

It's not the problem making the non-ascii filenames to get uploaded and displayed in the file list, but how would that file be stored and represented on the filesystem.

Somewhat curious: what kind of problems would this cause? Personally I'd suggest making this possible at the core level, unless there's a very strong reason not to.

Make it a configurable option for all I care, but it's such a common need for non-English folks that IMHO it shouldn't be left out.

So far the only problems I can think of seem to be a) some potential for confusion regarding case (in)sensitive file handling by the OS, and b) the general idea that by filtering input extra carefully you can avoid potential issues on the output phase.

matjazpotocnik commented 7 years ago

Somewhat curious: what kind of problems would this cause?

How would you like to see the file "test_漢字汉字.txt" on the file system? As "test_漢字汉字.txt" or "test_漢字汉ĺ.txt"? The first version is created on windows using wfio, the second version is what windows do by itself. If PHP 7.1 solves this, then we are on a good path.

Personally I'd suggest making this possible at the core level, unless there's a very strong reason not to.

I agree, but you have to address Ryan for this and that's what we are doing here :-)

teppokoivula commented 7 years ago

The first version is created on windows using wfio, the second version is what windows do by itself.

I wouldn't see weird crap like that, because I don't use Windows ;)

Jokes aside, I was admittedly a bit worried that this might be a Windows-specific issue, and turns out it is. One option would be making this a configurable option and disabled by default, with proper warnings about Windows being a major jerk in this regard. That's what I'd do, anyway.

I agree, but you have to address Ryan for this and that's what we are doing here :-)

Definitely. Was just commenting what you said above, i.e. "I would rather leave the core files intact and make a module". I wouldn't :)

szabeszg commented 7 years ago

"...configurable option and disabled by default, with proper warnings about" any sort of incompatibility issues that might emerge. I support this :) Core support with some "shortcomings" regarding server compatibility issues is a lot better than having nothing. If ProcessWire can support most web servers out there, then that is a pretty solid start.

matjazpotocnik commented 7 years ago

One option would be making this a configurable option and disabled by default, with proper warnings about

Would this configurable option be part of the input field or a generic option in config.php?

about Windows being a major jerk in this regard.

I'm not linux/mac user so I can't make comments on this, but from my very limited testing, linux is not better in this regard. I suppose it has to be configured somehow to support filenames in UTF-8 (locale?)?

BitPoet commented 7 years ago

I'm not linux/mac user so I can't make comments on this, but from my very limited testing, linux is not better in this regard. I suppose it has to be configured somehow to support filenames in UTF-8 (locale?)?

UTF-8 filenames are supported by all standard file systems on *nix OSes. Problems usually only arise when the shell (command line) is configured to use a non-utf8 locale or old non-utf8 applications are invoked. All halfway current versions of Apache (and, importantly, also mod_rewrite) support utf8.

A check for the combination of Windows + PHP < 7.1 together with a big red warning should IMHO be a sensible approach.
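A minimal sketch of such a check, assuming nothing about where ProcessWire would run it or how the warning would be displayed. The function name `nonAsciiFilenamesAtRisk` is hypothetical and not part of any ProcessWire API; type declarations are left out on purpose so the code also parses on the old PHP versions it is meant to detect.

```php
<?php
// Hypothetical helper, not part of ProcessWire: reports whether the
// combination discussed above (Windows + PHP older than 7.1) applies.
function nonAsciiFilenamesAtRisk($os, $phpVersion)
{
    // PHP_OS starts with "WIN" on Windows builds (for example "WINNT").
    $isWindows = strtoupper(substr($os, 0, 3)) === 'WIN';
    return $isWindows && version_compare($phpVersion, '7.1.0', '<');
}

if (nonAsciiFilenamesAtRisk(PHP_OS, PHP_VERSION)) {
    // Where and how the warning is shown is left open in this sketch.
    echo "Warning: non-ASCII file names are unreliable on Windows with PHP < 7.1\n";
}
```

Passing the OS and version in as arguments keeps the check easy to exercise on any machine; a site moved to another server would simply be re-evaluated on the next call.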

Though, to keep things simple at first, just making the necessary methods hookable through a PR and putting things into a module might still be the quicker way and let Ryan sleep easier. Once there has been some successful production testing, things could be moved into the core.

matjazpotocnik commented 7 years ago

UTF-8 filenames are supported by all standard file systems on *nix OSes. Problems usually only arise when the shell (command line) is configured to use a non-utf8 locale or old non-utf8 applications are invoked. All halfway current versions of Apache (and, importantly, also mod_rewrite) support utf8.

I know UTF-8 filenames are supported on *nix systems, but in the tests I conducted (thx tpr), filenames were not stored in UTF-8. My guess is that you have to convince Apache and PHP to work with UTF-8 encoding. How do you do that on shared hosting without shell access to set up the locale (if that is what you are talking about)? My simple test was with

file_put_contents ("Árvíztűrő tükörfúrógép.txt", "data");

And I see the file:

Árvíztűrő tükörfúrógép.txt
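One plausible way such a name ends up looking garbled, offered here only as an illustration: PHP hands the UTF-8 bytes of the name to the file system unchanged, and whatever lists the file afterwards assumes a single-byte encoding. The sketch below simulates that misreading for ISO-8859-1; `viewAsLatin1` is a hypothetical helper written for this example, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: shows how a UTF-8 byte string looks to software that
// assumes ISO-8859-1, by mapping every byte to the code point of the same value.
function viewAsLatin1($bytes)
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $o = ord($bytes[$i]);
        // Bytes below 0x80 are plain ASCII; all others become two-byte UTF-8.
        $out .= ($o < 0x80)
            ? $bytes[$i]
            : (chr(0xC0 | ($o >> 6)) . chr(0x80 | ($o & 0x3F)));
    }
    return $out;
}

$name = "Árvíztűrő tükörfúrógép.txt";
echo strlen($name), " bytes\n";   // byte length, which exceeds the character count
echo viewAsLatin1($name), "\n";   // garbled view of the very same bytes
```

If this is what is happening, the file on disk is intact and only the display layer needs to be told that the names are UTF-8.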

BitPoet commented 7 years ago

The characters in Apache-generated file listings should be shown correctly if you set AddDefaultCharset UTF-8 in httpd.conf or, in the .htaccess in the directory with the files, IndexOptions +Charset=UTF-8, as documented in the mod_autoindex docs.

teppokoivula commented 7 years ago

A check for the combination of Windows + PHP < 7.1 together with a big red warning should IMHO be a sensible approach.

Technically yes, but unexpected things can still happen if the site is moved to another server etc. This would be fine as an addition, as long as there's a clear warning that's always visible :)

teppokoivula commented 7 years ago

My simple test was with ... And I see the file ...

So far I've been unable to reproduce this, seems to work just fine on a pretty out-of-the-box Ubuntu installation at least. Are you seeing strange characters in the actual filename on the disk (via shell) or in a file listing, i.e. in a browser? If it's in a file listing, could you check the file name on the disk just to make sure? :)
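For anyone without shell access, a rough way to inspect the name on disk from PHP itself is to dump the stored name as hex, which takes the terminal and the browser out of the picture. This is only a sketch: `storedNamesAsHex` is a hypothetical helper, it writes to the system temp directory, and the result depends on the file system and PHP version in use.

```php
<?php
// Sketch: write a file with a non-ASCII name into a scratch directory and
// return the stored names as hex strings, then clean up again.
function storedNamesAsHex($name)
{
    $dir = sys_get_temp_dir() . '/utf8-name-test-' . uniqid();
    mkdir($dir);
    file_put_contents($dir . '/' . $name, 'data');

    $hex = array();
    foreach (scandir($dir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $hex[] = bin2hex($entry);
        unlink($dir . '/' . $entry);
    }
    rmdir($dir);
    return $hex;
}

$name = "Árvíztűrő tükörfúrógép.txt";
echo "sent:   ", bin2hex($name), "\n";
echo "stored: ", implode(', ', storedNamesAsHex($name)), "\n";
```

If the two hex strings match, the name was stored byte for byte and any odd characters come from how the listing is displayed.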

matjazpotocnik commented 7 years ago

I finally got my hands on a Linux box with root shell access: Ubuntu 16.10 with Apache 2 and PHP 7.0. I had to tweak the Apache and PHP configuration to make it work, but it looks like it's working.

In my previous tests on Linux, the server was not properly configured for UTF-8. One server was creating files in ISO-8859-1 encoding, the other in ISO-8859-2. While Windows stores file names in UTF-16 internally, it converts them to the configured locale, in my case Windows-1250. Uploads work on Windows too (on IIS 8.5), but PHP 7.1 is mandatory! Attached are two recordings as proof of concept.

Changes to the core files are minimal, so I think there is no need for a module. I didn't make a PR as I think Ryan might go his own route (if at all). Instead I have created a zip file with the changed files (PW 3.0.42). If anyone wants to try this, just replace the core files; there is a readme.txt with instructions and a list of what has changed.

https://www.dropbox.com/s/h1by4bm8j49jo7o/Upload%20demo%20windows.gif
https://www.dropbox.com/s/cuu1tg7ie83li26/Upload%20demo%20linux.gif
https://www.dropbox.com/s/dduaqkd6r68m8gn/Upload%20demo.zip

gideonso commented 7 years ago

Looks good. I will test it and report back.

gideonso commented 7 years ago

Hi @matjazpotocnik ,

Works nicely. How about making a PR to see if @ryancramerdesign would like to bring it into the core?

Gideon

matjazpotocnik commented 7 years ago

There are a lot of PRs already in the queue and just making another one won't help. Ryan will decide how, when and if this will find a way to the core.

teppokoivula commented 6 years ago

Loosely related topic on the support forum: https://processwire.com/talk/topic/18354-no-lowercase-unzipped-files/. I'm still hoping that we can one day instruct ProcessWire to just keep filenames as-is. There are legitimate use cases for that.

Ping @ryancramerdesign.

Toutouwai commented 6 years ago

Alpha proof-of-concept module for anyone interested in exploring the idea: https://github.com/Toutouwai/FieldtypeFileUnrenamed/

gideonso commented 6 years ago

@Toutouwai I installed the module and it doesn't seem to have any effect. Did I miss anything?

Toutouwai commented 6 years ago

@gideonso, maybe you didn't create a new "Files Unrenamed" field? If that's not it then sorry, I don't know. That module is just a proof-of-concept demonstration - it's not a released module that I'm providing support for I'm afraid.

gideonso commented 6 years ago

@Toutouwai , it is OK. Just wrote to see if you have any idea. Let's wait for the official solution if it comes one day.

gideonso commented 1 year ago

Almost 7 years since I opened this request, and I am still waiting for a proper solution. @ryancramerdesign any interest in bringing this into the core?

BernhardBaumrock commented 1 year ago

@gideonso I don't understand the problem (as I've never had it myself in 9 years). The description is a little short.

Is the problem that such a file can't be uploaded? Or is the problem that the file gets renamed and has a different name after upload?

If it's the former I can understand that this is annoying. If it's the latter it would be nice to give an explanation WHY this is such a big problem for you.

gideonso commented 1 year ago

@BernhardBaumrock Yes, files with non-ASCII characters can still be uploaded, but all the non-ASCII characters are replaced. Not all users are comfortable with English, and some of them need to use Chinese characters in file names. When they upload such a file to ProcessWire, the file name becomes unreadable. It is very unfriendly.

BernhardBaumrock commented 1 year ago

Optional: Screenshots/Links that demonstrate the enhancement Your screenshots/links go here.

Could you please add screenshots so that people who never work with non-ASCII filenames can better understand what that looks like and why the problem is so prominent?

Personally I've never ever worried about PW changing filenames but it sounds like what you suggest would be helpful to others as well, so in your situation I'd try to make it as clear and obvious for others (like me or ryan) to understand (see) the problem. That would likely increase the chance of being heard.

Waiting for a solution for years is maybe not the best option you have ;) Ryan is doing all the work for free and there are often requests that simply fade away and nobody asks for them any more. It would be more than inefficient to solve old problems that are not an issue any more for whatever reason.

If your problem still persists sometimes it also helps to rephrase it. For example I've created a request once for tabulator.info and simply got rejected. A year later I made the same request but used just slightly different words and boom - the author jumped on the train, saw the benefit and bumped a completely new version with a totally new event concept (https://tabulator.info/docs/5.4/events-internal)

BernhardBaumrock commented 1 year ago

PS: I'd also try to explain what you did to make Robin's module work for you, and describe exactly what is not working and why it is not a solution (as it sounds to me like it does exactly what you requested).

teppokoivula commented 1 year ago

If it's the former I can understand that this is annoying. If it's the latter it would be nice to give an explanation WHY this is such a big problem for you.

Another use case is that sometimes you have files that, for one reason or another, should retain their original name. For example, back in the day I was working on a tool that bundled content into "HTML banners" that would then be uploaded to the site by the client.

In theory super easy and would've worked perfectly fine with file field — it was a nice bonus that the files were bundled in a ZIP file, which ProcessWire automatically unzipped — except that the software that our client used created files with non-ASCII characters in them, so uploading them to ProcessWire would break references here and there.

Of course we can build custom file fields / upload tools to handle anything that requires this, but it would've been pretty neat not to have to do that.

Just for the record: I'm not involved with aforementioned project anymore, and haven't had to deal with this issue in years, apart from a few requests from clients. As such I don't have a strong interest in this issue myself. Just wanted to point out that yes, this is still an active issue — and yes, it can still make things impossible or complicated for some use cases :)

gideonso commented 1 year ago

I added some screen shots.

The first and second show an attempt to upload a PDF file with Chinese and English characters in its name. After the upload, the Chinese characters are removed in the backend.

The third and fourth show an attempt to upload a PDF file with only Chinese characters in its name. After the upload, the Chinese characters are removed and the file is renamed to page_resources_files.pdf.

BernhardBaumrock commented 1 year ago

Thx for the screenshots and clarification @gideonso

I've had a look into InputfieldFile and unless I've missed something it seems it's not easy/possible to do with hooks, but the original filename is obviously there at some point (in processInput).

I'm not sure if it would be possible (or a good idea) to support non-ascii filenames, but maybe @ryancramerdesign can save the original filename of the uploaded file to a new property of PageFile, like uploadName or originalName or such. Would that be a proper solution for you?

matjazpotocnik commented 1 year ago

"...new property of PageFile, like uploadName or such". Oh, good idea, I never thought of that! That might work in some situations, but I guess there are other cases where the uploaded filename has to stay as is?

BernhardBaumrock commented 1 year ago

@gideonso I've just liked the issue, maybe that helps to draw attention to it. You could also ask people in the forum to like the issue as well if they think it is a good idea or they had problems with it themselves.

ryancramerdesign commented 1 year ago

While from a safety standpoint I think we have to limit what is allowed in the filename, I like the idea @BernhardBaumrock mentioned about storing the original filename, so that it is at least available if you need it. I will add this so that you can access it from a $pagefile->uploadName() method

gideonso commented 1 year ago

@BernhardBaumrock Thanks for helping to promote this request. @ryancramerdesign What about when we need to link to the file in CKEditor or TinyMCE? I think it will show the modified name rather than the soon-to-be-added uploadName. Linking to files in a Textarea is the real pain point for us here.

ryancramerdesign commented 1 year ago

I've added this so that it now stores the original filename with the file and can be accessed from $pagefile->uploadName(). Note that it is unsanitized so could potentially contain dangerous stuff in it, but at least it's there for those that might need it for one thing or another. I also updated InputfieldFile and InputfieldImage to display it in a tooltip.
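Because the stored name is unsanitized, it seems prudent to escape it before printing it into HTML. The sketch below assumes the value arrives raw and not already entity-encoded; `escapeUploadName` is a hypothetical helper, and a literal string stands in for `$pagefile->uploadName()` so the example runs outside ProcessWire.

```php
<?php
// Hypothetical helper: HTML-escape an upload name before printing it.
function escapeUploadName($uploadName)
{
    return htmlspecialchars($uploadName, ENT_QUOTES, 'UTF-8');
}

// In a ProcessWire template the input would come from $pagefile->uploadName();
// a literal stands in here so the sketch runs on its own.
$uploadName = '報告 <script>alert(1)</script>.pdf';
echo escapeUploadName($uploadName), "\n";
```

The non-ASCII characters survive untouched while the markup is neutralized, so the original name can be shown as a link label without trusting its content.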

BernhardBaumrock commented 1 year ago

@ryancramerdesign we had a user in the forum that reported that the uploadName property is not available (or sanitized) when adding files via the API:

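// Directory and original file name of the file to attach
$path = "/var/www/dvmrebuild/storage/";
$fileName = 'My_Test_File3$$$.pdf'; // single quotes keep the $ characters literal

$p = $pages->get(1214);

$p->of(false); // turn off output formatting before modifying the page
$p->venue_files->add($path . $fileName);
$p->save('venue_files');

// Compare the sanitized name with the stored upload name
foreach ($p->venue_files as $vfile) {
    echo $vfile->name . ' => ' . html_entity_decode($vfile->uploadName) . '<br>';
}

# Outputs:
# my_test_file3.pdf => my_test_file3.pdf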

This looks like a bug to me, no?

Here is the forum thread for reference and further details about the use case: https://processwire.com/talk/topic/28957-using-pagefile-uploadname-to-offer-downloads-of-files-with-their-original-filename/#comment-235449
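For the download use case in that thread, one common approach is to keep the sanitized name on disk and send the original name in the Content-Disposition header, using the `filename*` parameter from RFC 6266 alongside an ASCII fallback. This is a sketch under that assumption, not what ProcessWire does; `contentDisposition` is a hypothetical helper.

```php
<?php
// Hypothetical helper: build a Content-Disposition value that carries the
// original name in the filename* parameter plus a plain fallback name.
function contentDisposition($storedName, $originalName)
{
    // Strip characters that would break the quoted-string fallback.
    $fallback = str_replace(array('"', '\\', "\r", "\n"), '', $storedName);
    // rawurlencode() percent-encodes the UTF-8 bytes of the original name.
    $encoded = rawurlencode($originalName);
    return 'attachment; filename="' . $fallback . '"; filename*=UTF-8\'\'' . $encoded;
}

echo contentDisposition('my_test_file3.pdf', 'My_Test_File3$$$.pdf'), "\n";
```

The returned string would be passed to `header('Content-Disposition: ...')` before the file contents are sent; clients that understand `filename*` use the original name and others fall back to the sanitized one.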

gideonso commented 11 months ago

Happy 7th anniversary for this issue. Still in great need for this feature. Still hoping there will be a proper solution.

adrianbj commented 11 months ago

@gideonso - I haven't been following it too closely, but could the new uploadName() method be used in a hook (perhaps InputfieldFile::fileAdded) to rename the file? Maybe it will result in issues accessing the file or interacting with it in the PW admin - not sure, but thought I'd throw it out there as an idea.

gideonso commented 11 months ago

@adrianbj - Hey. Maybe worth a try. Will test it and let you know the result.

Toutouwai commented 11 months ago

@gideonso, I've posted a tutorial to the forum that has a couple of tips for transliterating non-latin characters and for showing the original filename when linking in CKEditor or TinyMCE. Might be something helpful there? https://processwire.com/talk/topic/29273-more-tips-for-pagefile-uploadname/

gideonso commented 11 months ago

@Toutouwai Wow! This is indeed helpful. At least we can see the original name. Wonderful.

gideonso commented 10 months ago

@matjazpotocnik I finally made the changes you suggested a few years ago and it still works well. Thanks for your effort. It really helps.