selmf / unarr

A decompression library for rar, tar, zip and 7z archives
GNU Lesser General Public License v3.0
73 stars 13 forks source link

encoding problems in zip files #9

Closed comicfans closed 4 years ago

comicfans commented 5 years ago

zip filename encoding can be many variants , which unarr not handled (I tried some zip with unarr, which didn't return correct filename), could unarr add interface to read raw filename instead of trying to decode them ? so client can try harder to guess correct encoding

comicfans commented 5 years ago

I take a quick hack here, is this acceptable ?

selmf commented 5 years ago

I'm not sure if adding format-specific functions like this is a good idea. Most of unarr's code wasn't written by me, so I have to study the issue myself first. If unarr gives you the wrong encoding, that is a bug and that bug should be fixed.

comicfans commented 5 years ago

I agree that unarr should keep interface generic and avoid format-specific function. unarr currently only try to convert from cp437 (seems this is from zip spec?), but in practice this is not the case , zip file encoding is a long standing problem , there're many zip file with non-standard encoding. afaik it's impossible to know actual encoding without some 'guess' work .

selmf commented 5 years ago

I don't think unarr needs to know the actual encoding. According to the zip specification the file headers must contain a flag indicating if the file name is encoded in UTF-8. So if this flag is set, it should be enough to just return the raw UTF-8 string as you suggested.

comicfans commented 5 years ago

seems that problem happend if flag is not set to utf8. it takes

 zip->entry.name = ar_conv_dos_to_utf8(name);

which do cp437 => utf8 convert, but this gives 'incorrect' result . taken from previous blog link

The Zip spec says that the only supported encodings are CP437 and UTF-8, 
but everyone has ignored that. 
Implementers just encode file names however they want 

there is also other zip tool which suffer from such problem, for example unzip.

because unarr will replace slash in filename before return result to user, so convert back 'incorrect' result back to raw is also not possible (maybe possible ?)

comicfans commented 5 years ago

with some checking, I think it is possible to convert back 'incorrect decoded' utf8 back to raw string( cp437 table used by unarr has unique utf8 mapping, and slash is compatible for both). the only problem is that cp437 has two dialects , unarr used the one from wiki (with wingdings) , not the DOSLatinUS . by using uchardet to guess correct encoding of raw string , I finally got correct result.

integrate uchardet into unarr may be infeasible, some API comment for zip filename (the problem and solution) will be helpful for developers. since caller didn't know the filename is converted from cp437 or just raw string as utf8, the problem still exists.

selmf commented 5 years ago

I see. This is more complicated than I thought. I agree that unarr should provide a mechanism that allows access to the raw filename, but this has to be implemented in a more generic way. Ideally, unarr would not do any character decoding at all and leave this step to the caller. Instead of adding a raw_filename entry we could add a character_encoding entry and keep the filename raw. That way the caller could decide to use it raw, handle conversion himself or use unarr's conversion.

comicfans commented 5 years ago

That's better approach , I hope new API keeps clean and easy as previously. I'm afraid it may break ABI and introduces extra work for simple usecases ,since encoding may not be a problem for other formats.

selmf commented 5 years ago

I'm not interested in making the API unnecessary complicated or breaking ABI. I will do some research on how other libraries or unpackers are handling this and then decide what I am going to do.

My current idea is to have something like this:

const char ar_entry_get_raw_name(ar_archive ar); unarr_encoding ar_entry_get_encoding(ar_archive *ar);

Unarr_encoding will be an enum, NULL will indicate unknown encoding. The old ar_entry_get_name function will stay. But that is my idea now. If I have a better idea I will do it differently :)

selmf commented 4 years ago

I recently had some time to think about this issue again and I would like to include your solution in the next unarr release. There is no use in waiting for a more general approach when a simple solution exists and I can always map the zip specific raw name function to a more generic solution if I ever implement it.

For the PR, it would be great if you could implement the following changes:

We also should discuss how to describe this function in the header. There should be information on why this function exists and how to handle the raw filenames (uchardet or other method).

comicfans commented 4 years ago

Thanks for consider including this as solution, but currently I have very limited spare time to work on this, I'll reconstruct my commit as your suggestion in next few days.

selmf commented 4 years ago

No problem. I just picked up work again recently, so there is no hurry and I let you code sit long enough, so a few days or even a bit longer won't be an issue.

selmf commented 4 years ago

Actually, if you're a bit short on time right now and rather would like me to finish the code I'd be fine with that too ;)

comicfans commented 4 years ago

Appreciated that if you can finish the code! seems that I still don't have too much spare time on this

selmf commented 4 years ago

Great! Just create a PR from your master and I'll take over from that. If you have the time, a quick rebase would also help ;)

comicfans commented 4 years ago

a little bit late... here it is

selmf commented 4 years ago

Question: How do you build/use the code and the function for the raw filename?

I've added a call to it to the unarr-test sample code and I get warnings about implicit function declaration and cast from int to pointer, followed by a segfault when I run it.

Note that this is not a complaint, I'm not really a C wizard myself and I'm just trying to understand why my sample application is complaining while it probably works for you :)

comicfans commented 4 years ago

@selmf

Seems that problem is due to incomplete CMakeLists.

in main CMakeLists unarr didn't define target public include directories, so under test dir example code looks into system for unarr.h (which is old one) instead of project's.

setting

target_include_directories(unarr PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})

in main CMakeLists resolve this. I suggest that moving unarr.h into a individual include dir as a complete solution.

selmf commented 4 years ago

Good catch. I tested if it was linking to the correct library but forgot to check if it includes the wrong header. A complete solution needs to take into account both local builds and system-installed headers, but now that I know what I am looking for this should be solvable :)

selmf commented 4 years ago

Ok, I finally finished this. I ended up modifying your code more than I was initially planning, but I think the end result is worth it.

To be specific, I had to add at least some boilerplate code to the other archive formats to prevent users from running the raw zip filename function on non-zip functions, so I added a "raw" parameter to the internal function for retrieving filenames and defaulted it to NULL on non-zip archives. Doing so also allowed me to merge the raw filename stuff into the regular zip name function, so I am quite happy with the outcome.

Feel free to test it a bit and give me some feedback :)

comicfans commented 4 years ago

Great!

Dennis-Zhang-SH commented 1 year ago

I don't think unarr needs to know the actual encoding. According to the zip specification the file headers must contain a flag indicating if the file name is encoded in UTF-8. So if this flag is set, it should be enough to just return the raw UTF-8 string as you suggested.

Is it able to provide a flag indicates that filename and file content may not encode in UTF-8?