guzba / zippy

Pure Nim implementation of deflate, zlib, gzip and zip.
MIT License
246 stars 29 forks source link

uses magic number rather than extension #20

Closed mantielero closed 3 years ago

mantielero commented 3 years ago

This PR aims to rely on the magic number rather than the extension. The rationale is that I face files that are .tar files, but they change the extension to something else.

This is the first time I make a PR, so not sure I did it right. I am not pro-dev so forgive my code quality.

guzba commented 3 years ago

Hi and thanks for the PR. Sorry for my slowness in taking a look.

I think you have a good idea here. It is smart to look at the file header to determine the format instead of just trusting the file extensions.

I'll be taking a look at this PR now and will have more thoughts after I've considered the idea and approach.

guzba commented 3 years ago

Hm, so I do have some concern with this approach. For example, it is possible for a valid uncompressed .tar file to be incorrectly identified as compressed format.

Do you know what file format these files are? If so, could you change their extensions before opening them?

If you do not know the format, you could inspect it and change the extension before calling through to zippy.

My concern is I do not want to break opening some valid tarball files in order to support opening files with incorrect file extensions. If the file is a .tar.gz, I should always treat it as a gzip compressed tarball because otherwise it I may break things for somebody else.

Alternatively, I could add an another open proc that allows you to specify the format. Something like tarball.open(data, tfGzip) or tarball.open(data, tfUncompressed) that kind of thing. Would that potentially be helpful?

mantielero commented 3 years ago

I am not sure if I am understanding your first sentence. A valid uncompressed tar file starts with the name[100]. So in order to identify wrongly as a compressed .tar, the name of the file embedded in the tar file should start with one of this sequences:

1F 8B: .tgz  
1F 9D: tar.z (tar zip) Lempel-Ziv-Welch   
1F a0: tar.z (tar zip) LZH  
FD 37 7A 58 5A 00: .tar.xz

I think that "1F" is not printable. The same occurs with "00". So those are quite unlikely for a filename.

guzba commented 3 years ago

I understand these are unlikely, but it is not invalid. I would be breaking some .tar files in an admittedly unlikely way to fix an equally strange situation for you (where the file extensions don't match for some reason).

I have concluded that if I receive a request to open a .tar.gz file, I should fail if it isn't a valid tar.gz file. Not secretly / internally re-interpret it as a different file format.

I do not like the idea of ignoring file extensions if they are present. It is similar to other libraries I work on. If someone calls a method to open a .png, I'm going to fail if the isn't a .png. I'm not going to detect its actually a .jpg or .svg and do that instead. This is my opinion.

I am open to adding a different version which takes a stream of bytes (string or stringstream or something) that can detect format. This would work for you and doesn't come with the extra file extension information causing issues for me. I think this is the correct way forward. You'd just readFile() then pass the string in to the function, instead of passing the file path directly.

You also didn't answer my question. Can't you just rename these files quick before opening them? It sounds like you already know the valid extension.

mantielero commented 3 years ago

Yes you are right. Please reject the PR.

guzba commented 3 years ago

Quick update here. First, to clarify: I do like the idea of supporting detecting the tarball format. I simply felt that ignoring the file extensions wasn't the right way to go about it.

I have added a new method in tarballs.nim that should work very well for you:

proc open*(tarball: Tarball, stream: Stream, tarballFormat = tfDetect)

You'd use it something like this:

let fs = newFileStream("tests/data/tarballs/dir.tar.gz")
defer: fs.close()

let tarball = Tarball()
tarball.open(fs) # tfDetect is the default so you don't need to pass it
...

This is included in the new zippy release 0.5.10. Let me know if you have any issues with it or if it doesn't quite work for you to see if it can be improved.