tweag / chainsail

Replica Exchange sampling as-a-service
MIT License
11 stars 1 forks source link

Cannot unzip results.zip #459

Closed steshaw closed 1 year ago

steshaw commented 1 year ago

In local development with minikube, I'm finding that the results.zip that I get using the following command does not unzip:

kubectl exec -it scheduler-worker-<tab complete here> -- \
  curl --output - "<URL from download button>" > results.zip

This command can be found in the docs here.

Here is an example URL that I used.

Example file: results.zip.

Here's what happens when I tried to unzip:

$ unzip results.zip
Archive:  results.zip
warning [results.zip]:  418 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [results.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

It seems to be a zip file:

$ file results.zip
results.zip: Zip archive data, at least v2.0 to extract, compression method=deflate

My version of unzip is 6.00:

$ unzip -V
UnZip 6.00 of 20 April 2009, by Info-ZIP.  Maintained by C. Spieler.  Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.

Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
  Default action is to extract files in list, except those in xlist, to exdir;
  file[.zip] may be a wildcard.  -Z => ZipInfo mode ("unzip -Z" for usage).

  -p  extract files to pipe, no messages     -l  list files (short format)
  -f  freshen existing files, create none    -t  test compressed archive data
  -u  update files, create if necessary      -z  display archive comment only
  -v  list verbosely/show version info       -T  timestamp archive to latest
  -x  exclude files that follow (in xlist)   -d  extract files into exdir
modifiers:
  -n  never overwrite existing files         -q  quiet mode (-qq => quieter)
  -o  overwrite files WITHOUT prompting      -a  auto-convert any text files
  -j  junk paths (do not make directories)   -aa treat ALL files as text
  -U  use escapes for all non-ASCII Unicode  -UU ignore any Unicode fields
  -C  match filenames case-insensitively     -L  make (some) names lowercase
  -X  restore UID/GID info                   -V  retain VMS version numbers
  -K  keep setuid/setgid/tacky permissions   -M  pipe through "more" pager
See "unzip -hh" or unzip.txt for more help.  Examples:
  unzip data1 -x joe   => extract all files except joe from zipfile data1.zip
  unzip -p foo | more  => send contents of foo.zip via pipe into program more
  unzip -fo foo ReadMe => quietly replace existing ReadMe if archive file newer

I found an article that suggested using zip -FF to fix broken zip files:

✗ zip -FF results.zip --out fixed.zip
Fix archive (-FF) - salvage what can
 Found end record (EOCDR) - says expect single disk archive
Scanning for entries...
 copying: /optimization_run0/config.yml  (241 bytes)
 copying: /optimization_run0/dos.pickle  (339 bytes)
 copying: /optimization_run0/energies/energies_replica1_0-500.pickle  (586 bytes)
 copying: /optimization_run0/energies/energies_replica1_500-1000.pickle  (610 bytes)
 copying: /optimization_run0/energies/energies_replica2_0-500.pickle  (620 bytes)
 copying: /optimization_run0/energies/energies_replica2_500-1000.pickle  (622 bytes)
 copying: /optimization_run0/energies/energies_replica3_0-500.pickle  (633 bytes)
 copying: /optimization_run0/energies/energies_replica3_500-1000.pickle  (637 bytes)
 copying: /optimization_run0/energies/energies_replica4_0-500.pickle  (635 bytes)
 copying: /optimization_run0/energies/energies_replica4_500-1000.pickle  (623 bytes)
 copying: /optimization_run0/energies/energies_replica5_0-500.pickle  (601 bytes)
 copying: /optimization_run0/energies/energies_replica5_500-1000.pickle  (602 bytes)
 copying: /optimization_run0/final_stepsizes.pickle  (170 bytes)
 copying: /optimization_run0/samples/samples_replica1_0-500.pickle  (1860 bytes)
 copying: /optimization_run0/samples/samples_replica1_500-1000.pickle  (1925 bytes)
 copying: /optimization_run0/samples/samples_replica2_0-500.pickle  (1970 bytes)
 copying: /optimization_run0/samples/samples_replica2_500-1000.pickle  (2002 bytes)
 copying: /optimization_run0/samples/samples_replica3_0-500.pickle  (2008 bytes)
 copying: /optimization_run0/samples/samples_replica3_500-1000.pickle  (2049 bytes)
 copying:   (526861 bytes)

Then I can unzip the fixed.zip file with some warnings:

$ unzip ../fixed.zip
Archive:  ../fixed.zip
warning:  filename too long--truncating.
warning:  stripped absolute path spec from /optimization_run0/config.yml
  inflating: optimization_run0/config.yml
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/dos.pickle
  inflating: optimization_run0/dos.pickle   bad CRC 87eb77ab  (should be b8a5f015)
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica1_0-500.pickle
  inflating: optimization_run0/energies/energies_replica1_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica1_500-1000.pickle
  inflating: optimization_run0/energies/energies_replica1_500-1000.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica2_0-500.pickle
  inflating: optimization_run0/energies/energies_replica2_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica2_500-1000.pickle
  inflating: optimization_run0/energies/energies_replica2_500-1000.pickle  (incomplete l-tree)
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica3_0-500.pickle
  inflating: optimization_run0/energies/energies_replica3_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica3_500-1000.pickle
  inflating: optimization_run0/energies/energies_replica3_500-1000.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica4_0-500.pickle
  inflating: optimization_run0/energies/energies_replica4_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica4_500-1000.pickle
  inflating: optimization_run0/energies/energies_replica4_500-1000.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica5_0-500.pickle
  inflating: optimization_run0/energies/energies_replica5_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/energies/energies_replica5_500-1000.pickle
  inflating: optimization_run0/energies/energies_replica5_500-1000.pickle
warning:  stripped absolute path spec from /optimization_run0/final_stepsizes.pickle
  inflating: optimization_run0/final_stepsizes.pickle   bad CRC 1a7b966d  (should be ac3b76e8)
warning:  stripped absolute path spec from /optimization_run0/samples/samples_replica1_0-500.pickle
  inflating: optimization_run0/samples/samples_replica1_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/samples/samples_replica1_500-1000.pickle
  inflating: optimization_run0/samples/samples_replica1_500-1000.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/samples/samples_replica2_0-500.pickle
  inflating: optimization_run0/samples/samples_replica2_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/samples/samples_replica2_500-1000.pickle
  inflating: optimization_run0/samples/samples_replica2_500-1000.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/samples/samples_replica3_0-500.pickle
  inflating: optimization_run0/samples/samples_replica3_0-500.pickle
  error:  invalid compressed data to inflate
warning:  stripped absolute path spec from /optimization_run0/samples/samples_replica3_500-1000.pickle
  inflating: optimization_run0/samples/samples_replica3_500-1000.pickle
  error:  invalid compressed data to inflate
warning:  filename too long--truncating.
mapname:  conversion of  failed
steshaw commented 1 year ago

I tried p7zip instead of unzip. Perhaps these errors will be illuminating:

$ nix-shell -p p7zip --command '7z e results.zip'

7-Zip [64] 17.05 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.05 (locale=en_AU.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs x64)

Scanning the drive for archives:
1 file, 107224 bytes (105 KiB)

Extracting archive: results.zip

ERRORS:
Headers Error
Unconfirmed start of archive

WARNINGS:
There are data after the end of archive

--
Path = results.zip
Type = zip
ERRORS:
Headers Error
Unconfirmed start of archive
WARNINGS:
There are data after the end of archive
Physical Size = 304
Tail Size = 106920
Characteristics = Local

ERROR: Data Error : /optimization_run0/config.yml

Sub items Errors: 1

Archives with Errors: 1

Warnings: 1

Open Errors: 1

Sub items Errors: 1
simeoncarstens commented 1 year ago

I can confirm that something is broken with the zip file. When I run a simulation and download the zip file according to the instructions and then use unzip (under NixOS) to unzip it, I get

Archive:  results.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of results.zip or
        results.zip.zip, and cannot find results.zip.ZIP, period.

Note that this is a different error than @steshaw gets, with the same unzip version: in his case, "start of central directory not found", and in my case, "end-of-central-directory signature not found". This is all very mysterious :thinking: I'll check with a git blame if something in the zipping code changed (I don't believe it did).

simeoncarstens commented 1 year ago

The zipping of the results happens in https://github.com/tweag/chainsail/blob/main/app/scheduler/chainsail/scheduler/tasks.py#L166 and hasn't been changed since #372. And that definitely worked at the time, after all, the beta release happened and I recorded the walkthrough video. And even that change didn't touch the actual zipping code, only a bit where the blob path gets set. So I guess this behavior is due to some dependency update.

simeoncarstens commented 1 year ago

This (or more specifically, the comment) looks interesting and possibly relevant: https://stackoverflow.com/a/2463819/1656472 Note that we are currently using the "w" mode for writing into the BytesIO object via repeated zipfile.Zipfile.writestr calls, and not the "a" mode. Not sure it makes any difference - it did work before with "w".

steshaw commented 1 year ago

It's strange that you get a different error to me, so I've attached an example results.zip file in the main description above. When I unzip it, I get:

$ /usr/bin/unzip results.zip
Archive:  results.zip
warning [results.zip]:  319 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [results.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
steshaw commented 1 year ago

I created a little example Python script which creates an in-memory zip file and then writes it out to a file. The resultant zip file unzips without problem 🤔.

https://gist.github.com/steshaw/f090e76dab4c60bd7e91272ac20e0a68

Using mode of 'a' or of 'w' doesn't seem to make a difference. Both work.

simeoncarstens commented 1 year ago

It's strange that you get a different error to me, so I've attached an example results.zip file in the main description above.

Thanks for uploading that file! When I try to unzip it, I now get exactly the same error as you do. At least that's reassuring... Given that your example script works without issues, I'm wondering whether this might have something to do with the minio storage backend / Kubernetes setup. The minio version never changed, but what did likely change was the Terraform version via the nixpkgs bump) and, with https://github.com/tweag/chainsail/commit/2d26ae0ee4f9565917a2f3fd795eca3f63c0226a, the version of the Kubernetes provider. Maybe investigating along these lines might prove interesting :thinking: but really, I'm just guessing.

steshaw commented 1 year ago

The Terraform version shouldn't affect the scheduler operating within k8s. Here my current plan:

I feel sure that we're missing something more obvious 🤷‍♂️.

steshaw commented 1 year ago

The long and short of it is that I've found a fix and a PR is to follow! However, I've sketched the investigation below.

Checking Python within the container

I checked that the test script works in the container. I check directly on the k8s scheduler pod by connecting with kubectl exec.

Checking MinIO

I used the MinIO client, mc, to connect to MinIO and cp the zip file. I did this inside the cluster on the scheduler pod. This retrieves an uncorrupted zip file. To repeat you need something like this:

curl -O https://gist.githubusercontent.com/steshaw/c8368cf04e19185e9a5f0825cf49b7e8/raw/ebcb41c1358d869ca4441dade1f360980b2a7abc/download-mc
bash download-mc
~/.local/bin/mc alias set myminio http://minio.default.svc.cluster.local:9000 chainsail chainsail
~/.local/bin/mc cp myminio/chainsail-samples/storage/10/results.zip .
unzip -d out results.zip # to check it

Curling via port-forwarding

So, it seemed that there was nothing wrong with the zip file inside MinIO. I also checked access via port-fowarding.

On local machine, set up the port forward:

kubectl port-forward service/minio 8090:9000 # in one terminal

In a separate terminal:

curl --output - 'http://localhost:8090/chainsail-samples/storage/10/results.zip?AWSAccessKeyId=chainsail&Signature=ibYSALGPSprPiqNgRkQEpDgRfDc%3D&Expires=1681961986' >|r3zip

The problem

So curling works. Therefore there had to be a problem with our command from the docs. The corrupt file is longer, so additional bytes are getting inserted. There had to be something different about our command from the docs:

kubectl exec -it scheduler-worker-<tab complete here> -- \
  curl --output - "<URL from download button>" > results.zip

I tried adding --silent which didn't change the outcome. Finally, I wondered why are we using -it. We don't need an interactive tty. Removing these options solves the problem. In fact, just removing -i is sufficient.

Note, I imagine that your walkthrough video was done with Chainsail on GCP. This is a slightly different scenario where MinIO isn't used (the storage goes on GCS).

simeoncarstens commented 1 year ago

Wow :exploding_head: what a somewhat anticlimactic resolution to this issue, but excellent work tracking this down! Thanks a lot! :heart: I'm wondering how the -it slipped into the docs. I mean, I obviously wrote it, but why :thinking: all I can think of is some thing that I was subconsciously remembering the docker exec -it some_image /bin/bash incantation I have been typing a lot these last months. And excellent point for the walkthrough video, which I shot indeed with a GCP deployment! TIL also about the mc MiniO client, thanks! :slightly_smiling_face:

steshaw commented 1 year ago

I had a feeling that it was something simple that we'd overlooked. Don't sweat it. The little -it looks pretty typical and innocent. I'll be wary of this in the future.