[Closed] shreyjasani closed this issue 6 years ago.
Hello @shreyjasani!
Can you list the contents of the bucket? In particular, the packages folder.
If the training did run, there must be a package in that folder. Maybe the line `package_files = tf.gfile.ListDirectory(train_packages_dir)` in `gcloud.py` is not returning what we expect, for some reason?
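For context, here is a hedged sketch of what that listing step presumably does (this is not Luminoth's actual code: the helper name and the `.tar.gz` filtering are my assumptions, and `os.listdir` stands in for `tf.gfile.ListDirectory`, which has the same "return the entries of a directory" contract but also works on `gs://` paths):

```python
# Hedged stand-in for the listing step in gcloud.py: list the packages
# directory and pick the built .tar.gz. An empty listing here would explain
# the "no package found" style failure described in this issue.
import os
import tempfile

def find_package(train_packages_dir):
    package_files = os.listdir(train_packages_dir)  # stand-in for tf.gfile.ListDirectory
    tarballs = [f for f in package_files if f.endswith(".tar.gz")]
    if not tarballs:
        raise ValueError("No package found in {}".format(train_packages_dir))
    return tarballs[0]

# Quick local check with a fake packages directory.
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "luminoth-0.0.3.tar.gz"), "w").close()
print(find_package(tmp))  # prints: luminoth-0.0.3.tar.gz
```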
Bucket contents of the training job:
And specifically, the packages folder:
So I'm seeing the tar.gz file in packages, so it might just be the `package_files` line.
Can you put a breakpoint before that line (pdb or ipdb) and see what's going on?
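The breakpoint suggestion can be sketched like this (the surrounding function is a hypothetical stand-in; only the `tf.gfile.ListDirectory` line comes from `gcloud.py`, and the `set_trace()` call is commented out so the snippet stays runnable):

```python
# Hypothetical sketch of where to break in gcloud.py. Uncomment the
# set_trace() line to get an interactive prompt right before the suspect call
# and inspect train_packages_dir and package_files.
import os
import tempfile

def debug_listing(train_packages_dir):
    # import pdb; pdb.set_trace()   # or: import ipdb; ipdb.set_trace()
    # In gcloud.py this is: package_files = tf.gfile.ListDirectory(train_packages_dir)
    package_files = os.listdir(train_packages_dir)  # local stand-in
    print("train_packages_dir =", train_packages_dir)
    print("package_files =", package_files)
    return package_files

tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "luminoth-0.0.3.tar.gz"), "w").close()
result = debug_listing(tmp)
```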
Given the content in the directory, it really looks like it shouldn't fail, unless `tf.gfile.ListDirectory` is returning empty for some reason on Windows.
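One Windows-specific possibility (an assumption on my part, not confirmed anywhere in this thread): if the packages path is built with `os.path.join`, Windows inserts backslashes, and a `gs://...\packages` prefix would not match anything in the bucket, so a listing against it could come back empty. The difference is easy to see with the platform-specific path modules:

```python
# Illustration of the assumed failure mode: ntpath mimics os.path on Windows,
# posixpath mimics it on Linux/macOS. The bucket and job names are made up.
import ntpath
import posixpath

job_dir = "gs://my-bucket/my-job"
windows_style = ntpath.join(job_dir, "packages")
posix_style = posixpath.join(job_dir, "packages")
print(windows_style)  # gs://my-bucket/my-job\packages
print(posix_style)    # gs://my-bucket/my-job/packages
```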
Forgot to say... a workaround would be to run eval with the `--rebuild` option. This should build and upload the package just like `train` did.
But it would be nice to get this fixed in case it's an easy-to-fix bug :)
Thanks Alan! It worked with `--rebuild`.
So is the tar.gz file generated by the evaluation job a checkpoint? I tried importing it directly, but it said it doesn't have the `metadata.json` file required for a checkpoint.
The .tar.gz is the package that contains Luminoth's source code, the same thing you see in GitHub here. It is needed by ML Engine, since all the dependencies that it installs for running your code/model must be encapsulated in a package.
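A quick way to see this for yourself is to list the tarball's members with the standard `tarfile` module (all filenames below are hypothetical stand-ins):

```python
# Build a tiny source-style tarball and list it, to illustrate that a package
# tarball contains code files rather than a checkpoint's metadata.json.
import os
import tarfile
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "luminoth", "train.py")
os.makedirs(os.path.dirname(src))
open(src, "w").close()

pkg = os.path.join(tmp, "luminoth-0.0.3.tar.gz")
with tarfile.open(pkg, "w:gz") as tar:
    tar.add(src, arcname="luminoth/train.py")

with tarfile.open(pkg, "r:gz") as tar:
    members = tar.getnames()
print(members)                      # ['luminoth/train.py']
print("metadata.json" in members)   # False: source package, not a checkpoint
```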
The checkpoint is what is stored in the job's directory (in this case, the bucket). That one does have the `metadata.json` file you are describing.
I am still wondering why this failed without `--rebuild`, though. I don't know if `ListDirectory` is not returning what we expect, but anyway, when we do implement #208 we should clean up the CLI a bit and avoid this case altogether.
I'm using Windows and Google Cloud. I'm getting the following error when using evaluate on my running job: