tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License
2.4k stars, 399 forks

Evaluate function in Google Cloud #204

Closed shreyjasani closed 6 years ago

shreyjasani commented 6 years ago

I'm using Windows and Google Cloud. I'm getting the following error when using evaluate on my running job:

[screenshot of the error]

dekked commented 6 years ago

Hello @shreyjasani!

Can you list the contents of the bucket? And in particular, the packages folder.

If the training did run, there must be a package in that folder. Maybe the line package_files = tf.gfile.ListDirectory(train_packages_dir) in gcloud.py is not returning what we expect, for some reason?

shreyjasani commented 6 years ago

Bucket contents of the training job:

[screenshot]

And specifically, the packages folder: [screenshot]

I can see the tar.gz file in packages, so the problem might just be the package_files line.

dekked commented 6 years ago

Can you put a breakpoint before that line (pdb or ipdb) and see what's going on?

Given the contents of the directory, it really looks like it shouldn't fail, unless tf.gfile.ListDirectory is returning an empty list for some reason on Windows.
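For anyone trying the breakpoint suggestion, here is a minimal sketch of what could be inspected there. Everything in it is an assumption for illustration: inspect_listing is a hypothetical helper (not part of Luminoth), the bucket path is made up, and the backslash check is only one guess at what might differ on Windows, not a confirmed cause.

```python
import ntpath
import posixpath


def inspect_listing(list_dir, packages_dir):
    """Hypothetical helper: print exactly what is listed and what comes back."""
    # repr() makes stray backslashes or trailing whitespace in the path visible.
    print("listing:", repr(packages_dir))
    entries = list(list_dir(packages_dir))
    print("entries:", entries)
    # Keep only names that look like an uploaded package.
    return [name for name in entries if name.endswith(".tar.gz")]


# Made-up bucket path, for illustration only.
job_dir = "gs://my-bucket/my-job"

# ntpath is what os.path resolves to on Windows, posixpath on Linux/macOS.
windows_style = ntpath.join(job_dir, "packages")
posix_style = posixpath.join(job_dir, "packages")

print(repr(windows_style))
print(repr(posix_style))
```

At a pdb prompt this could be called as inspect_listing(tf.gfile.ListDirectory, train_packages_dir). If the printed path contains a backslash, the way the path is joined would be worth a closer look. If it does not, the listing call itself is the next suspect.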

dekked commented 6 years ago

Forgot to say... a workaround would be to run eval with the --rebuild option. This should build and upload the package just like train did.

But it would be nice to get this fixed, in case it's an easy-to-fix bug :)

shreyjasani commented 6 years ago

Thanks Alan! It worked with --rebuild

So is the tar.gz file generated by the evaluation job a checkpoint? I tried importing it directly, but it said that it can't be a checkpoint because it doesn't have the metadata.json file.

dekked commented 6 years ago

The .tar.gz is the package that contains Luminoth's source code, the same thing you see in GitHub here. It is needed by ML Engine, since all the dependencies that it installs for running your code/model must be encapsulated in a package.

The checkpoint is what is stored in the job's directory (in this case, the bucket). This does have the metadata.json file you are describing.
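To tell the two kinds of archive apart before trying an import, something like the sketch below could help. It assumes the import step only needs a member named metadata.json somewhere in the tar.gz, which is what the error message suggests. Luminoth's real check may differ, and has_metadata_json, write_archive and the member names are all hypothetical.

```python
import io
import os
import posixpath
import tarfile
import tempfile


def has_metadata_json(tar_path):
    """Return True if any member of the tar.gz is named metadata.json."""
    with tarfile.open(tar_path, "r:gz") as archive:
        return any(
            posixpath.basename(name) == "metadata.json"
            for name in archive.getnames()
        )


def write_archive(tar_path, member_names):
    """Build a tiny tar.gz with placeholder contents (demo data only)."""
    with tarfile.open(tar_path, "w:gz") as archive:
        for name in member_names:
            data = b"{}"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))


with tempfile.TemporaryDirectory() as tmp:
    package = os.path.join(tmp, "package.tar.gz")
    checkpoint = os.path.join(tmp, "checkpoint.tar.gz")
    # Made-up member names standing in for a source package and a checkpoint.
    write_archive(package, ["pkg/setup.py"])
    write_archive(checkpoint, ["ckpt/metadata.json", "ckpt/weights.bin"])
    package_ok = has_metadata_json(package)
    checkpoint_ok = has_metadata_json(checkpoint)

print(package_ok, checkpoint_ok)
```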

I am still wondering why this failed without --rebuild, though. Don't know if ListDirectory is not returning what we expect, but anyway, when we do implement #208 we should clean up the CLI a bit and avoid this case altogether.