tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License
2.4k stars 399 forks

Error in submitting a training job into my Google Cloud account #201

Closed shreyjasani closed 6 years ago

shreyjasani commented 6 years ago

I am facing an issue with Google Cloud.

When I submit a training job to Google Cloud, I'm getting the following error:

  File "c:\programdata\anaconda3\lib\site-packages\luminoth\tools\cloud\gcloud.py", line 207, in train
    package_path = build_package(bucket, base_path)
  File "c:\programdata\anaconda3\lib\site-packages\luminoth\tools\cloud\gcloud.py", line 79, in build_package
    tarball_filename = os.listdir(output_dir)[0]
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\Users\SHREYJ~1\AppData\Local\Temp\tmpf6du9vxb\output'

Essentially, gcloud.py is not creating an output folder in the Temp directory. When I modified the gcloud.py code to create the output folder myself, nothing was written into it, and I got the following error instead:

  File "c:\programdata\anaconda3\lib\site-packages\luminoth\tools\cloud\gcloud.py", line 79, in build_package
    tarball_filename = os.listdir(output_dir)[0]
IndexError: list index out of range

Can you help me with this issue?

dekked commented 6 years ago

Hello!

Unfortunately, Luminoth has not been tested on Windows. As none of us have Windows machines, this is not something we can reproduce, so I will be guessing here.

What is probably failing is the subprocess.call. You could try playing around with that and see what the processes write to stdout. Maybe it just can't find python in the path?
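As a minimal sketch of that debugging step (not Luminoth's actual code), the snippet below runs a command and captures what it writes to stdout and stderr instead of discarding it. The helper name `run_and_capture` and the command being run are placeholders; the real command inside gcloud.py is not shown in this thread.

```python
import subprocess
import sys


def run_and_capture(cmd, cwd=None):
    """Run a command and return (return code, stdout text, stderr text)."""
    proc = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode bytes to str
    )
    return proc.returncode, proc.stdout, proc.stderr


# Placeholder command: swap in whatever gcloud.py actually runs.
code, out, err = run_and_capture([sys.executable, "--version"])
print("exit code:", code)
print("stdout:", out)
print("stderr:", err)
```

If the executable itself could not be found, `subprocess.run` would raise `FileNotFoundError` rather than return, so a non-zero exit code with text on stderr points at the command failing after it started.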

By the way, how did you modify the creation of the temporary folder to make it work?

shreyjasani commented 6 years ago

Ah I see, thanks for the quick response. I also believe subprocess.call is failing, and I think it isn't writing anything out to stdout. It is finding python in the path, so might be something else - and I'm not sure what. Let me know if you have any other ideas?

As for the modification, I simply added a mkdir command as shown in the image below:

[image: capture]

dekked commented 6 years ago

You should remove the stdout=dev_null and stderr=dev_null arguments, which we use so that the build does not pollute the standard output.

The reason you had to add that os.mkdir call is that Python itself (when doing the build) should be creating the directory. Since that is failing, you get the out of range error.
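To make that failure mode easier to spot, here is a hedged sketch of a guard around the `os.listdir(output_dir)[0]` lookup from the traceback. The helper name `first_built_file` is hypothetical and is not part of Luminoth; it only turns the bare `FileNotFoundError` or `IndexError` into a message that says the build step produced nothing.

```python
import os


def first_built_file(output_dir):
    """Return the first file name in output_dir, or raise a descriptive error."""
    if not os.path.isdir(output_dir):
        # The build step never created the directory at all.
        raise RuntimeError(
            "Build did not create {!r}; check the build command's output.".format(output_dir)
        )
    built = sorted(os.listdir(output_dir))
    if not built:
        # The directory exists (e.g. created by hand) but the build wrote nothing.
        raise RuntimeError(
            "Build wrote no files to {!r}; the build command probably failed.".format(output_dir)
        )
    return built[0]
```

Creating the directory by hand only moves the error from the first branch to the second, which matches the switch from `FileNotFoundError` to `IndexError` reported above.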

shreyjasani commented 6 years ago

Got it, thanks. I removed stdout and stderr, but I'm running into a new error: it's unable to find the setup.py file that is called during subprocess.call(). What do you think is happening now?

[image: capture]

dekked commented 6 years ago

I think it might have to do with the installation with Conda? (never tested this)

Check that the package_path that we build at the top of this function really points to the directory where setup.py is located. If it does not, then what is in this directory?
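One way to run that check, sketched under the assumption that you can print or copy the value of `package_path` from gcloud.py: the hypothetical helper `inspect_package_path` below reports whether setup.py is present and lists what the directory contains.

```python
import os


def inspect_package_path(package_path):
    """Return (has_setup_py, sorted directory listing) for package_path."""
    if not os.path.isdir(package_path):
        return False, []
    entries = sorted(os.listdir(package_path))
    has_setup = os.path.isfile(os.path.join(package_path, "setup.py"))
    return has_setup, entries


# Example: inspect the current directory; replace with the real package_path.
found, listing = inspect_package_path(os.getcwd())
print("setup.py present:", found)
print("contents:", listing)
```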

shreyjasani commented 6 years ago

Thanks for the tip. I re-installed using git clone instead of Conda, and I also modified gcloud.py according to the commit [#202], and I was able to successfully submit a training job.

However, the training failed because of the following error highlighted in white below. I tried modifying my service account key for global GCP access, but that didn't help either. How could I fix this?

[image]

dekked commented 6 years ago

Can you try with the latest gcloud.py (in the same PR) and let me know?

It might have to do with mixing \ and / for the paths (shouldn't use os.path.join for remote paths as Windows has a different separator; I've fixed this now).
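The separator problem can be illustrated with the standard library alone. This is a sketch and not the actual fix in the PR: `ntpath` is the implementation behind `os.path` on Windows, while `posixpath` always joins with `/`. The bucket and file names are made up for the example.

```python
import ntpath      # what os.path resolves to on Windows
import posixpath   # always uses forward slashes


def remote_join(*parts):
    """Join parts of a remote (e.g. bucket) path with '/' on every platform."""
    return posixpath.join(*parts)


# Hypothetical remote location, for illustration only.
parts = ("gs://my-bucket", "jobs", "package.tar.gz")

windows_style = ntpath.join(*parts)   # inserts backslashes between parts
remote_style = remote_join(*parts)

print("os.path.join on Windows would give:", windows_style)
print("posixpath.join gives:", remote_style)
```

Using `os.path.join` remains appropriate for local paths, such as the temporary build directory; only paths that are sent to the remote side need the fixed `/` separator.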

shreyjasani commented 6 years ago

Yes, the latest PR worked. The training still failed though due to this issue with git, highlighted in white:

[image: capture]

shreyjasani commented 6 years ago

^nevermind. I have fixed this. Thanks

dekked commented 6 years ago

Thanks for your help @shreyjasani!

I am closing this issue now :tada: