tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License

gcloud.py #229

Closed AshwinAce closed 5 years ago

AshwinAce commented 5 years ago

There appears to be a small error in the gcloud.py file in the luminoth/tools/cloud folder. The error is triggered when no bucket argument is given for storing logs. Line 226 contains

    bucket_name = 'luminoth-{}'.formata(account.client_id)

I changed that to

    bucket_name = 'luminoth-{}'.format(account.client_id)

and it seems to work fine now.
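A minimal sketch of the typo and the fix, outside Luminoth itself. The `Account` class and the `default_bucket_name` helper are hypothetical stand-ins for illustration only; the only thing taken from gcloud.py is the quoted line and the `client_id` attribute it reads.

```python
class Account:
    """Hypothetical stand-in for the account object read on line 226."""

    def __init__(self, client_id):
        self.client_id = client_id


def default_bucket_name(account):
    # Corrected call: str.format exists, str.formata does not.
    return 'luminoth-{}'.format(account.client_id)


account = Account('1234567890')  # made-up client id

# The misspelled method fails at attribute lookup, before any formatting.
try:
    'luminoth-{}'.formata(account.client_id)
except AttributeError as exc:
    typo_error = str(exc)

bucket_name = default_bucket_name(account)
print(typo_error)
print(bucket_name)
```

Because the misspelled call only runs when the default bucket name has to be built, passing a bucket explicitly would hide the problem, which matches the behaviour described above.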

AshwinAce commented 5 years ago

When I run lumi cloud gc jobs, I get

    Id: train_20181101_174123
    Created: 2018-11-01T21:41:30Z
    State: FAILED
    sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=6, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('192.168.1.4', 55354), raddr=('108.177.8.95', 443)>
    sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('192.168.1.4', 59232), raddr=('172.217.195.95',443)>

I am unsure if this happened because of the previous change. However, running the original code resulted in

    File "/home/ace/luminoth/luminoth/tools/cloud/gcloud.py", line 226, in train
      bucket_name = 'luminoth-{}'.formata(account.client_id)
    AttributeError: 'str' object has no attribute 'formata'

Also, Luminoth ran on my computer without any problems, so I am unsure why it fails here. The bucket where my dataset's TFRecords were uploaded is in us-east1, while the bucket where the logs would have been stored is in US.

dekked commented 5 years ago

Hi @AshwinAce!

Thanks for your report. This is a legit typo, and I have fixed it here https://github.com/tryolabs/luminoth/commit/4b81238ece406f4692707f616db6a3594e0078ec.

As for the warning, is it only a warning? Does the job not get submitted to ML Engine? What version of Python are you using?

AshwinAce commented 5 years ago

The job gets submitted, runs for a while, and then fails. I tried executing it another time with the same result. I am using Python 3.6.5.

One thing I did that might be a problem is that I tried changing num_epochs inside the config.yml file. I'm not sure whether that change worked; however, the job still runs on my laptop while failing in the cloud.

dekked commented 5 years ago

If the job gets submitted successfully and does start, it means it's not failing because of this warning, so lumi cloud gc worked :D
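To illustrate why a ResourceWarning on its own does not abort anything, here is a small sketch that is independent of Luminoth and of the Google client libraries. It assumes the warnings in the report come from sockets that were dropped without being closed, which is what the message text suggests but is not confirmed in this thread. The `leak_socket` helper is hypothetical.

```python
import gc
import socket
import warnings


def leak_socket():
    """Create a socket and drop it without calling close()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    del sock  # the finalizer, not close(), releases the file descriptor


with warnings.catch_warnings(record=True) as caught:
    # ResourceWarning is ignored by default, so enable it explicitly.
    warnings.simplefilter("always")
    leak_socket()
    gc.collect()  # make sure the finalizer has run on any interpreter
    finished = True  # execution continues past the warning

resource_warnings = [
    w for w in caught if issubclass(w.category, ResourceWarning)
]
print(len(resource_warnings), finished)
```

The warning is only reported; control flow carries on afterwards, which is consistent with the job being submitted despite the messages.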

Now, you must investigate why the job itself fails. This is a different issue, so I'm closing this. It will be helpful to look at the logs of the job in ML Engine; you can use lumi cloud gc logs for that.