umich-dbgroup / duoquest

Dual-specification query synthesis with natural language and table sketch queries
MIT License

socket.gaierror: [Errno -2] Name or service not known #2

Open. Chamberlain0w0 opened this issue 3 years ago.

Chamberlain0w0 commented 3 years ago

Hello, I tried to run the code on Ubuntu 18.04, following the steps in the QuickStart, and everything went well: I successfully set up the web interface, and I used `docker logs -f dq-main` to watch the real-time logs in the terminal. However, when I started a query task through the interface (that is, entered the query info and clicked the 'Run new query' button), I got this error in the logs, repeatedly:

```
DuoquestServer listening on port 6001...
Traceback (most recent call last):
  File "/home/duoquest/duoquest/nlq_client.py", line 21, in connect
    self.conn = Client(address, authkey=self.authkey)
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
    s.connect(address)
socket.gaierror: [Errno -2] Name or service not known
```

I wonder what the problem is. Is there something wrong with my network settings?

chrisjbaik commented 3 years ago

Hmm. What's happening is that the dq-main container is trying to connect to the dq-enum container on port 6000 (or whatever host/port is listed in the docker_cfg.ini file).

Either the network is misconfigured, or the dq-enum container isn't set up correctly. Could you look at that container's log and see what it says?
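A `gaierror` means the enumerator's hostname never resolved, so the failure happens before any authentication or protocol step. Here is one way to tell a DNS failure apart from a dead listener, mirroring the connection step in `duoquest/nlq_client.py` (the host/port below are assumptions; read the real values from `docker_cfg.ini`):

```python
import socket

def check_endpoint(host, port):
    """Classify a connection failure the way the multiprocessing
    Client() call in duoquest/nlq_client.py would hit it."""
    try:
        # Name resolution: this is the step that raises
        # socket.gaierror: [Errno -2] Name or service not known
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return 'dns-error'    # hostname does not resolve, e.g. the dq-enum
                              # container is down or on a different network
    try:
        with socket.create_connection((host, port), timeout=2):
            return 'reachable'
    except OSError:
        return 'unreachable'  # name resolves, but nothing is listening

# Assumed values; read the real host/port from docker_cfg.ini.
print(check_endpoint('dq-enum', 6000))
```

A 'dns-error' result points at the dq-enum container being down or not on the same Docker network as dq-main, which matches the crash described below.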

Chamberlain0w0 commented 3 years ago

Oh, I looked into the logs of dq-enum and found that as soon as I run dq-main, the dq-enum container stops instantly and reports an error like this:

```
Loading GloVE word embeddings...
Loading word embedding from /workspace/syntaxSQL/glove/glove.42B.300d.txt
Using fixed embedding
Traceback (most recent call last):
  File "main.py", line 258, in <module>
    main()
  File "main.py", line 169, in main
    config.get('syntaxsql', 'glove_path'), args.toy)
  File "main.py", line 92, in load_model
    table_type='std', use_hs=True)
  File "/workspace/syntaxSQL/supermodel.py", line 104, in __init__
    self.multi_sql = MultiSqlPredictor(N_word=N_word, N_h=N_h, N_depth=N_depth, gpu=gpu, use_hs=use_hs)
  File "/workspace/syntaxSQL/models/multisql_predictor.py", line 46, in __init__
    self.cuda()
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 113, in _apply
    self.flatten_parameters()
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 106, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS
```

chrisjbaik commented 3 years ago

Ah.. My guess is that it might have something to do with the GPU driver version and how it interacts with the CUDA version in the container... What type of GPU/what driver are you using?

Chamberlain0w0 commented 3 years ago

I suspected that at first, so I checked the versions, but they seemed to be all right. When I enter nvidia-smi in the terminal, I get:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     Off  | 00000000:3B:00.0 Off |                  Off |
| 31%   34C    P0     1W / 260W |      0MiB / 24190MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

The GPU is a Quadro RTX 6000, the NVIDIA driver version is 418.67, and the CUDA version is 10.1. They're not the newest, but the pairing seems right 😢😢 Shall I try updating the driver and CUDA?
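For what it's worth, 418.67 does meet the minimum Linux driver that NVIDIA's CUDA release notes list for CUDA 10.1 (418.39), so the host pairing itself looks fine; the real question is whether the driver also satisfies the CUDA version baked into the container. A sketch of that check, with minimum-driver values copied from the CUDA release notes (illustrative, not exhaustive; verify against the official table for your exact toolkit):

```python
# Minimum Linux driver required by each CUDA toolkit release,
# per NVIDIA's CUDA release notes (illustrative subset).
MIN_DRIVER = {
    '10.0': (410, 48),
    '10.1': (418, 39),
    '10.2': (440, 33),
    '11.0': (450, 36, 6),
}

def driver_supports(driver, cuda):
    """True if a driver string like '418.67' meets the CUDA minimum."""
    have = tuple(int(p) for p in driver.split('.'))
    need = MIN_DRIVER[cuda]
    width = max(len(have), len(need))
    pad = lambda t: t + (0,) * (width - len(t))  # (418, 67) -> (418, 67, 0)
    return pad(have) >= pad(need)

# The setup in this thread: driver 418.67 with CUDA 10.1 on the host.
print(driver_supports('418.67', '10.1'))
# The same driver cannot run a container built against CUDA 11.0.
print(driver_supports('418.67', '11.0'))
```

nvidia-docker mounts the host driver into the container, so it is the driver-vs-container-CUDA pair, not the host toolkit, that has to satisfy this inequality.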

chrisjbaik commented 3 years ago

Good question... I don't have an easy answer for this. I've always had to tinker around with the versions until it worked and it's pretty finicky...

Just to make sure, you did install this right? https://github.com/NVIDIA/nvidia-docker

Also, this is the version on the machine I was running on, but I somehow doubt it will make a difference:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
```

Chamberlain0w0 commented 3 years ago

Well, I think I installed nvidia-docker correctly, because I can successfully run the example they provide in the README. Also, I finished all the steps in the QuickStart and can reach the web interface correctly...

I think I'll still try updating the driver and CUDA versions first, as that seems to be the only way to fix the problem now...

Anyway, genuine thanks for your help! 🙏😃

chrisjbaik commented 3 years ago

Ah I see. Hope it works! It's really unfortunate because in my ideal world this is exactly the type of problem Docker is supposed to fix :/

Chamberlain0w0 commented 3 years ago

Hello again. I've updated the driver and CUDA to the newest versions:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
```

But the problem is still there. 😭😭 The error reported in logs is the same:

```
Loading GloVE word embeddings...
Loading word embedding from /workspace/syntaxSQL/glove/glove.42B.300d.txt
Using fixed embedding
Traceback (most recent call last):
  File "main.py", line 258, in <module>
    main()
  File "main.py", line 169, in main
    config.get('syntaxsql', 'glove_path'), args.toy)
  File "main.py", line 92, in load_model
    table_type='std', use_hs=True)
  File "/workspace/syntaxSQL/supermodel.py", line 104, in __init__
    self.multi_sql = MultiSqlPredictor(N_word=N_word, N_h=N_h, N_depth=N_depth, gpu=gpu, use_hs=use_hs)
  File "/workspace/syntaxSQL/models/multisql_predictor.py", line 46, in __init__
    self.cuda()
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 113, in _apply
    self.flatten_parameters()
  File "/opt/conda/lib/python2.7/site-packages/torch/nn/modules/rnn.py", line 106, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS
```

chrisjbaik commented 3 years ago

Hmmmm. Again, no easy answer for this: I had been using an older version of PyTorch in order to run the models/code trained for the original SyntaxSQLNet, which was written in Python 2. Googling pulls this up, which isn't super encouraging: https://discuss.pytorch.org/t/runtimeerror-cudnn-error-cudnn-status-success/28045/18

One thing to try is building a new Docker container for dq-enum. The Dockerfile lives in a git submodule (at enum/syntaxSQL) forked from SyntaxSQLNet: https://github.com/chrisjbaik/syntaxSQL/blob/e89832bdb621fc522be14250504c22d869c9cc1a/Dockerfile

Note that the top of the Dockerfile lists two additional files you should copy into your submodule directory before building: the GloVe pre-trained embeddings and the saved SyntaxSQLNet models, both linked in the submodule's README.md.

My suggestion would be to try a different PyTorch Docker image as the base in the Dockerfile (instead of `FROM vanessa/pytorch-dev:py2`), and/or to install a different version of the CUDA toolkit in the Dockerfile, and see if you have any luck :/
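If you go the rebuild route, the change might be as small as the base-image line in the submodule's Dockerfile. A sketch under stated assumptions: the replacement tag below is one hypothetical Python 2 + PyTorch image, not a tested recommendation, so verify it exists and that its CUDA version matches your driver before using it.

```dockerfile
# enum/syntaxSQL/Dockerfile (sketch: only the base image line changes)
#
# Original base image:
#   FROM vanessa/pytorch-dev:py2
#
# Hypothetical replacement: a Python 2 PyTorch image whose CUDA version
# matches the host driver (tag is illustrative, not verified):
FROM pytorch/pytorch:0.4.1-cuda9-cudnn7-devel

# ...the rest of the original Dockerfile stays the same, including the
# COPY steps for the GloVe embeddings and the saved SyntaxSQLNet models
# mentioned at its top.
```

After editing, rebuild the image and rerun the QuickStart steps so dq-main picks up the new dq-enum container.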