rzhangpku / MFAE

Source code for SDM 2020 paper "What Do Questions Exactly Ask? MFAE: Duplicate Question Identification with Multi-Fusion Asking Emphasis"

TypeError("len() of a 0-d tensor") #2

Closed · withchencheng closed this issue 4 years ago

withchencheng commented 4 years ago

Because the network connection in China is poor, https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing cannot be downloaded. I downloaded the 404301 train pairs from Kaggle myself and randomly split the ~400k labeled examples into new train/dev/test sets: train 370011, dev 20000, test 14290. After preprocessing, running bert_quora.py fails during the first validation on dev. My dev set is attached: dev.tsv.zip

====================  Preparing for training  ====================
    * Loading training data...
    * Loading validation data...
    * Loading test data...
    * Building model...
/data/cc/opt/anaconda3/lib/python3.7/site-packages/bert_serving/client/__init__.py:299: UserWarning: some of your sentences have more tokens than "max_seq_len=25" set on the server, as consequence you may get less-accurate or truncated embeddings.
here is what you can do:
- disable the length-check by create a new "BertClient(check_length=False)" when you do not want to display this warning
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)
/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
Traceback (most recent call last):
  File "/data/cc/pycharm/MFAE/utils_bert.py", line 127, in validate
    logits, probs = model(premises, hypotheses)
  File "/data/cc/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/cc/pycharm/MFAE/mfae/model_bert.py", line 102, in forward
    encoded_premises = self._encoding(premises, premises_lengths)
  File "/data/cc/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/cc/pycharm/MFAE/mfae/layers_new.py", line 214, in forward
    sort_by_seq_lens(sequences_batch, sequences_lengths)
  File "/data/cc/pycharm/MFAE/mfae/utils.py", line 41, in sort_by_seq_lens
    idx_range = torch.arange(0, len(sequences_lengths)).to(sequences_lengths.device)
  File "/data/cc/opt/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 451, in __len__
    raise TypeError("len() of a 0-d tensor")
TypeError: len() of a 0-d tensor

Process finished with exit code 1

Could you please share the original quora train/dev/test data you used? It may not be a data problem, though: the code already goes wrong at premises_lengths = premises_mask.sum(dim=-1).long() in model_bert.py, where debugging shows premises_lengths.shape == torch.Size([]).

rzhangpku commented 4 years ago

Link: https://pan.baidu.com/s/1GYlmDRzame1liGZLKOV1kA password: enej. This is the quora train/dev/test data. Try this data first and see whether it resolves the error.

withchencheng commented 4 years ago

Running bert_quora.py still fails with the same error, during the first validation on dev.

Also, the preprocessing script filename in your README does not match the actual file [https://github.com/rzhangpku/MFAE#preprocess-the-data-by-bert]: process_quora_bert.py -> preprocess_quora_bert.py

rzhangpku commented 4 years ago

Are you using https://github.com/hanxiao/bert-as-service ?

withchencheng commented 4 years ago

Yes, and the service was started successfully.

withchencheng commented 4 years ago

Are you sure this code runs correctly on your own machine?

rzhangpku commented 4 years ago

Preprocess the data by BERT

cd scripts/preprocessing
python process_quora_bert.py

Did you also run the data preprocessing step?

withchencheng commented 4 years ago

Yes, I did... I suspect there is a problem somewhere around premises_lengths = premises_mask.sum(dim=-1).long() in model_bert.py.

withchencheng commented 4 years ago

The premises_lengths produced here is not the list of per-sentence lengths described in the code comment.

Mrzhouqifei commented 4 years ago

> (quotes the issue description above in full)

This empty-shape error is almost certainly a problem with the input data; print your data before it goes in and check whether it looks normal. Also, does training itself run normally?
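
For example, a quick check (a debugging sketch, not repo code; bc and batch are the variables already in scope) that could go just before the model call in the validate function of utils_bert.py:

    # bc.encode returns a numpy ndarray; inspect its shape before it reaches the model.
    encoded = bc.encode(batch["premises"][batch_index])
    print(type(encoded), encoded.shape)
    # healthy: (batch_size, sequence_length, 768), token-level embeddings
    # broken:  (batch_size, 768), pooled embeddings, so per-sentence lengths are lost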

withchencheng commented 4 years ago

Before training, running validate on dev/test produces the same error. During training, the same error appears again in the forward function of model_bert.py.

Inspecting the forward function of model_bert.py, premises_lengths and hypotheses_lengths are both scalars with value 64, computed as

    def forward(self, premises, hypotheses):
        premises_mask = (torch.sum(premises, dim=-1) != 0).float()
        premises_lengths = premises_mask.sum(dim=-1).long()

(the hypotheses code is identical to the premises code, so only one is shown). This does not match the code comment "premises_lengths: A 1D tensor containing the lengths of the premises in 'premises'". In the forward function of model_bert.py, premises and hypotheses both have shape torch.Size([64, 768]) (this looks like the sentence-embedding output of bert-as-service; was the intent word indices?). Going one level up the stack to the validate function in utils_bert.py, premises and hypotheses are computed as

    premises = torch.tensor(bc.encode(batch["premises"][batch_index])).to(device)  # torch.Size([64, 768])
    logits, probs = model(premises, hypotheses)  # passed to forward

In this line,

    batch_index = 0
    batch['premises'] = {dict: 157}  # looks like a mapping from premise index to premise sentence

So what does the code comment "premises_lengths: A 1D tensor containing the lengths of the premises in 'premises'" mean? Is it a list of the number of words in each premise? Then it should not be the uniform 768 returned by BertClient. What is the intended computation flow for premises_lengths?
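
A minimal standalone sketch (my own reproduction, not code from the repo) of how a (64, 768) input collapses into the 0-d tensor from the traceback:

    import torch

    # Simulate what bert-as-service returns under its default pooling strategy:
    # one pooled 768-d vector per sentence, i.e. shape (batch_size, 768).
    premises = torch.randn(64, 768)

    premises_mask = (torch.sum(premises, dim=-1) != 0).float()  # shape (64,)
    premises_lengths = premises_mask.sum(dim=-1).long()         # 0-d scalar (value 64 here)

    print(premises_lengths.shape)  # torch.Size([])
    len(premises_lengths)          # raises TypeError: len() of a 0-d tensor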

Mrzhouqifei commented 4 years ago

> (quotes the previous comment in full)

BertClient pads all sentences in a batch to the same length (padding with zeros) by default, so we use premises_lengths to recover the original length of each sentence. The code is quite old, so some details may no longer match. The shape here should normally be [batch_size, sequence_length, bert_embedding_dim]; your BertService output seems to be missing the sequence_length dimension? With one dimension missing, the result is no longer the 1D array described. If the problem persists, I will check the old code tonight or over the weekend.
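
A hedged sketch of that intended flow, assuming the server runs with -pooling_strategy NONE so that bc.encode returns token-level output of shape (batch_size, sequence_length, 768):

    import torch

    # Token-level output zero-padded per sentence, as bert-as-service produces
    # with -pooling_strategy NONE: shape (batch_size, sequence_length, 768).
    premises = torch.randn(64, 25, 768)
    premises[0, 10:] = 0.0  # pretend sentence 0 has 10 real tokens, the rest is padding

    # A position holds a real token iff its embedding vector is non-zero.
    premises_mask = (torch.sum(premises, dim=-1) != 0).float()  # (64, 25)
    premises_lengths = premises_mask.sum(dim=-1).long()         # (64,), premises_lengths[0] == 10

    print(premises_lengths.shape)  # torch.Size([64]), the 1D tensor the code comment describes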

Mrzhouqifei commented 4 years ago

> (quotes the preceding comment and reply in full)

Intuitively, could it be that what you are feeding into BertClient is not a sentence? Or that some parameter used when starting BertClient makes it return something other than a sentence? We feed an entire sentence into BertClient and it returns the entire sentence.

withchencheng commented 4 years ago

In the validate function of utils_bert.py:

    batch["premises"][batch_index]  # a list of 64 premise strings, without tokenization. [Is this the right input format for BertClient?]
    bc.encode(batch["premises"][batch_index])  # ndarray: (64, 768), missing the token-length information

Version info: bert-serving-client==1.10.0, bert-serving-server==1.10.0

withchencheng commented 4 years ago

> Intuitively, could it be that what you are feeding into BertClient is not a sentence? Or that some parameter used when starting BertClient makes it return something other than a sentence? We feed an entire sentence into BertClient and it returns the entire sentence.

The argument your original code passes to BertClient is

    batch["premises"][batch_index]  # a list of 64 premise strings, without tokenization. [Is this the right input format for BertClient?]

When starting the BERT server I only changed the port number. Could you confirm that the input to BertClient is a batch of raw premises, i.e. a list of lowercased question texts?

Mrzhouqifei commented 4 years ago

> (quotes the previous comment in full)

1) I ran the code again and it still works fine on my side. Try starting the BERT server with the following arguments:

    bert-serving-start -pooling_strategy NONE -model_dir /xxx/xxx/Bert/cased_L-12_H-768_A-12/ -max_seq_len NONE -gpu_memory_fraction 0.4

2) In quora_training_bert.json, change "embedding_size": 768 to match your BERT model (768 or 1024).
3) In config/preprocessing/quora_preprocessing.json, check the paths to your data files.

After running our preprocess_quora_bert.py, the data format should work without any additional changes.
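
A quick way to check the server configuration from the client side (a sketch; the two sample questions are made up):

    from bert_serving.client import BertClient

    bc = BertClient()  # assumes the server above is running on the default port
    vecs = bc.encode(['what is machine learning?', 'how do planes fly?'])
    print(vecs.shape)
    # with -pooling_strategy NONE: (2, max_tokens_in_batch, 768), zero-padded token-level output
    # with the default REDUCE_MEAN: (2, 768), pooled vectors, which triggers the 0-d error above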

withchencheng commented 4 years ago

Thanks! It was probably the BERT server startup arguments. It seems to be running now; let me confirm.

withchencheng commented 4 years ago

It was indeed the bert server startup arguments. The default bert-as-service startup arguments do not work, because they do not return the word-level lengths of the original premises. My startup command is now:

bert-serving-start  -pooling_strategy NONE   -max_seq_len NONE -num_worker=4 -model_dir /data/cc/data/uncased_L-12_H-768_A-12

I suspect the key flags are pooling_strategy and max_seq_len, but I will not verify them one by one.

It is running now, just very slowly: one epoch takes 7~9 h, and there are 64 epochs in total? Waiting for the final result.

System info:

Ubuntu 16.04.6 LTS
Memory: 503GB

GPU snapshot:
Wed Jul 22 20:11:06 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   65C    P0    62W / 250W |  20341MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:86:00.0 Off |                    0 |
| N/A   50C    P0    53W / 250W |    590MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     56019      C   /home/wqy/anaconda3/bin/python               145MiB |
|    0     72345      C   ...ngyu/anaconda3/envs/python37/bin/python 10041MiB |
|    0     76182      C   /opt/omnisci/bin/omnisci_server              145MiB |
|    0    181406      C   ...ngyu/anaconda3/envs/python37/bin/python  1651MiB |
|    0    199869      C   /data/cc/opt/anaconda3/bin/python            145MiB | (my bert service)
|    0    199877      C   /data/cc/opt/anaconda3/bin/python            145MiB | (my bert service)
|    0    200656      C   python                                      8057MiB | (my python bert_quora.py)
|    1     56019      C   /home/wqy/anaconda3/bin/python               145MiB |
|    1     76182      C   /opt/omnisci/bin/omnisci_server              145MiB |
|    1    199866      C   /data/cc/opt/anaconda3/bin/python            145MiB | (my bert service)
|    1    199873      C   /data/cc/opt/anaconda3/bin/python            145MiB | (my bert service)
+-----------------------------------------------------------------------------+

Current program output:

====================  Preparing for training  ====================
    * Loading training data...
    * Loading validation data...
    * Loading test data...
    * Building model...
/data/cc/opt/anaconda3/lib/python3.7/site-packages/bert_serving/client/__init__.py:290: UserWarning: server does not put a restriction on "max_seq_len", it will determine "max_seq_len" dynamically according to the sequences in the batch. you can restrict the sequence length on the client side for better efficiency
  warnings.warn('server does not put a restriction on "max_seq_len", '
/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
    * Validation loss before training: 0.6934, accuracy: 50.0000%
    * test loss before training: 0.6935, accuracy: 50.0000%

 ==================== Training ESIM model on device: cuda:0 ====================
* Training epoch 1:
Avg. batch proc. time: 5.2738s, loss: 0.3885: 100%|██████████| 6006/6006 [8:48:14<00:00,  5.28s/it]   
-> Training time: 31694.2238s, loss = 0.3885, accuracy: 81.6432%
* Validation for epoch 1:
-> Valid. time: 711.8302s, loss: 0.3473, accuracy: 85.1000%

* Test for epoch 1:
-> Test. time: 692.6380s, loss: 0.3592, accuracy: 83.8300%

* Training epoch 2:
Avg. batch proc. time: 4.7860s, loss: 0.2966:  11%|█▏        | 679/6006 [54:11<6:28:52,  4.38s/it]

One iteration takes about 5 s, one epoch 7~9 h, and there are 64 epochs in total? Do I have to wait 20+ days to reach the best result? Is it supposed to be that slow?

Mrzhouqifei commented 4 years ago

> (quotes the previous comment in full)

Are you using the GPU? With a GPU, one epoch takes me only a few tens of minutes, and you do not need to run all 64 epochs: it basically converges within 10 to 20 at most.

withchencheng commented 4 years ago

I used a Tesla P40; the GPU monitor output is pasted above, process id 200656... Which GPU did you use?

Mrzhouqifei commented 4 years ago

> I used a Tesla P40; the GPU monitor output is pasted above, process id 200656... Which GPU did you use?

One epoch takes about 35 minutes here; the GPU is a 2080Ti.

====================  Preparing for training  ====================
        * Loading training data...
        * Loading validation data...
        * Loading test data...
        * Building model...
/home/qifeiz/anaconda3/lib/python3.7/site-packages/bert_serving/client/__init__.py:290: UserWarning: server does not put a restriction on "max_seq_len", it will determine "max_seq_len" dynamically according to the sequences in the batch. you can restrict the sequence length on the client side for better efficiency
  warnings.warn('server does not put a restriction on "max_seq_len", '
        * Validation loss before training: 0.6932, accuracy: 50.0000%
        * test loss before training: 0.6932, accuracy: 50.0000%

 ==================== Training MFAE model on device: cuda:1 ====================
* Training epoch 1:
Avg. batch proc. time: 0.3913s, loss: 0.5037:  15%|█▎       | 871/6006 [05:41<30:55,  2.77it/s]
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 68%   75C    P2   144W / 250W |   4449MiB / 11019MiB |     47%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:04:00.0 Off |                  N/A |
| 68%   64C    P2    72W / 250W |   3528MiB / 11019MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     32478      C   /home/qifeiz/anaconda3/bin/python           4437MiB |
|    1     13391      G   /usr/lib/xorg/Xorg                            16MiB |
|    1     32604      C   python                                      3499MiB |
+-----------------------------------------------------------------------------+

I notice that your bert server seems to occupy very little GPU memory, only 145 MiB; that is probably what is limiting your program's speed. Mine occupies over 4000 MiB.