Closed withchencheng closed 4 years ago
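Link: https://pan.baidu.com/s/1GYlmDRzame1liGZLKOV1kA Password: enej. This is the Quora train/dev/test data. Please try this data first and see whether it resolves the error.
Running bert_quora.py still raises the same error, during the first validation pass on dev.
Also, the preprocessing script name in your README does not match the actual file name [https://github.com/rzhangpku/MFAE#preprocess-the-data-by-bert]: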
process_quora_bert.py
-> preprocess_quora_bert.py
Are you using https://github.com/hanxiao/bert-as-service?
Yes, and the service started correctly.
Are you sure this code runs correctly on your side?
Preprocess the data by BERT
cd scripts/preprocessing
python process_quora_bert.py
Did you also run the data preprocessing step?
Yes, I did. I suspect the problem is in model_bert.py, somewhere around this line:
premises_lengths = premises_mask.sum(dim=-1).long()
The premises_lengths produced here is not the list of lengths that the code comment describes.
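Because of poor network access from mainland China, I could not download https://drive.google.com/file/d/0B0PlTAo--BnaQWlsZl9FZ3l1c28/view?usp=sharing. Instead I downloaded the 404,301 training pairs from Kaggle and randomly split those roughly 400k labeled pairs into new train/dev/test sets: train 370,011, dev 20,000, test 14,290. After preprocessing, running bert_quora.py fails during the first validation pass on dev. My dev set is attached: dev.tsv.zip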
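The random split described above could be produced along these lines. This is only a sketch: the function name `split_pairs`, the fixed seed, and the use of row ids in place of the real Kaggle rows are assumptions for illustration, and the thread does not say how the split was actually done.

```python
import random


def split_pairs(rows, dev_size=20000, test_size=14290, seed=0):
    """Shuffle labeled pairs and cut them into train/dev/test lists."""
    rows = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # seeded so the split is reproducible
    dev = rows[:dev_size]
    test = rows[dev_size:dev_size + test_size]
    train = rows[dev_size + test_size:]
    return train, dev, test


# Stand-in for the 404,301 labeled pairs: row ids only, no real data.
train, dev, test = split_pairs(range(404301))
print(len(train), len(dev), len(test))  # 370011 20000 14290
```

The three sizes add up to 404,301, so every labeled pair lands in exactly one split.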
==================== Preparing for training ====================
* Loading training data...
* Loading validation data...
* Loading test data...
* Building model...
/data/cc/opt/anaconda3/lib/python3.7/site-packages/bert_serving/client/__init__.py:299: UserWarning: some of your sentences have more tokens than "max_seq_len=25" set on the server, as consequence you may get less-accurate or truncated embeddings.
here is what you can do:
- disable the length-check by create a new "BertClient(check_length=False)" when you do not want to display this warning
- or, start a new server with a larger "max_seq_len"
'- or, start a new server with a larger "max_seq_len"' % self.length_limit)
/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
Traceback (most recent call last):
  File "/data/cc/pycharm/MFAE/utils_bert.py", line 127, in validate
    logits, probs = model(premises, hypotheses)
  File "/data/cc/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/cc/pycharm/MFAE/mfae/model_bert.py", line 102, in forward
    encoded_premises = self._encoding(premises, premises_lengths)
  File "/data/cc/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/cc/pycharm/MFAE/mfae/layers_new.py", line 214, in forward
    sort_by_seq_lens(sequences_batch, sequences_lengths)
  File "/data/cc/pycharm/MFAE/mfae/utils.py", line 41, in sort_by_seq_lens
    idx_range = torch.arange(0, len(sequences_lengths)).to(sequences_lengths.device)
  File "/data/cc/opt/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 451, in __len__
    raise TypeError("len() of a 0-d tensor")
TypeError: len() of a 0-d tensor
Process finished with exit code 1
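Could you share the original Quora train/dev/test data you used? That said, the data may not be the cause: the code already goes wrong in model_bert.py at
premises_lengths = premises_mask.sum(dim=-1).long()
where the debugger shows that premises_lengths.shape is torch.Size([]).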
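An empty shape like this must come from the input data. You could print your data before feeding it in to check that it looks normal. Also, does training run correctly?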
Before training, validation on both dev and test fails with the same error.
During training, the same error occurs in the forward function of model_bert.py.
Inspecting premises_lengths and hypotheses_lengths inside forward in model_bert.py, both are scalars with value 64. They are computed as follows:
def forward(self, premises, hypotheses):
    premises_mask = (torch.sum(premises, dim=-1) != 0).float()
    premises_lengths = premises_mask.sum(dim=-1).long()
(the code for hypotheses is the same as for premises, so only one is listed)
This contradicts the code comment, which says "premises_lengths: A 1D tensor containing the lengths of the premises in 'premises'".
Inside forward in model_bert.py, premises and hypotheses both have shape torch.Size([64, 768]) (this looks like the sentence embedding vectors returned by the BERT service; was the intent to pass word indices instead?).
Going one level up to the validate function in utils_bert.py, premises and hypotheses are computed as
premises = torch.tensor(bc.encode(batch["premises"][batch_index])).to(device) # torch.Size([64, 768])
logits, probs = model(premises, hypotheses) # passed to forward
On this line,
batch_index=0
batch['premises']={dict:157} # appears to map premise index to premise sentence
So what does the code comment "premises_lengths: A 1D tensor containing the lengths of the premises in 'premises'" mean? Is it a list of the number of words in each premise? If so, it should not be the uniform 768 returned by BertClient. What is the intended way to compute premises_lengths?
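By default BertClient pads every sentence in a batch to the same length (shorter ones are padded with zeros), so we use premises_lengths to recover the original length of each sentence. The code is fairly old, so some detail may no longer line up. The shape here should normally be [batch_size, sequence_length, bert_embedding_dim]. Your BERT service seems to return no sequence_length dimension, so one dimension is missing and the result is no longer the 1D array the comment describes. If the problem persists, I will check the old code for you tonight or over the weekend.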
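To make the missing dimension concrete, here is a dependency-free sketch that uses nested lists in place of tensors and mirrors the two `sum(dim=-1)` steps quoted from forward(). The helper names (`sum_last_dim`, `map_scalars`, `lengths_from_embeddings`) and the tiny 2-dimensional "embeddings" are made up for illustration and are not part of the MFAE code.

```python
def sum_last_dim(x):
    """Mimic tensor.sum(dim=-1) on nested lists by collapsing the innermost axis."""
    if isinstance(x[0], list):
        return [sum_last_dim(row) for row in x]
    return sum(x)


def map_scalars(fn, x):
    """Apply fn to every scalar in a (possibly nested) list, keeping the nesting."""
    if isinstance(x, list):
        return [map_scalars(fn, e) for e in x]
    return fn(x)


def lengths_from_embeddings(embeddings):
    """Mark non-zero rows as 1.0, then count them along the last remaining axis."""
    mask = map_scalars(lambda v: float(v != 0), sum_last_dim(embeddings))
    return map_scalars(int, sum_last_dim(mask))


# Token-level output, shape [batch=2, seq_len=3, dim=2], zero rows are padding.
token_level = [
    [[0.1, 0.2], [0.3, -0.1], [0.0, 0.0]],  # 2 real tokens + 1 pad
    [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0]],   # 1 real token + 2 pads
]
# Pooled output, shape [batch=2, dim=2]: one vector per sentence.
pooled = [[0.1, 0.2], [0.5, 0.5]]

print(lengths_from_embeddings(token_level))  # [2, 1]
print(lengths_from_embeddings(pooled))       # 2
```

With token-level input the result has one length per sentence. With pooled input the same two steps collapse to a single number equal to the batch size, which matches the scalar 64 reported for a [64, 768] input.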
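My intuition is that what you feed to BertClient is not a sentence, or that some parameter used at startup makes it return something other than a sentence. We pass a whole sentence into BertClient and get a whole sentence back.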
In the validate function of utils_bert.py:
batch["premises"][batch_index] # a list of 64 premise strings, not tokenized. [Is this the right input format for BertClient?]
bc.encode(batch["premises"][batch_index]) # ndarray of shape (64, 768), with no token-length information
Version info: bert-serving-client==1.10.0 bert-serving-server==1.10.0
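The argument your original code passes to BertClient is
batch["premises"][batch_index] # a list of 64 premise strings, not tokenized. [Is this the right input format for BertClient?]
When starting BertClient I only changed the port number. Could you confirm that the input to BertClient is a batch of raw premises (a list of lowercased question texts)?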
1) I ran the code and it still works on my side. Try starting the BERT server with the following parameters: bert-serving-start -pooling_strategy NONE -model_dir /xxx/xxx/Bert/cased_L-12_H-768_A-12/ -max_seq_len NONE -gpu_memory_fraction 0.4
2) In quora_training_bert.json, change "embedding_size": 768 to match your BERT model (768 or 1024).
3) In config/preprocessing/quora_preprocessing.json, check the paths to your data files.
After running our preprocess_quora_bert.py, the data should be in a format that runs without any further changes.
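A mismatch like this is easier to spot if the shape is checked before the embeddings reach the model. Below is a minimal sketch of such a guard. The function `require_token_level` is hypothetical and not part of MFAE or bert-as-service; it only assumes that the array returned by `bc.encode(...)` exposes a `.shape`, as the ndarray in this thread does.

```python
def require_token_level(shape, embedding_size=768):
    """Fail early if the encoder output lacks the sequence dimension."""
    shape = tuple(shape)
    if len(shape) != 3:
        raise ValueError(
            f"expected [batch, seq_len, {embedding_size}] but got {shape}; "
            "the server may be pooling tokens into one vector per sentence "
            "(try starting it with -pooling_strategy NONE)"
        )
    if shape[-1] != embedding_size:
        raise ValueError(
            f"last dimension {shape[-1]} does not match embedding_size {embedding_size}"
        )
    return shape


# Token-level output passes (the sequence length 25 here is arbitrary).
print(require_token_level((64, 25, 768)))

# Pooled output, as reported in this thread, is rejected with a hint.
try:
    require_token_level((64, 768))
except ValueError as err:
    print("rejected:", err)
```

In validate this could be called as `require_token_level(encoded.shape)`, where `encoded` is a hypothetical name for the result of `bc.encode(...)`, with `embedding_size` taken from the same value as in quora_training_bert.json.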
Thanks! It was probably the BERT startup parameters. It seems to be running now; I will confirm.
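It was indeed the BERT server startup parameters. bert-as-service does not work here with its default startup parameters, because it does not return the word-length (token) dimension of the original premises. My startup command is now:
bert-serving-start -pooling_strategy NONE -max_seq_len NONE -num_worker=4 -model_dir /data/cc/data/uncased_L-12_H-768_A-12
My guess is that pooling_strategy and max_seq_len are the two that matter, though I will not verify them one by one.
It is running now, but very slowly. One epoch takes 7~9 h, and there are 64 epochs in total? Waiting for the final results.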
System:
Ubuntu 16.04.6 LTS
Memory: 503GB
GPU snapshot:
Wed Jul 22 20:11:06 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P40 Off | 00000000:3B:00.0 Off | 0 |
| N/A 65C P0 62W / 250W | 20341MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P40 Off | 00000000:86:00.0 Off | 0 |
| N/A 50C P0 53W / 250W | 590MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 56019 C /home/wqy/anaconda3/bin/python 145MiB |
| 0 72345 C ...ngyu/anaconda3/envs/python37/bin/python 10041MiB |
| 0 76182 C /opt/omnisci/bin/omnisci_server 145MiB |
| 0 181406 C ...ngyu/anaconda3/envs/python37/bin/python 1651MiB |
**| 0 199869 C /data/cc/opt/anaconda3/bin/python 145MiB |(my BERT service)**
**| 0 199877 C /data/cc/opt/anaconda3/bin/python 145MiB |(my BERT service)**
**| 0 200656 C python 8057MiB |(my python bert_quora.py)**
| 1 56019 C /home/wqy/anaconda3/bin/python 145MiB |
| 1 76182 C /opt/omnisci/bin/omnisci_server 145MiB |
**| 1 199866 C /data/cc/opt/anaconda3/bin/python 145MiB |(my BERT service)
| 1 199873 C /data/cc/opt/anaconda3/bin/python 145MiB |(my BERT service)**
+-----------------------------------------------------------------------------+
Current program output:
==================== Preparing for training ====================
* Loading training data...
* Loading validation data...
* Loading test data...
* Building model...
/data/cc/opt/anaconda3/lib/python3.7/site-packages/bert_serving/client/__init__.py:290: UserWarning: server does not put a restriction on "max_seq_len", it will determine "max_seq_len" dynamically according to the sequences in the batch. you can restrict the sequence length on the client side for better efficiency
warnings.warn('server does not put a restriction on "max_seq_len", '
/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
* Validation loss before training: 0.6934, accuracy: 50.0000%
* test loss before training: 0.6935, accuracy: 50.0000%
==================== Training ESIM model on device: cuda:0 ====================
* Training epoch 1:
Avg. batch proc. time: 5.2738s, loss: 0.3885: 100%|██████████| 6006/6006 [8:48:14<00:00, 5.28s/it]
-> Training time: 31694.2238s, loss = 0.3885, accuracy: 81.6432%
* Validation for epoch 1:
-> Valid. time: 711.8302s, loss: 0.3473, accuracy: 85.1000%
* Test for epoch 1:
-> Test. time: 692.6380s, loss: 0.3592, accuracy: 83.8300%
* Training epoch 2:
Avg. batch proc. time: 4.7860s, loss: 0.2966: 11%|█▏ | 679/6006 [54:11<6:28:52, 4.38s/it]
One iteration takes about 5 s and one epoch takes 7~9 h, with 64 epochs in total? Do I have to wait more than 20 days to get the best result? Is it supposed to be that slow?
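The 20-day estimate can be checked against the epoch 1 timings in the log above. This is a rough projection only: the function name `projected_days` is made up, and it assumes every epoch takes as long as epoch 1, while the epoch 2 progress bar already shows a somewhat faster rate.

```python
def projected_days(train_s, valid_s, test_s, epochs):
    """Total wall-clock days if every epoch costs train + validation + test time."""
    return (train_s + valid_s + test_s) * epochs / 86400


# Epoch 1 timings copied from the log: training, validation, test (seconds).
TRAIN_S, VALID_S, TEST_S = 31694.2238, 711.8302, 692.6380

epoch_hours = (TRAIN_S + VALID_S + TEST_S) / 3600
days = projected_days(TRAIN_S, VALID_S, TEST_S, epochs=64)
print(f"{epoch_hours:.1f} h per epoch, {days:.1f} days for 64 epochs")
# 9.2 h per epoch, 24.5 days for 64 epochs
```

So at the logged speed, 64 full epochs would indeed take more than three weeks.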
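Are you using a GPU? With a GPU, one epoch takes a few tens of minutes on my side, and you do not need to run all 64 epochs; it has basically converged after a dozen or so at most.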
Yes, a Tesla P40. I pasted the GPU monitoring output above; the process ID is 200656. Which GPU model are you using?
On my side one epoch takes about 35 minutes. The GPU is a 2080 Ti.
==================== Preparing for training ====================
* Loading training data...
* Loading validation data...
* Loading test data...
* Building model...
/home/qifeiz/anaconda3/lib/python3.7/site-packages/bert_serving/client/__init__.py:290: UserWarning: server does not put a restriction on "max_seq_len", it will determine "max_seq_len" dynamically according to the sequences in the batch. you can restrict the sequence length on the client side for better efficiency
warnings.warn('server does not put a restriction on "max_seq_len", '
* Validation loss before training: 0.6932, accuracy: 50.0000%
* test loss before training: 0.6932, accuracy: 50.0000%
==================== Training MFAE model on device: cuda:1 ====================
* Training epoch 1:
Avg. batch proc. time: 0.3913s, loss: 0.5037: 15%|█▎ | 871/6006 [05:41<30:55, 2.77it/s]
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:02:00.0 Off | N/A |
| 68% 75C P2 144W / 250W | 4449MiB / 11019MiB | 47% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:04:00.0 Off | N/A |
| 68% 64C P2 72W / 250W | 3528MiB / 11019MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 32478 C /home/qifeiz/anaconda3/bin/python 4437MiB |
| 1 13391 G /usr/lib/xorg/Xorg 16MiB |
| 1 32604 C python 3499MiB |
+-----------------------------------------------------------------------------+
Your BERT server seems to use very little GPU memory, only 145 MB, which is probably what limits your program's speed. On my side it uses over 4000 MB.
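The per-process memory figures compared above can be pulled out of a saved nvidia-smi dump with a small parser. This is a sketch that assumes the process-table layout shown in this thread (GPU, PID, type, process name, usage in MiB); other driver versions may lay the table out differently. The name `parse_processes` and the sample lines, copied from the first GPU snapshot, are for illustration only.

```python
import re

# One row of the "Processes" table: | GPU  PID  Type  Process name  Usage |
ROW = re.compile(r"\|\s*(\d+)\s+(\d+)\s+([CG])\s+(\S+)\s+(\d+)MiB\s*\|")


def parse_processes(lines):
    """Return (gpu, pid, mib) for every line that looks like a process row."""
    found = []
    for line in lines:
        match = ROW.search(line)
        if match:
            gpu, pid, _kind, _name, mib = match.groups()
            found.append((int(gpu), int(pid), int(mib)))
    return found


sample = [
    "| GPU PID Type Process name Usage |",                        # header, skipped
    "| 0 199869 C /data/cc/opt/anaconda3/bin/python 145MiB |",    # BERT service
    "| 0 199877 C /data/cc/opt/anaconda3/bin/python 145MiB |",    # BERT service
    "| 0 200656 C python 8057MiB |",                              # bert_quora.py
]
for gpu, pid, mib in parse_processes(sample):
    print(f"GPU {gpu} PID {pid}: {mib} MiB")
```

Header rows and the per-GPU status rows do not match the pattern, so only process entries are returned.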