Attempt to free invalid pointer 0x7ffe44b66698

hemingkx / ChineseNMT

ChineseNMT: Translate English to Chinese with PyTorch Implementation of Transformer


Closed HankerWu closed 3 years ago

HankerWu commented 3 years ago

During training, I first ran into this problem:

Error in `python': malloc(): memory corruption (fast): 0x00007ffe8a716f58

I found this fix online:

apt-get update
apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

After applying it, the new problem described in the title appeared:

src/tcmalloc.cc:278] Attempt to free invalid pointer 0x7ffe44b66698
/pai_bootstrap.sh: line 259: 342 Aborted (core dumped) python main.py

I also tried the solution from the blog post https://blog.csdn.net/qq_22194315/article/details/79673471, but the same problem still occurs.
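
For readers debugging a similar LD_PRELOAD setup, one quick sanity check is to look at what the variable requests and which allocator libraries are actually mapped into the Python process. The sketch below is illustrative and not from the thread: `preload_report` is a hypothetical helper, it assumes a Linux system that exposes `/proc/self/maps`, and it only looks for library names containing `tcmalloc` or `jemalloc`.

```python
import os
from pathlib import Path


def preload_report(maps_path="/proc/self/maps"):
    """Summarize LD_PRELOAD and the allocator libraries mapped into this process.

    Hypothetical helper for illustration; assumes the Linux /proc layout.
    """
    # LD_PRELOAD entries may be separated by colons or spaces.
    raw = os.environ.get("LD_PRELOAD", "")
    requested = [p for p in raw.replace(":", " ").split() if p]
    # Requested paths that do not exist on disk cannot be preloaded.
    missing = [p for p in requested if not Path(p).is_file()]

    loaded = set()
    try:
        with open(maps_path) as maps:
            for line in maps:
                # Format: address perms offset dev inode [pathname]
                fields = line.split(None, 5)
                if len(fields) < 6:
                    continue
                path = fields[5].strip()
                name = os.path.basename(path)
                if "tcmalloc" in name or "jemalloc" in name:
                    loaded.add(path)
    except OSError:
        # No /proc on this system; leave "loaded" empty.
        pass

    return {"requested": requested, "missing": missing, "loaded": sorted(loaded)}


if __name__ == "__main__":
    print(preload_report())
```

Called from inside the training process, this would show whether tcmalloc was really preloaded. It does not by itself explain the invalid-free abort.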

hemingkx commented 3 years ago

I haven't run into this problem. Could you attach the complete training log, including the error messages? 😊

HankerWu commented 3 years ago

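That is all the error output there is for the new problem. The point where it fails also varies from run to run; all I can say for sure is that it always happens during train, never during eval.

As for the original problem, its error output is quite long and mostly consists of memory-address traces, which I can't really make sense of. Since it is so long, I put it in this file: error_log.txt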
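
When an abort like this leaves no Python traceback and the failing location moves around, Python's standard `faulthandler` module may help narrow it down: it can dump the Python stack of every thread when the process receives a fatal signal such as SIGABRT or SIGSEGV. The following is a minimal sketch, not something tried in this thread; `enable_crash_tracebacks` and the log file name are hypothetical, and the call would need to be placed near the top of the training script.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Dump Python tracebacks of all threads if a fatal signal is received.

    Returns the stream in use; it must stay open while faulthandler is enabled.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return stream


if __name__ == "__main__":
    # Hypothetical log file name, chosen for illustration only.
    crash_log = enable_crash_tracebacks("crash_traceback.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without code changes by running `python -X faulthandler main.py` or by setting the `PYTHONFAULTHANDLER` environment variable. The traceback only shows which Python line was executing when the signal arrived, so it points toward the native call involved rather than the root cause.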

hemingkx commented 3 years ago

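Are you on Linux? Could this be a problem with pytorch-gpu itself? Have you checked that pytorch-gpu is installed correctly? You can test it with the following commands: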

import torch 
torch.cuda.is_available()
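
`torch.cuda.is_available()` only confirms that a CUDA device is visible. A slightly broader check, sketched below, also reports the installed versions and runs one small computation on the GPU. `cuda_smoke_test` is a hypothetical helper, not part of ChineseNMT, and the 256x256 matrix size is arbitrary.

```python
import importlib.util


def cuda_smoke_test():
    """Return a small report on whether PyTorch can actually run work on the GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # Run a tiny matrix multiply on the GPU and wait for it to finish.
        x = torch.randn(256, 256, device="cuda")
        y = x @ x
        torch.cuda.synchronize()
        report["matmul_finite"] = bool(torch.isfinite(y).all().item())
    return report


if __name__ == "__main__":
    for key, value in cuda_smoke_test().items():
        print(f"{key}: {value}")
```

A clean report here rules out a grossly broken install, but it would not catch intermittent memory corruption of the kind described in this issue.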

HankerWu commented 3 years ago

The GPU is fine. Sometimes training runs normally for a few epochs, but often it cannot even get through a single epoch.

hemingkx commented 3 years ago

Are you using Anaconda to manage your Linux environment? Try recreating a virtual environment and running it again.