linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License
230 stars 34 forks

"lmdb.LockError: mdb_txn_begin" when using network file system. #21

Closed becxer closed 3 years ago

becxer commented 3 years ago

We are trying to reproduce the results with the same settings, including the number of GPUs. However, we are struggling with the Horovod setup. While loading the dataset from LMDB, the following error appears again and again: "lmdb.LockError: mdb_txn_begin"

We searched and found the following StackOverflow answer, which points out that LMDB does not work well on a network file system (NFS).

https://stackoverflow.com/questions/61365680/lmdb-error-lmdb-lockerror-mdb-txn-begin-resource-temporarily-unavaliable

So, my question is: how did the authors train the model on 16 GPUs (or more) with NFS? If they did not use NFS, I am also curious how the authors trained with Horovod on a non-network file system.

Or is there an alternative way to solve this problem?
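One possible alternative, sketched here under assumptions rather than taken from the HERO code base, is to copy each LMDB directory from the shared file system to node-local disk before training and open the local copy instead. The helper name `stage_lmdb_locally` and its arguments are hypothetical; the sketch uses only the Python standard library and skips `lock.mdb`, since LMDB creates a fresh lock file when the copy is opened.

```python
import shutil
from pathlib import Path


def stage_lmdb_locally(src_dir, local_root):
    """Copy an LMDB directory from shared storage to node-local disk.

    Hypothetical helper: `src_dir` is the LMDB directory on NFS and
    `local_root` is a scratch directory on a local file system.
    Returns the path of the local copy as a string.
    """
    src = Path(src_dir)
    dst = Path(local_root) / src.name
    dst.mkdir(parents=True, exist_ok=True)
    for item in src.iterdir():
        # lock.mdb only holds lock/reader state, so it is not copied.
        if item.name == "lock.mdb" or not item.is_file():
            continue
        target = dst / item.name
        # Skip files already copied in full by an earlier run.
        if target.exists() and target.stat().st_size == item.stat().st_size:
            continue
        shutil.copy2(item, target)
    return str(dst)
```

In a multi-process job this would be called by one process per node (for example the one whose Horovod local rank is 0) while the other processes on that node wait for it to finish, and the returned path would then be passed to whatever code opens the database. This assumes each node has enough local disk space for the data.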

linjieli222 commented 3 years ago

Hi there,

Thanks for your interest in our project. I am afraid this is a hard question to answer given my limited knowledge of NFS. Our experiments were mostly run on Azure VMs with Premium SSD Managed Disks. The large pre-training experiment on HowTo100M was run on a DGX-2 machine with 16 x 32GB V100 GPUs. All the released data is compatible with the experiments performed on these machines, and we did not observe this error.

Are you observing the error only for pre-training, or for both pre-training and finetuning?

becxer commented 3 years ago

Thanks for your quick response. 👍

We are facing those errors in both pre-training and finetuning. I guess in my case it may be better to move from LMDB to a plain JSON format for use with Horovod. :(

Unified-Robots commented 3 years ago

@becxer I ran into those errors once, and I think the lock.mdb file caused the problem. Have you tried removing the corresponding lock.mdb file?

becxer commented 3 years ago

@Unified-Robots lock.mdb is simply recreated whenever the LMDB is accessed. How many nodes and GPUs are you using? We are currently trying this on 16 GPUs across 16 nodes (1 GPU per node).
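Since the lock file comes back on every access, a workaround that is sometimes suggested for read-only data is to open the environment with locking disabled, so that LMDB should not need `lock.mdb` at all. The sketch below assumes the py-lmdb package and that nothing writes to the database during training; `READONLY_NOLOCK_KWARGS` and `open_lmdb_readonly` are hypothetical names, and this is not confirmed to be how the HERO loaders open their databases.

```python
# Keyword arguments for a read-only, lock-free open (an assumption,
# not taken from the HERO code base).
READONLY_NOLOCK_KWARGS = dict(
    readonly=True,    # never write to the database
    lock=False,       # disable LMDB's own locking; the caller must ensure no writers
    readahead=False,  # turn off OS read-ahead, often useful for random access
    create=False,     # raise an error instead of creating a missing database
)


def open_lmdb_readonly(path):
    """Open an existing LMDB directory for reading without locking."""
    import lmdb  # third-party py-lmdb package, imported lazily
    return lmdb.open(path, **READONLY_NOLOCK_KWARGS)
```

With `lock=False`, LMDB no longer coordinates readers and writers, so this is only reasonable for data that stays unchanged for the whole run.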

Unified-Robots commented 3 years ago

@becxer I'm using 8 GPUs with 8 nodes.

linjieli222 commented 3 years ago

Closed due to inactivity.