Closed becxer closed 3 years ago
We are trying to reproduce the results with the same settings, including the number of GPUs. However, we are struggling with the Horovod setup. While loading the dataset from LMDB, it fails again and again with the following error: "lmdb.LockError: mdb_txn_begin"
Searching around, we found the Stack Overflow answer below, which points out that LMDB does not work well on a network file system (NFS).
https://stackoverflow.com/questions/61365680/lmdb-error-lmdb-lockerror-mdb-txn-begin-resource-temporarily-unavaliable
So, my question is: how did the authors train the model on 16 GPUs (or more) with NFS? If NFS was not used, I am also curious how the authors ran Horovod on a non-network file system. Or is there an alternative way to solve this problem?
Hi there,
Thanks for your interest in our project. I am afraid this is a hard question to answer given my limited knowledge of NFS. Our experiments are mostly done on Azure VMs with Premium SSD managed disks. The large pre-training experiment on HowTo100M was done on a DGX-2 machine with 16 x 32GB V100 GPUs. All the released data are compatible with the experiments performed on these machines, and we did not observe the same error.
Are you only observing the error during pre-training, or during both pre-training and finetuning?
Thanks for your quick response. 👍
We are facing these errors in both pre-training and finetuning. I guess in our case it may be better to move from LMDB to a plain JSON format when using Horovod. :(
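Before switching formats, we may first try opening the LMDB environment read-only with locking disabled, and deferring the open until inside each dataloader worker, since our loaders never write. Disabling the flock()-based lock is what is often suggested for NFS mounts, though it is only safe for read-only access. A rough sketch of what we have in mind (the path, keys, and dataset class are placeholders, not this repo's actual code):

```python
import lmdb
import pickle
from torch.utils.data import Dataset

class LmdbFeatureDataset(Dataset):
    """Placeholder dataset that opens the LMDB env lazily in each worker."""

    def __init__(self, db_path, keys):
        self.db_path = db_path  # e.g. "data/feats.lmdb" (hypothetical path)
        self.keys = keys        # list of bytes keys
        self.env = None         # do NOT open here: envs must not cross fork()

    def _init_env(self):
        # readonly=True + lock=False skips flock(), which fails on many NFS mounts
        self.env = lmdb.open(
            self.db_path,
            readonly=True,
            lock=False,
            readahead=False,
            meminit=False,
        )

    def __getitem__(self, idx):
        if self.env is None:
            self._init_env()  # first access inside this worker process
        with self.env.begin(write=False) as txn:
            return pickle.loads(txn.get(self.keys[idx]))

    def __len__(self):
        return len(self.keys)
```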
@becxer I faced those errors once, and I think the lock.mdb file caused the problem. Have you tried removing the corresponding lock.mdb file?
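For example, something like this before launching training (the path is just an example; point it at wherever your .lmdb directory lives):

```python
import os

# Example path; adjust to the actual LMDB directory used by the dataloader.
lock_file = "data/feats.lmdb/lock.mdb"
if os.path.exists(lock_file):
    os.remove(lock_file)
```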
@Unified-Robots lock.mdb is just recreated when the LMDB is accessed again. How many nodes and GPUs are you using? Currently we are trying this with 16 GPUs on 16 nodes (1 GPU per node).
@becxer I'm using 8 GPUs on 8 nodes.
Closed due to inactivity.