gqqnbig opened 2 years ago
- Lustre has poor sequential-write performance.
- Replica 2 suffers from split-brain, but we don't have the storage to offer replica 3.
- GlusterFS supports quotas: https://docs.gluster.org/en/v3/Administrator%20Guide/Directory%20Quota/
On aloha:
$ dd if=/dev/zero of=/home/qiqig/shared/speed bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 9.32971 s, 89.9 MB/s
$ dd if=/dev/zero of=/home/qiqig/speed bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 1.32384 s, 634 MB/s
$ dd if=/dev/zero of=/home/shared-la/qiqig/a bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 12.612 s, 66.5 MB/s
On aloha, the local disk writes at 634 MB/s, ~/shared (NFS) writes at 89.9 MB/s, and la (GlusterFS, distributed) writes at 66.5 MB/s.
7z x diffs-java.7z 274.88s user 588.02s system 7% cpu 3:04:31.67 total
Initialized empty Git repository in /home/shared-la/qiqig/diffs-java/.git/
git add .
git add . 1118.73s user 895.48s system 7% cpu 6:59:53.93 total
git commit
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
[master (root-commit) 4add7c7] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 377.36s user 534.94s system 4% cpu 5:08:43.38 total
"GlusterFS is terrible at large numbers of small files. " https://serverfault.com/questions/627139/why-is-glusterfs-so-slow-here
Tuned settings turned out worse than the default:
7z x diffs-java.7z 277.38s user 545.36s system 6% cpu 3:18:00.36 total
Initialized empty Git repository in /home/shared-la/qiqig/diffs-java/.git/
git add .
git add . 1124.03s user 870.04s system 5% cpu 9:28:26.82 total
git commit
[master (root-commit) d1b580e] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 370.67s user 496.13s system 3% cpu 6:55:30.37 total
Reset all settings back to default:
sudo gluster volume reset gv0 all
7z x diffs-java.7z 182.49s user 514.84s system 17% cpu 1:04:46.74 total
Initialized empty Git repository in /home/shared/qiqig/diffs-java/.git/
git add .
git add . 865.94s user 658.07s system 24% cpu 1:44:08.67 total
git commit
[master (root-commit) 0041d0b] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 274.19s user 411.37s system 25% cpu 44:43.75 total
7z x diffs-java.7z 139.49s user 124.49s system 162% cpu 2:42.53 total
Initialized empty Git repository in /home/qiqig/diffs-java/.git/
git add .
git add . 770.70s user 221.56s system 61% cpu 26:46.70 total
git commit
[master (root-commit) 310761d] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 219.06s user 138.74s system 47% cpu 12:41.26 total
BeeGFS (https://www.beegfs.io) has quota enforcement, but it's a commercial feature.
The architecture of EOS (https://eos-docs.web.cern.ch/quickstart/docker_image.html) is complicated, in my opinion: it requires storage servers (FST), one namespace server (MGM), and one message broker (MQ).
I was invited to test several dataloader patterns common in machine learning. Overall, small-file read performance is somewhat worse, but it can be improved by adding data-loading workers. The Test 3 result is strange, though.

- Test 1 CIFAR100 loader: iterate over all data with torchvision's CIFAR100, bs=256
- Test 2 CIFAR100 batch write: on top of Test 1, write each batch to a file with torch.save, single-threaded
- Test 3 CIFAR100 batch read: read the data written by Test 2 with torch.load, single-threaded
- Test 4 Img file dataloader: an image-folder dataset using few-shot-setting mini-imagenet, 100 batches
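The test code itself was not posted; the following is a minimal sketch of what Tests 2 and 3 describe (the output directory and CIFAR100 root are assumptions):

```python
import os
import time
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

out_dir = '/home/shared-la/qiqig/batches'  # assumed output path
os.makedirs(out_dir, exist_ok=True)

dataset = datasets.CIFAR100(root='./data', train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=256, num_workers=1)

# Test 2: write every batch to its own file with torch.save (single-threaded).
start = time.time()
for i, batch in enumerate(loader):
    torch.save(batch, f'{out_dir}/batch_{i}.pt')
n_batches = i + 1
print(f'batch write: {time.time() - start:.2f}s')

# Test 3: read the batches back with torch.load (single-threaded).
start = time.time()
for i in range(n_batches):
    torch.load(f'{out_dir}/batch_{i}.pt')
print(f'batch read: {time.time() - start:.2f}s')
```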
item | shared | la |
---|---|---|
Test 1 CIFAR100 loader 1 worker | 16.13s | 16s |
Test 1 CIFAR100 loader 4 worker | 6.6s | 6.6s |
Test 2 CIFAR100 batch write | 10.5s | 14.2s |
Test 3 CIFAR100 batch read | 1.15s | 10.7s |
Test 4 Img file dataloader 1 worker | 64.4s | 96.0s |
Test 4 Img file dataloader 4 worker | 19.5s | 30.5s |
Test 4 Img file dataloader 6 worker | 15.6s | 24.2s |
Unexpected result: Test 3 CIFAR100 batch read
The first read after writing takes very long; from the second read onward it drops back to a normal level.
First read
sh 1.3923368453979492s
la 10.463183164596558s
Second read
sh 0.995013952255249s
la 0.8016483783721924s
Third read
sh 0.7495768070220947s
la 0.7286818027496338s
Rewrote different files with different file names
First read
sh 1.4375836849212646s
la 10.94768238067627s
Second read
sh 1.0324783325195312s
la 0.7871301174163818s
Third read
sh 0.7590959072113037s
la 0.7243919372558594s
Addendum:
The tests were done on aloha, because aha has logins disabled and the other nodes do not have the shared-la directory.
Although adding data-loading workers reduces read time, it uses more CPUs, which may lead users to try to claim more CPUs.
For example, on tatooine (20 CPUs), if the first two applicants each take 1 GPU + 8 CPUs, later users will not have enough CPUs.
aloha has 6 CPUs. I tried using more dataloader workers than there are CPUs and got better performance on Test 4 (a sketch follows the table below).
item | shared | la |
---|---|---|
Test 4 img file dataloader 6 worker | 15.6s | 24.2s |
Test 4 img file dataloader 8 worker | 14.7s | 21.8s |
Test 4 img file dataloader 10 worker | 14.7s | 19.2s |
Test 4 img file dataloader 12 worker | 14.8s | 19.0s |
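A minimal sketch of this worker sweep, assuming the image-folder layout of Test 4 (the dataset path and image size are assumptions):

```python
import time
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed path to the mini-imagenet image folders used in Test 4.
dataset = datasets.ImageFolder(
    '/home/shared-la/qiqig/mini-imagenet',
    transform=transforms.Compose([transforms.Resize((84, 84)),  # assumed size
                                  transforms.ToTensor()]))

for workers in (6, 8, 10, 12):
    loader = DataLoader(dataset, batch_size=256, num_workers=workers)
    start = time.time()
    for i, _ in enumerate(loader):
        if i == 99:  # stop after 100 batches, as in the original Test 4
            break
    print(f'{workers} workers: {time.time() - start:.1f}s')
```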
Twenty minutes after finishing the tests, I ran Test 3 CIFAR100 batch read again.

item | shared | la |
---|---|---|
Test 3 CIFAR100 batch read, first run | 1.0s | 16.9s |
Test 3 CIFAR100 batch read, second run | 0.77s | 0.76s |
Is everything after the first read hitting a cache? A performance gap of two orders of magnitude is hard to accept for real use. One way to probe the cache hypothesis is sketched below.
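A hypothetical check: time a first read, evict the file from the local page cache with posix_fadvise, and time it again (the file path is an assumption, and the GlusterFS FUSE client keeps its own caches, so this may not evict everything):

```python
import os
import time
import torch

path = '/home/shared-la/qiqig/batches/batch_0.pt'  # assumed file from Test 2

def timed_load(p):
    start = time.time()
    torch.load(p)
    return time.time() - start

print(f'first read:  {timed_load(path):.3f}s')

# Ask the kernel to drop this file's pages from the local page cache
# (Linux-only; does not touch server-side or FUSE client caches).
fd = os.open(path, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)

print(f'after evict: {timed_load(path):.3f}s')
```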
Tested with numpy and pillow, on aloha.
The dataset has 300 folders; each folder contains several .npy and .png files, all small files.
Test 1: iterate over all the numpy data; run three times in a row.
np.load()
shared | shared-la |
---|---|
2.468s | 8.725s |
0.910s | 8.606s |
0.903s | 8.636s |
Test 2: additionally use pillow to walk each PNG image's pixel array once; run five times in a row. (A sketch of both tests follows the tables.)
Image.open()
img.load()
shared | shared-la |
---|---|
6.357s | 15.606s |
4.923s | 14.308s |
3.141s | 14.332s |
3.185s | 14.265s |
3.197s | 14.378s |
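A minimal sketch of Tests 1 and 2 above, assuming the 300-folder layout described (the dataset root is an assumption):

```python
import glob
import time
import numpy as np
from PIL import Image

root = '/home/shared-la/qiqig/dataset'  # assumed dataset root (300 folders)

# Test 1: load every .npy file once.
start = time.time()
for path in sorted(glob.glob(f'{root}/*/*.npy')):
    np.load(path)
print(f'np.load pass: {time.time() - start:.3f}s')

# Test 2: additionally open every .png and force a full pixel decode.
start = time.time()
for path in sorted(glob.glob(f'{root}/*/*.png')):
    img = Image.open(path)
    img.load()  # Image.open is lazy; load() reads and decodes the pixels
print(f'pillow pass: {time.time() - start:.3f}s')
```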
No deep-learning test, because I have no complete code or environments.
File system switched to seaweedfs.
Note: seaweedfs compresses files, and through the POSIX FUSE mount point you cannot disable compression. If you write an 80 GB file of all zeros, the storage may grow by only 1 MB. CPU may or may not be wasted trying to compress already-compressed files.
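A rough way to see the compression through the mount point is to compare free space before and after writing zeros (the mount path is an assumption, and statvfs on a FUSE mount only reflects what the server reports):

```python
import os

mount = '/home/shared-la'  # assumed seaweedfs FUSE mount point

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

before = free_bytes(mount)
with open(f'{mount}/qiqig/zeros.bin', 'wb') as f:
    for _ in range(1024):
        f.write(b'\0' * (1024 * 1024))  # 1 GiB of zeros, 1 MiB at a time
after = free_bytes(mount)
print(f'wrote 1 GiB of zeros; free space shrank by {(before - after) / 1e6:.1f} MB')
```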
dd if=/dev/zero of=/home/shared-la/qiqig/a bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 6.08959 s, 138 MB/s
Seaweedfs is easy to set up. It feels fast, but it may throw I/O errors.
Albeit promising, it lacks documentation, popularity, and community support. https://github.com/chrislusf/seaweedfs/discussions?discussions_q=author%3Agqqnbig
It's hard to set up ownership and permissions correctly.
The FUSE mount point has no control over compression.
Verdict: Reject
IPFS is interesting software. It requires a local repository to cache remote files; the local folder grows to 10 GB within a few minutes of installation, though it does run garbage collection.
For the Shine cluster, compute nodes with thin storage may not be able to run the IPFS client, because they don't have tens of gigabytes to spare. Moreover, FUSE support is still experimental.
Lustre runs on Red Hat or CentOS. It's unclear whether building from source is required on Ubuntu, and resources are scarce. Word has it that the Linux kernel has to be patched, but we don't know why the stock kernel doesn't work.
https://wiki.whamcloud.com/display/PUB/Build+Lustre+MASTER+with+Ldiskfs+on+Ubuntu+20.04.1+from+Git
@luoyuqi-lab @wulamao
If you want to take over and keep testing these distributed file systems, please assign this issue to yourselves (and remove me).
In practice we only tested GlusterFS, and I don't think its performance is good enough. What do you think?
If nobody tests further, and you also feel GlusterFS isn't good enough, then given the current disk shortage, bespin's storage will simply be exported over NFS (the current ~/shared is already exported over NFS). If we do that, the new storage will have no high availability.
https://mdpi-res.com/d_attachment/electronics/electronics-10-01471/article_deploy/electronics-10-01471-v2.pdf
Attachment: electronics-10-01471-v2.pdf
https://techcommunity.microsoft.com/t5/azure-global/benchmarking-goodness-comparing-lustre-glusterfs-and-beegfs-on/ba-p/1247881
Attachment: PVFS on Azure Guide.pdf