gqqnbig opened 2 years ago
- Lustre has poor sequential-write performance.
- Replica 2 suffers from split-brain, but we don't have the storage to offer replica 3.
- GlusterFS supports quotas: https://docs.gluster.org/en/v3/Administrator%20Guide/Directory%20Quota/
On aloha:
$ dd if=/dev/zero of=/home/qiqig/shared/speed bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 9.32971 s, 89.9 MB/s
$ dd if=/dev/zero of=/home/qiqig/speed bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 1.32384 s, 634 MB/s
$ dd if=/dev/zero of=/home/shared-la/qiqig/a bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 12.612 s, 66.5 MB/s
On aloha, the local disk writes at 634 MB/s, ~/shared (NFS) writes at 89.9 MB/s, and la (GlusterFS, distributed) writes at 66.5 MB/s.
7z x diffs-java.7z 274.88s user 588.02s system 7% cpu 3:04:31.67 total
Initialized empty Git repository in /home/shared-la/qiqig/diffs-java/.git/
git add .
git add . 1118.73s user 895.48s system 7% cpu 6:59:53.93 total
git commit
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
[master (root-commit) 4add7c7] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 377.36s user 534.94s system 4% cpu 5:08:43.38 total
"GlusterFS is terrible at large numbers of small files. " https://serverfault.com/questions/627139/why-is-glusterfs-so-slow-here
Tuned settings turned out worse than the default:
7z x diffs-java.7z 277.38s user 545.36s system 6% cpu 3:18:00.36 total
Initialized empty Git repository in /home/shared-la/qiqig/diffs-java/.git/
git add .
git add . 1124.03s user 870.04s system 5% cpu 9:28:26.82 total
git commit
[master (root-commit) d1b580e] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 370.67s user 496.13s system 3% cpu 6:55:30.37 total
Reset all settings back to default:
sudo gluster volume reset gv0 all
7z x diffs-java.7z 182.49s user 514.84s system 17% cpu 1:04:46.74 total
Initialized empty Git repository in /home/shared/qiqig/diffs-java/.git/
git add .
git add . 865.94s user 658.07s system 24% cpu 1:44:08.67 total
git commit
[master (root-commit) 0041d0b] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 274.19s user 411.37s system 25% cpu 44:43.75 total
7z x diffs-java.7z 139.49s user 124.49s system 162% cpu 2:42.53 total
Initialized empty Git repository in /home/qiqig/diffs-java/.git/
git add .
git add . 770.70s user 221.56s system 61% cpu 26:46.70 total
git commit
[master (root-commit) 310761d] init
1365556 files changed, 730981442 insertions(+)
git commit -m 'init' 219.06s user 138.74s system 47% cpu 12:41.26 total
BeeGFS (https://www.beegfs.io) has quota enforcement, but it's a commercial feature.
The architecture of EOS (https://eos-docs.web.cern.ch/quickstart/docker_image.html) is complicated, in my opinion: it requires storage servers (FST), one namespace server (MGM), and one message broker (MQ).
I was invited to test several dataloader patterns common in machine learning. Overall, small-file read performance is somewhat worse, but it can be improved by adding data-loading workers. The Test 3 result is strange, though.

- Test 1 CIFAR100 loader: iterate over all data with torchvision's CIFAR100, bs=256
- Test 2 CIFAR100 batch write: on top of Test 1, write each batch to a file with torch.save, single-threaded
- Test 3 CIFAR100 batch read: read the data written by Test 2 with torch.load, single-threaded
- Test 4 Img file dataloader: an image-folder dataset using few-shot-setting mini-imagenet, 100 batches
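The test code itself was not posted; the following is a minimal sketch of what Tests 2 and 3 describe (the output directory and CIFAR100 root are assumptions):

```python
import os
import time
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

out_dir = '/home/shared-la/qiqig/batches'  # assumed output path
os.makedirs(out_dir, exist_ok=True)

dataset = datasets.CIFAR100(root='./data', train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=256, num_workers=1)

# Test 2: write every batch to its own file with torch.save (single-threaded).
start = time.time()
for i, batch in enumerate(loader):
    torch.save(batch, f'{out_dir}/batch_{i}.pt')
n_batches = i + 1
print(f'batch write: {time.time() - start:.2f}s')

# Test 3: read the batches back with torch.load (single-threaded).
start = time.time()
for i in range(n_batches):
    torch.load(f'{out_dir}/batch_{i}.pt')
print(f'batch read: {time.time() - start:.2f}s')
```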
item | shared | la |
---|---|---|
Test 1 CIFAR100 loader 1 worker | 16.13s | 16s |
Test 1 CIFAR100 loader 4 worker | 6.6s | 6.6s |
Test 2 CIFAR100 batch write | 10.5s | 14.2s |
Test 3 CIFAR100 batch read | 1.15s | 10.7s |
Test 4 Img file dataloader 1 worker | 64.4s | 96.0s |
Test 4 Img file dataloader 4 worker | 19.5s | 30.5s |
Test 4 Img file dataloader 6 worker | 15.6s | 24.2s |
Unexpected result: Test 3 CIFAR100 batch read
The first read after writing takes very long; from the second read onward it drops back to a normal level.
First read
sh 1.3923368453979492s
la 10.463183164596558s
Second read
sh 0.995013952255249s
la 0.8016483783721924s
Third read
sh 0.7495768070220947s
la 0.7286818027496338s
Rewrote different files with different file names
First read
sh 1.4375836849212646s
la 10.94768238067627s
Second read
sh 1.0324783325195312s
la 0.7871301174163818s
Third read
sh 0.7590959072113037s
la 0.7243919372558594s
Addendum:
The tests were done on aloha, because aha has logins disabled and the other nodes do not have the shared-la directory.
Although adding data-loading workers reduces read time, it uses more CPUs, which may lead users to try to claim more CPUs.
For example, on tatooine (20 CPUs), if the first two applicants each take 1 GPU + 8 CPUs, later users will not have enough CPUs.
aloha has 6 CPUs. I tried using more dataloader workers than there are CPUs and got better performance on Test 4 (a sketch follows the table below).
item | shared | la |
---|---|---|
Test 4 img file dataloader 6 worker | 15.6s | 24.2s |
Test 4 img file dataloader 8 worker | 14.7s | 21.8s |
Test 4 img file dataloader 10 worker | 14.7s | 19.2s |
Test 4 img file dataloader 12 worker | 14.8s | 19.0s |
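A minimal sketch of this worker sweep, assuming the image-folder layout of Test 4 (the dataset path and image size are assumptions):

```python
import time
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed path to the mini-imagenet image folders used in Test 4.
dataset = datasets.ImageFolder(
    '/home/shared-la/qiqig/mini-imagenet',
    transform=transforms.Compose([transforms.Resize((84, 84)),  # assumed size
                                  transforms.ToTensor()]))

for workers in (6, 8, 10, 12):
    loader = DataLoader(dataset, batch_size=256, num_workers=workers)
    start = time.time()
    for i, _ in enumerate(loader):
        if i == 99:  # stop after 100 batches, as in the original Test 4
            break
    print(f'{workers} workers: {time.time() - start:.1f}s')
```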
Twenty minutes after finishing the tests, I ran Test 3 CIFAR100 batch read again.

item | shared | la |
---|---|---|
Test 3 CIFAR100 batch read, first run | 1.0s | 16.9s |
Test 3 CIFAR100 batch read, second run | 0.77s | 0.76s |
Is everything after the first read hitting a cache? A performance gap of two orders of magnitude is hard to accept for real use. One way to probe the cache hypothesis is sketched below.
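A hypothetical check: time a first read, evict the file from the local page cache with posix_fadvise, and time it again (the file path is an assumption, and the GlusterFS FUSE client keeps its own caches, so this may not evict everything):

```python
import os
import time
import torch

path = '/home/shared-la/qiqig/batches/batch_0.pt'  # assumed file from Test 2

def timed_load(p):
    start = time.time()
    torch.load(p)
    return time.time() - start

print(f'first read:  {timed_load(path):.3f}s')

# Ask the kernel to drop this file's pages from the local page cache
# (Linux-only; does not touch server-side or FUSE client caches).
fd = os.open(path, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)

print(f'after evict: {timed_load(path):.3f}s')
```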
Tested with numpy and pillow, on aloha.
The dataset has 300 folders; each folder contains several .npy and .png files, all small files.
Test 1: iterate over all the numpy data; run three times in a row.
np.load()
shared | shared-la |
---|---|
2.468s | 8.725s |
0.910s | 8.606s |
0.903s | 8.636s |
Test 2: additionally use pillow to walk each PNG image's pixel array once; run five times in a row. (A sketch of both tests follows the tables.)
Image.open()
img.load()
shared | shared-la |
---|---|
6.357s | 15.606s |
4.923s | 14.308s |
3.141s | 14.332s |
3.185s | 14.265s |
3.197s | 14.378s |
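A minimal sketch of Tests 1 and 2 above, assuming the 300-folder layout described (the dataset root is an assumption):

```python
import glob
import time
import numpy as np
from PIL import Image

root = '/home/shared-la/qiqig/dataset'  # assumed dataset root (300 folders)

# Test 1: load every .npy file once.
start = time.time()
for path in sorted(glob.glob(f'{root}/*/*.npy')):
    np.load(path)
print(f'np.load pass: {time.time() - start:.3f}s')

# Test 2: additionally open every .png and force a full pixel decode.
start = time.time()
for path in sorted(glob.glob(f'{root}/*/*.png')):
    img = Image.open(path)
    img.load()  # Image.open is lazy; load() reads and decodes the pixels
print(f'pillow pass: {time.time() - start:.3f}s')
```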
No deep-learning test, because I have no complete code or environments.
File system switched to seaweedfs.
Note: seaweedfs compresses files, and through the POSIX FUSE mount point you cannot disable compression. If you write an 80 GB file of all zeros, the storage may grow by only 1 MB. CPU may or may not be wasted trying to compress already-compressed files.
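A rough way to see the compression through the mount point is to compare free space before and after writing zeros (the mount path is an assumption, and statvfs on a FUSE mount only reflects what the server reports):

```python
import os

mount = '/home/shared-la'  # assumed seaweedfs FUSE mount point

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

before = free_bytes(mount)
with open(f'{mount}/qiqig/zeros.bin', 'wb') as f:
    for _ in range(1024):
        f.write(b'\0' * (1024 * 1024))  # 1 GiB of zeros, 1 MiB at a time
after = free_bytes(mount)
print(f'wrote 1 GiB of zeros; free space shrank by {(before - after) / 1e6:.1f} MB')
```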
dd if=/dev/zero of=/home/shared-la/qiqig/a bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 6.08959 s, 138 MB/s
Seaweedfs is easy to set up. It feels fast, but it may throw I/O errors.
Albeit promising, it lacks documentation, popularity, and community support. https://github.com/chrislusf/seaweedfs/discussions?discussions_q=author%3Agqqnbig
It's hard to set up ownership and permissions correctly.
The FUSE mount point has no control over compression.
Verdict: Reject
IPFS is interesting software. It requires a local repository to cache remote files; the local folder grows to 10 GB within a few minutes of installation, though it does run garbage collection.
For the Shine cluster, compute nodes with thin storage may not be able to run the IPFS client, because they don't have tens of gigabytes to spare. Moreover, FUSE support is still experimental.
Lustre runs on Red Hat or CentOS. It's unclear whether building from source is required on Ubuntu, and resources are scarce. Word has it that the Linux kernel has to be patched, but we don't know why the stock kernel doesn't work.
https://wiki.whamcloud.com/display/PUB/Build+Lustre+MASTER+with+Ldiskfs+on+Ubuntu+20.04.1+from+Git
@luoyuqi-lab @wulamao
If you want to take over and keep testing these distributed file systems, please assign this issue to yourselves (and remove me).
In practice we only tested GlusterFS, and I don't think its performance is good enough. What do you think?
If nobody tests further, and you also feel GlusterFS isn't good enough, then given the current disk shortage, bespin's storage will simply be exported over NFS (the current ~/shared is already exported over NFS). If we do that, the new storage will have no high availability.
https://mdpi-res.com/d_attachment/electronics/electronics-10-01471/article_deploy/electronics-10-01471-v2.pdf
Attachment: electronics-10-01471-v2.pdf
https://techcommunity.microsoft.com/t5/azure-global/benchmarking-goodness-comparing-lustre-glusterfs-and-beegfs-on/ba-p/1247881
Attachment: PVFS on Azure Guide.pdf