mediachain / concat

Mediachain daemons

Capacity test #48

Open parkan opened 7 years ago

parkan commented 7 years ago

Load 50-100MM data blobs + statements, observe performance curves.

Possibly combine with #32 and measure some random reads at each 10MM to make a nice graph?
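
A minimal sketch of what that periodic random-read measurement could look like, with `lookup` as a hypothetical placeholder for whatever client call fetches a single statement or object from the node:

```python
import random
import time

def measure_random_reads(keys, lookup, samples=1000):
    # `lookup` is a hypothetical placeholder for the real client call
    # that fetches one statement/object from the node.
    sample = random.sample(list(keys), min(samples, len(keys)))
    start = time.time()
    for key in sample:
        lookup(key)
    elapsed = time.time() - start
    return len(sample) / elapsed  # reads per second

# e.g. run this after every 10MM ingested objects and plot reads/s
# against total object count.
```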

vyzo commented 7 years ago

1MM ingestion on an m3.2xlarge instance:

$ time aleph/scripts/ingest-parallel.py aleph/scripts/publish.sh ingest/1m_split/
...
real    7m30.584s
user    47m16.214s
sys     2m3.685s

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
1.2G    /mnt/ssd1/stmt
1.2G    /mnt/ssd2/data

vyzo commented 7 years ago

It should be noted that, judging from the `top` output, there is room for significant ingestion performance improvement by optimizing the slow JSON->CBOR conversion.
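
For illustration only, this is the shape of the conversion in question; it is not the daemon's actual code path, and the cbor2 package is just one way to express it in Python:

```python
import json

import cbor2  # third-party: pip install cbor2

def json_to_cbor(record_line):
    # Parse one JSON record and re-encode it as CBOR bytes.
    # Purely illustrative of the json->cbor step, not the daemon's code.
    return cbor2.dumps(json.loads(record_line))
```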

parkan commented 7 years ago

Seems good enough so far; postponing this.

vyzo commented 7 years ago

3.739MM DPLA objects:

$ wc -l dpla.manifest 
3739 dpla.manifest

$ time ~/aleph/scripts/ingest-parallel.py ~/aleph/scripts/publish-dpla.sh dpla
...
real    18m20.574s
user    107m33.339s
sys     1m16.588s

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
4.1G    /mnt/ssd1/stmt
3.2G    /mnt/ssd2/data

A cool 3397 writes/s just by using better batching.
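
For reference, the writes/s figures follow directly from the `real` (wall-clock) times reported above:

```python
# Throughput derived from the wall-clock times of the two runs above.
runs = {
    "1MM split ingest":   (1_000_000, 7 * 60 + 30.584),   # ~2219 writes/s
    "3.739MM DPLA batch": (3_739_000, 18 * 60 + 20.574),  # ~3397 writes/s
}
for name, (objects, seconds) in runs.items():
    print(f"{name}: {objects / seconds:.0f} writes/s")
```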

vyzo commented 7 years ago

An update on total space consumption, with DPLA, 500px, and Pexels (4.6MM objects):

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data/
4.9G    /mnt/ssd1/stmt
3.7G    /mnt/ssd2/data/

It seems that content filtering has trimmed down object space consumption.

vyzo commented 7 years ago

Perf measurement from the Flickr 4MM batch:

real    13m23.649s
user    72m52.314s
sys     0m46.557s

which comes out at 4977 writes/s.

vyzo commented 7 years ago

Space consumption from the 60MM Flickr dataset:

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
61G   /mnt/ssd1/stmt
22G   /mnt/ssd2/data

vyzo commented 7 years ago

Space consumption with compound statements (100 objects per statement):

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
19G     /mnt/ssd1/stmt
22G     /mnt/ssd2/data

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/xvdc        74G   21G   50G  29% /mnt/ssd2
/dev/xvdb        74G   18G   53G  26% /mnt/ssd1

which gives us a capacity estimate of at least 180MM objects for an m3.2xlarge node.
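
The 180MM figure can be reproduced roughly by linear extrapolation from the usage above, assuming roughly 65MM objects ingested so far (the 60MM Flickr set plus the earlier 4.6MM) and space usage that keeps scaling linearly with object count:

```python
# Rough extrapolation behind the capacity estimate (assumed object count).
objects_so_far = 65e6          # assumption: ~60MM Flickr + ~4.6MM earlier
disk_gb = 74                   # per-volume size from `df -h` above
used_gb = {"stmt (/dev/xvdb)": 18, "data (/dev/xvdc)": 21}

for name, used in used_gb.items():
    capacity = objects_so_far * disk_gb / used
    print(f"{name}: ~{capacity / 1e6:.0f}MM objects at capacity")

# The data volume fills first (~230MM objects at this rate), so
# "at least 180MM" is a comfortable lower bound with headroom.
```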

vyzo commented 7 years ago

Updated measurements with the inclusion of schema dependencies:

$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
22G     /mnt/ssd1/stmt
23G     /mnt/ssd2/data

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/xvdc        74G   22G   49G  31% /mnt/ssd2
/dev/xvdb        74G   21G   50G  29% /mnt/ssd1

vyzo commented 7 years ago

Removed the milestone as this will be an ongoing research topic as we grow the network.