parkan opened 7 years ago
1MM object ingestion on an m3.2xlarge instance:
$ time aleph/scripts/ingest-parallel.py aleph/scripts/publish.sh ingest/1m_split/
...
real 7m30.584s
user 47m16.214s
sys 2m3.685s
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
1.2G /mnt/ssd1/stmt
1.2G /mnt/ssd2/data
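For reference, ingest-parallel.py is essentially a driver that fans the input files out to the publish script across worker processes. A minimal sketch of that pattern (the real script may differ in details):

```python
#!/usr/bin/env python
# Minimal sketch of a parallel-ingest driver: fan input files out to a
# publish script across worker processes. Illustrative only; the actual
# ingest-parallel.py may be structured differently.
import os
import subprocess
import sys
from multiprocessing import Pool

def publish(args):
    script, path = args
    # Each worker shells out to the publish script for one input file.
    return subprocess.call([script, path])

if __name__ == "__main__":
    script, input_dir = sys.argv[1], sys.argv[2]
    files = [os.path.join(input_dir, f) for f in sorted(os.listdir(input_dir))]
    with Pool() as pool:  # defaults to one worker per CPU
        results = pool.map(publish, [(script, f) for f in files])
    sys.exit(max(results) if results else 0)
```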
It should be noted that, based on what top shows, there is likely room for significant ingestion performance improvement by optimizing the slow JSON -> CBOR conversion.
Seems good enough so far, postponing this
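For context, the JSON -> CBOR step mentioned above is just a re-encode of each record. A minimal sketch using the cbor2 package (not necessarily how the ingest path implements it internally):

```python
# Minimal sketch of the JSON -> CBOR re-encoding step, using the cbor2
# package. Illustration only; the actual conversion inside the ingest path
# may be implemented differently (and is where the time appears to go).
import json
import cbor2

def json_lines_to_cbor(in_path, out_path):
    with open(in_path, "r") as src, open(out_path, "wb") as dst:
        for line in src:
            obj = json.loads(line)          # parse one JSON record
            dst.write(cbor2.dumps(obj))     # re-encode it as CBOR
```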
3.739mm DPLA objects:
$ wc -l dpla.manifest
3739 dpla.manifest
$ time ~/aleph/scripts/ingest-parallel.py ~/aleph/scripts/publish-dpla.sh dpla
...
real 18m20.574s
user 107m33.339s
sys 1m16.588s
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
4.1G /mnt/ssd1/stmt
3.2G /mnt/ssd2/data
A cool 3397 writes/s (3,739,000 objects / ~1,100.6 s wall time) just by using better batching.
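The batching change amounts to buffering records and flushing them to the node in groups instead of issuing one write per object. A rough sketch of the idea; the batch size and the publish callable are illustrative, not the actual aleph interface:

```python
# Rough sketch of write batching: buffer records and flush them to the node
# in groups rather than one write per object. BATCH_SIZE and the
# publish_batch callable are illustrative assumptions.
BATCH_SIZE = 1000

def ingest_batched(records, publish_batch):
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= BATCH_SIZE:
            publish_batch(batch)
            batch = []
    if batch:                       # flush the final partial batch
        publish_batch(batch)
```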
An update on total space consumption, with DPLA, 500px, and pexels (4.6MM objects):
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data/
4.9G /mnt/ssd1/stmt
3.7G /mnt/ssd2/data/
Seems that the content filtering has trimmed down object space consumption
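Whatever form that filter takes, the effect is dropping data before the object is stored. One hypothetical form is a simple field whitelist; the field names here are assumptions for illustration only:

```python
# Hypothetical sketch of content filtering: keep only a whitelist of fields
# before an object is stored. The field names (and whether the real filter
# works this way at all) are assumptions for illustration.
KEEP_FIELDS = {"id", "title", "artist", "license", "source_url"}

def filter_object(obj):
    return {k: v for k, v in obj.items() if k in KEEP_FIELDS}
```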
Perf measurement from flickr 4mm batch:
real 13m23.649s
user 72m52.314s
sys 0m46.557s
which comes out to 4977 writes/s (4,000,000 objects / ~803.6 s wall time).
Space consumption from 60M Flickr dataset:
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
61G /mnt/ssd1/stmt
22G /mnt/ssd2/data
Space consumption with compound statements (100 objects per statement):
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
19G /mnt/ssd1/stmt
22G /mnt/ssd2/data
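The compound-statement layout groups many object references into a single statement instead of writing one statement per object, which is why the stmt volume shrinks while the data volume stays put. A hedged sketch of just the grouping; the real statement structure used by the node will differ:

```python
# Hedged sketch of compound statements: emit one statement per 100 object
# references instead of one statement per object. Only the grouping is
# shown; the actual statement format used by the node will differ.
OBJECTS_PER_STATEMENT = 100

def compound_statements(object_refs, namespace):
    for i in range(0, len(object_refs), OBJECTS_PER_STATEMENT):
        chunk = object_refs[i:i + OBJECTS_PER_STATEMENT]
        yield {"namespace": namespace, "refs": chunk}
```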
$ df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/xvdc 74G 21G 50G 29% /mnt/ssd2
/dev/xvdb 74G 18G 53G 26% /mnt/ssd1
which gives us a capacity estimate of at least 180mm objects for an m3.2xlarge node (the 60MM-object load uses ~21G of the fuller 74G volume, roughly 3.5x headroom, so 180mm is conservative).
Updated measurements with the inclusion of schema dependencies:
$ du --si /mnt/ssd1/stmt /mnt/ssd2/data
22G /mnt/ssd1/stmt
23G /mnt/ssd2/data
$ df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/xvdc 74G 22G 49G 31% /mnt/ssd2
/dev/xvdb 74G 21G 50G 29% /mnt/ssd1
Removed the milestone as this will be an ongoing research topic as we grow the network.
Load 50-100MM data blobs + statements, observe performance curves.
Possibly combine with #32 and measure some random reads at each 10MM to make a nice graph?
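A possible shape for that measurement, with the sample size as an assumption and the read call left abstract (it would be whatever ends up fetching one object from the node by reference):

```python
# Hypothetical sketch of the proposed measurement: after each 10MM objects
# ingested, time a sample of random reads so the results can be plotted as
# a reads/s curve against store size. read_object is left abstract here.
import random
import time

SAMPLE_READS = 1000   # random reads to time at each 10MM checkpoint

def measure_read_rate(known_refs, read_object):
    sample = random.sample(known_refs, min(SAMPLE_READS, len(known_refs)))
    start = time.time()
    for ref in sample:
        read_object(ref)
    elapsed = time.time() - start
    return len(sample) / elapsed if elapsed else float("inf")
```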