superphy / prairiedog

next-gen pangenome graphs for predictive genomics

Try Dgraph Bulk Loader #109

Closed by kevinkle 4 years ago

kevinkle commented 5 years ago

Sub-issue of #106

kevinkle commented 5 years ago

Using binary from https://github.com/dgraph-io/dgraph/releases/download/v1.0.15/dgraph-linux-amd64.tar.gz

kevinkle commented 5 years ago

dgraph bulk -r outputs/samples/ -s dgraph/kmers.schema --map_shards=1 --reduce_shards=1 --http localhost:8001 --zero=localhost:5080

kevinkle commented 5 years ago

REDUCE 33m38s [99.73%] edge_count:390.6M edge_speed:480.3k/sec plist_count:7.237M plist_speed:8.900k/sec
REDUCE 33m39s [99.92%] edge_count:391.3M edge_speed:480.6k/sec plist_count:7.251M plist_speed:8.906k/sec
badger 2019/07/08 11:53:02 INFO: Storing value log head: {Fid:7 Len:42 Offset:53101009}
REDUCE 33m40s [100.00%] edge_count:391.6M edge_speed:480.4k/sec plist_count:7.258M plist_speed:8.903k/sec
REDUCE 33m41s [100.00%] edge_count:391.6M edge_speed:479.8k/sec plist_count:7.258M plist_speed:8.892k/sec
REDUCE 33m42s [100.00%] edge_count:391.6M edge_speed:479.2k/sec plist_count:7.258M plist_speed:8.882k/sec
badger 2019/07/08 11:53:05 INFO: Force compaction on level 0 done
REDUCE 33m43s [100.00%] edge_count:391.6M edge_speed:478.9k/sec plist_count:7.258M plist_speed:8.876k/sec
Total: 33m43s

This is for 40 genomes.

kevinkle commented 5 years ago

kevin@panther ~/prairiedog> ls -lah out/0/p/
total 2.4G
drwx------ 2 kevin kevin 4.0K Jul  8 11:53 ./
drwx------ 3 kevin kevin 4.0K Jul  8 11:19 ../
-rw-r--r-- 1 kevin kevin 491M Jul  8 11:41 000000.vlog
-rw-r--r-- 1 kevin kevin 491M Jul  8 11:43 000001.vlog
-rw-r--r-- 1 kevin kevin 491M Jul  8 11:45 000002.vlog
-rw-r--r-- 1 kevin kevin 414M Jul  8 11:47 000003.vlog
-rw-r--r-- 1 kevin kevin  81M Jul  8 11:48 000004.vlog
-rw-r--r-- 1 kevin kevin  81M Jul  8 11:50 000005.vlog
-rw-r--r-- 1 kevin kevin  70M Jul  8 11:49 000006.sst
-rw-r--r-- 1 kevin kevin  81M Jul  8 11:52 000006.vlog
-rw-r--r-- 1 kevin kevin  70M Jul  8 11:49 000007.sst
-rw-r--r-- 1 kevin kevin  51M Jul  8 11:53 000007.vlog
-rw-r--r-- 1 kevin kevin  70M Jul  8 11:53 000012.sst
-rw-r--r-- 1 kevin kevin  50M Jul  8 11:53 000013.sst
-rw-r--r-- 1 kevin kevin  212 Jul  8 11:53 MANIFEST

kevinkle commented 5 years ago

The intermediate RDF files are fairly large, roughly 670 to 690 MB per genome:

-rw-r--r-- 1 kevin kevin 678M Jul  8 15:17 SRR5573131.fasta.rdf
-rw-r--r-- 1 kevin kevin 676M Jul  8 15:21 SRR5573135.fasta.rdf
-rw-r--r-- 1 kevin kevin 686M Jul  8 14:46 SRR5573137.fasta.rdf
-rw-r--r-- 1 kevin kevin 670M Jul  8 16:34 SRR5573138.fasta.rdf
-rw-r--r-- 1 kevin kevin 667M Jul  8 16:42 SRR5573139.fasta.rdf
-rw-r--r-- 1 kevin kevin 679M Jul  8 15:05 SRR5573142.fasta.rdf
-rw-r--r-- 1 kevin kevin 670M Jul  8 16:07 SRR5573145.fasta.rdf

kevinkle commented 5 years ago

Currently testing with 950 genomes
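
A rough back-of-envelope for what the 950-genome run might involve. This is only a sketch: it assumes edge count and RDF size scale linearly with genome count, which may not hold, and that the seven RDF files listed earlier are representative. The inputs are the final edge_count from the 40-genome REDUCE log and the sizes from the ls listing; the function names are illustrative.

```python
# Back-of-envelope scaling from the 40-genome run to a larger run.
# Assumes linear scaling with genome count, which may not hold in practice.

EDGES_40_GENOMES = 391.6e6  # edge_count at 100% in the REDUCE log
RDF_SIZES_MIB = [678, 676, 686, 670, 667, 679, 670]  # from the ls listing


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def estimate(genomes: int) -> dict:
    """Linear extrapolation of edge count and total RDF input size."""
    edges_per_genome = EDGES_40_GENOMES / 40
    rdf_mib_per_genome = mean(RDF_SIZES_MIB)
    return {
        "edges": edges_per_genome * genomes,
        "rdf_gib": rdf_mib_per_genome * genomes / 1024,
    }


if __name__ == "__main__":
    est = estimate(950)
    print(f"edges: {est['edges'] / 1e9:.1f}B, RDF input: {est['rdf_gib']:.0f} GiB")
```

Under those assumptions this comes to about 9.3 billion edges and about 626 GiB of intermediate RDF for 950 genomes.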

kevinkle commented 5 years ago

Need to map the tmp/ directory in the working directory to a larger disk when running dgraph bulk
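
A minimal sketch of a pre-flight check for this, assuming the scratch space needed is on the order of the total RDF input size. The `factor` multiplier is a guess rather than a figure from Dgraph, and the function names and paths are hypothetical.

```python
import shutil
from pathlib import Path


def rdf_bytes(rdf_dir: str) -> int:
    """Total size in bytes of all .rdf files directly inside rdf_dir."""
    return sum(p.stat().st_size for p in Path(rdf_dir).glob("*.rdf"))


def tmp_has_room(rdf_dir: str, tmp_parent: str, factor: float = 1.0) -> bool:
    """Return True if the filesystem holding tmp_parent has at least
    factor * (total RDF size) bytes free.

    factor is an assumed safety multiplier, not a documented Dgraph value.
    """
    free = shutil.disk_usage(tmp_parent).free
    return free >= factor * rdf_bytes(rdf_dir)
```

For example, `tmp_has_room("outputs/samples", ".")` would check the disk that holds the current working directory, which is where tmp/ ends up in the run above.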

kevinkle commented 5 years ago

dgraph bulk deletes tmp/ before starting

kevinkle commented 4 years ago

Looks good, will go with this