mldbai / mldb

MLDB is the Machine Learning Database
http://mldb.ai
Apache License 2.0
661 stars 102 forks source link

Decompose selects #904

Closed jeremybarnes closed 5 years ago

jeremybarnes commented 5 years ago

The import.text procedure is very powerful due to its use of a SELECT statement to manipulate data as it is parsed. However, this SQL processing was leading to a lot of extra data manipulation which was slowing down the speed at which data can be imported.

This branch improves the speed of text imports by decomposing the select into sub-selects, which can then be optimized on a field by field basis:

This work can eventually be moved to a more generic optimization pass, but currently lives mostly in the text importer module.

In addition, a number of extra changes were made to reduce text import to tabular dataset bottlenecks:

The net effect is a major improvement (20% to 300%) in CSV import speed, especially across a higher-end server with many cores (the below benchmarks were run on a 32 physical / 64 logical core system).

Before:

NYC taxis dataset, all fields: imported 14092412 in 10.0634s at 1.40036M lines/second on 52.082 CPUs done 2524.01 megabytes at 250.808 megabytes/sec
Stocks dataset, one field: 45001782 in 23.5393s at 1.91177M lines/second on 31.0846 CPUs done 19575.1 megabytes at 831.592 megabytes/sec 
Stocks dataset, all fields: 45001782 in 69.8894s at 0.6439M lines/second on 59.9076 CPUs 19575.1 megabytes at 280.087 megabytes/sec
Airlines dataset, all fields: 152360026 in 31.758s at 4.79753M lines/second on 43.0687 CPUs 15634.9 megabytes at 492.311 megabytes/sec
Airlines dataset, excluding one: imported 152360026 in 105.662s at 1.44195M lines/second on 57.8989 CPUs 15634.9 megabytes at 147.97 megabytes/sec
Airlines dataset, one field: 152360026 in 27.194s at 5.60271M lines/second on 26.9019 CPUs 15634.9 megabytes at 574.936 megabytes/sec

After:

NYC taxis dataset, all fields: 14092414 in 4.99944s at 2.8188M lines/second on 49.1943 CPUs 2524.01 megabytes at 504.847 megabytes/sec (2x faster)
Stocks dataset, one field: 45001786 in 9.96057s at 4.51799M lines/second on 45.0521 CPUs 19575.1 megabytes at 1965.24 megabytes/sec (2.4x faster)
Stocks dataset, all fields: 45001786 in 24.9354s at 1.80474M lines/second on 58.6538 CPUs 19575.1 megabytes at 785.032 megabytes/sec (2.8x faster)
Airlines dataset, all fields: 152360031 in 26.8021s at 5.68463M lines/second on 53.4331 CPUs 15634.9 megabytes at 583.344 megabytes/sec (1.2x faster)
Airlines dataset, excluding one: 152360031 in 28.8749s at 5.27656M lines/second on 54.8971 CPUs 15634.9 megabytes at 541.468 megabytes/sec (3.6x faster)
Airlines dataset, one field: 152360031 in 17.9381s at 8.49365M lines/second on 21.7854 CPUs 15634.9 megabytes at 871.599 megabytes/sec (1.5x faster)