The `import.text` procedure is powerful because it applies a `SELECT` statement to manipulate data as it is parsed. However, this SQL processing added a lot of extra data manipulation, which slowed down imports.

This branch improves the speed of text imports by decomposing the `SELECT` into sub-selects, which can then be optimized on a field-by-field basis:
- Fields in the input file that are not used are not fully parsed; just enough work is done to skip them.
- Fields that are simply copied or renamed into the output skip the generic `SELECT` processing.
- Values are moved instead of copied the last time they are accessed.
This work can eventually be moved to a more generic optimization pass, but currently lives mostly in the text importer module.
In addition, several changes were made to reduce bottlenecks when importing text into tabular datasets:
- Fewer allocations are made when indexing columns that contain a high diversity of values.
- Memory mapping is used for files that are uncompressed on the local filesystem.
- The system thread pool is used for text parsing, rather than a separate thread pool that caused scheduling contention.
- Rather than using a regex to skip comments or other information preceding the file header, there is now a `preHeaderOffset` setting that skips a given number of lines (regex processing is a bottleneck, even for simple regexes).
The net effect is a major improvement (20% to 300%) in CSV import speed, especially on a higher-end server with many cores (the benchmarks below were run on a 32 physical / 64 logical core system).
Before:
NYC taxis dataset, all fields: 14092412 lines in 10.0634s at 1.40036M lines/second on 52.082 CPUs; 2524.01 megabytes at 250.808 megabytes/sec
Stocks dataset, one field: 45001782 lines in 23.5393s at 1.91177M lines/second on 31.0846 CPUs; 19575.1 megabytes at 831.592 megabytes/sec
Stocks dataset, all fields: 45001782 lines in 69.8894s at 0.6439M lines/second on 59.9076 CPUs; 19575.1 megabytes at 280.087 megabytes/sec
Airlines dataset, all fields: 152360026 lines in 31.758s at 4.79753M lines/second on 43.0687 CPUs; 15634.9 megabytes at 492.311 megabytes/sec
Airlines dataset, excluding one: 152360026 lines in 105.662s at 1.44195M lines/second on 57.8989 CPUs; 15634.9 megabytes at 147.97 megabytes/sec
Airlines dataset, one field: 152360026 lines in 27.194s at 5.60271M lines/second on 26.9019 CPUs; 15634.9 megabytes at 574.936 megabytes/sec
After:
NYC taxis dataset, all fields: 14092414 lines in 4.99944s at 2.8188M lines/second on 49.1943 CPUs; 2524.01 megabytes at 504.847 megabytes/sec (2x faster)
Stocks dataset, one field: 45001786 lines in 9.96057s at 4.51799M lines/second on 45.0521 CPUs; 19575.1 megabytes at 1965.24 megabytes/sec (2.4x faster)
Stocks dataset, all fields: 45001786 lines in 24.9354s at 1.80474M lines/second on 58.6538 CPUs; 19575.1 megabytes at 785.032 megabytes/sec (2.8x faster)
Airlines dataset, all fields: 152360031 lines in 26.8021s at 5.68463M lines/second on 53.4331 CPUs; 15634.9 megabytes at 583.344 megabytes/sec (1.2x faster)
Airlines dataset, excluding one: 152360031 lines in 28.8749s at 5.27656M lines/second on 54.8971 CPUs; 15634.9 megabytes at 541.468 megabytes/sec (3.6x faster)
Airlines dataset, one field: 152360031 lines in 17.9381s at 8.49365M lines/second on 21.7854 CPUs; 15634.9 megabytes at 871.599 megabytes/sec (1.5x faster)