vmware-archive / quickstep

Quickstep Project
Apache License 2.0

Improve text scan operator #239

Closed: jianqiao closed this 8 years ago

jianqiao commented 8 years ago

This PR updates the TextScanOperator to improve its performance.

There are four main changes:

(1) Pass text_offset and text_segment_size as parameters to each TextScanWorkOrder instead of loading the data up front. Each TextScanWorkOrder then reads its corresponding piece of data directly from disk.
(2) Avoid extra string copying by passing const char ** buffer pointers into parseRow() and extractFieldString().
(3) Implement TupleVectorValueAccessor as a temporary container for the parsed tuples, then call output_destination_->bulkInsertTuples() to bulk insert them.
(4) Modify CharType::parseValueFromString() to create a TypedValue whose buffer is exactly the length specified by the CharType. This is required for TupleVectorValueAccessor to work correctly, and it also improves robustness.
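To illustrate change (1), here is a minimal C++ sketch of a work order that receives only an offset and a segment size and reads its own bytes from disk. The helper name ReadSegment and its signature are assumptions made for illustration and are not Quickstep's actual API. Aligning segments to row boundaries is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Quickstep): read up to `segment_size`
// bytes starting at byte `offset` of the file at `path`. Returns the bytes
// actually read, which may be fewer near the end of the file, or an empty
// string if the file cannot be opened or the seek fails.
std::string ReadSegment(const std::string &path,
                        std::size_t offset,
                        std::size_t segment_size) {
  std::FILE *file = std::fopen(path.c_str(), "rb");
  if (file == nullptr) {
    return std::string();
  }

  std::string buffer(segment_size, '\0');
  std::size_t bytes_read = 0;
  // Jump straight to this work order's slice; earlier bytes are never read.
  if (std::fseek(file, static_cast<long>(offset), SEEK_SET) == 0) {
    bytes_read = std::fread(&buffer[0], 1, segment_size, file);
  }
  std::fclose(file);

  assert(bytes_read <= segment_size);
  buffer.resize(bytes_read);
  return buffer;
}
```

With this shape, the operator only has to compute (offset, size) pairs when generating work orders, so the file contents are read by the workers rather than by the scheduler.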

Note: This updated version follows the semantics of the old TextScanOperator, except that it does not support backslash + newline escaping, e.g. (a)

aaaa\
bbbb

which is semantically equivalent to (b)

aaaa\nbbbb

We support (b) but not (a), because (a) requires extra logic that complicates the code. Moreover, format (a) seems to be specific to PostgreSQL, and the PostgreSQL 9.6 documentation says:

"It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases. They are also highly vulnerable to corruption if the COPY file is transferred across different machines (for example, from Unix to Windows or vice versa)."
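The difference between formats (a) and (b) can be shown with a small C++ sketch of field extraction through a const char ** cursor. The function ExtractField, its signature, and its set of recognized escapes are assumptions for illustration; this is not Quickstep's extractFieldString(). It decodes the two-character sequence \n as in (b) and rejects a backslash followed by a raw newline as in (a).

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch (not Quickstep's implementation). Reads one field
// starting at *cursor and stops at `delimiter`, at a raw newline, or at the
// terminating '\0'. On success, *cursor is advanced past the field (and past
// the delimiter, if one was found). Returns false when a backslash is
// followed by a raw newline or by the end of input, i.e. format (a).
bool ExtractField(const char **cursor, char delimiter, std::string *field) {
  assert(cursor != nullptr && *cursor != nullptr && field != nullptr);
  field->clear();

  const char *p = *cursor;
  while (*p != '\0' && *p != '\n' && *p != delimiter) {
    if (*p != '\\') {
      field->push_back(*p++);
      continue;
    }
    ++p;  // Skip the backslash and inspect the escaped character.
    switch (*p) {
      case 'n': field->push_back('\n'); break;
      case 'r': field->push_back('\r'); break;
      case 't': field->push_back('\t'); break;
      case '\n':  // Backslash + raw newline: format (a), not supported.
      case '\0':  // Dangling backslash at end of input.
        *cursor = p;
        return false;
      default: field->push_back(*p); break;  // e.g. "\\" or an escaped delimiter.
    }
    ++p;
  }

  if (*p == delimiter) {
    ++p;
  }
  *cursor = p;
  return true;
}
```

Because the cursor is advanced in place, the caller can extract consecutive fields from the same buffer without copying the row first. Rejecting format (a) also means a raw newline always ends a row, so no lookahead across lines is needed.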

jianqiao commented 8 years ago

Adding TupleVectorValueAccessor caused Travis-CI to run out of its 4GB of memory when compiling 3 of the 8 configurations, and increased the binary size by ~20%. So it may not be worth adding TupleVectorValueAccessor at this point, since it is only used by TextScanOperator.

I will send out a new PR that uses the existing ColumnVectorsValueAccessor to bulk insert tuples for TextScanOperator.