jianqiao closed this pull request 8 years ago.
Adding `TupleVectorValueAccessor` caused Travis-CI to run out of its 4 GB of memory when compiling 3 of the 8 configurations, and increased the binary size by ~20%. So it may not be worth adding this new `TupleVectorValueAccessor` at this moment, since it is only used by `TextScanOperator`.

Will send out a new PR that uses the existing `ColumnVectorsValueAccessor` to bulk insert tuples for `TextScanOperator`.

This PR updates the `TextScanOperator` to improve its performance.

There are four main changes:

(1) Pass `text_offset` and `text_segment_size` as parameters to each `TextScanWorkOrder` instead of actually loading the data. Each `TextScanWorkOrder` then reads the corresponding piece of data directly from disk.

(2) Avoid extra string copying by passing `const char **` buffer pointers into `parseRow()` and `extractFieldString()`.

(3) Implement `TupleVectorValueAccessor` as the temporary container to store the parsed tuples, then call `output_destination_->bulkInsertTuples()` to bulk insert the tuples.

(4) Modify `CharType::parseValueFromString()` to create a `TypedValue` whose buffer has exactly the length specified by the `CharType`. This is required for `TupleVectorValueAccessor`
to work correctly, and also for robustness.

Note: This updated version follows the semantics of the old `TextScanOperator`, except that it does not support backslash + newline escaping, e.g. (a) a backslash followed by a literal newline inside a field, which is semantically equivalent to (b) the `\n` escape sequence.

We support (b) but not (a), because (a) requires extra logic that complicates the code. Meanwhile, format (a) seems to be specific to PostgreSQL, and the documentation of PostgreSQL 9.6 says:

> It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases. They are also highly vulnerable to corruption if the COPY file is transferred across different machines (for example, from Unix to Windows or vice versa).
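
To illustrate change (1), here is a minimal sketch of a work order that carries only an offset and a segment size and reads its own byte range from disk. It is not the code in this PR: `SegmentWorkOrder`, `MakeWorkOrders()`, and `ReadSegment()` are hypothetical names, only `text_offset` and `text_segment_size` come from the description above, and the sketch ignores how rows that straddle a segment boundary are handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for a work order: it records where its piece of the
// file lives instead of holding the bytes themselves.
struct SegmentWorkOrder {
  std::string filename;
  std::size_t text_offset;
  std::size_t text_segment_size;
};

// Split a file of file_size bytes into work orders of at most segment_size
// bytes each. No file data is read at this point.
std::vector<SegmentWorkOrder> MakeWorkOrders(const std::string &filename,
                                             std::size_t file_size,
                                             std::size_t segment_size) {
  std::vector<SegmentWorkOrder> orders;
  if (segment_size == 0) {
    return orders;  // Avoid an endless loop on a zero-sized segment.
  }
  for (std::size_t offset = 0; offset < file_size; offset += segment_size) {
    const std::size_t size = std::min(segment_size, file_size - offset);
    orders.push_back(SegmentWorkOrder{filename, offset, size});
  }
  return orders;
}

// Executed by each work order: seek to its offset and read only its segment.
// Returns an empty string if the file cannot be opened or the seek fails.
std::string ReadSegment(const SegmentWorkOrder &order) {
  std::FILE *file = std::fopen(order.filename.c_str(), "rb");
  if (file == nullptr) {
    return std::string();
  }
  std::string buffer(order.text_segment_size, '\0');
  std::size_t bytes_read = 0;
  if (std::fseek(file, static_cast<long>(order.text_offset), SEEK_SET) == 0) {
    bytes_read = std::fread(&buffer[0], 1, buffer.size(), file);
  }
  std::fclose(file);
  buffer.resize(bytes_read);
  return buffer;
}
```

With this shape, the operator that generates work orders only needs the file size, and the I/O cost moves into the work orders, which can run in parallel.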