Open GregoryKimball opened 4 months ago
Hello @shrshi, I added this issue about the refactoring work you started this week. Please excuse me if you documented this elsewhere and I missed it. Please feel free to update these topics with your current picture of the project. Thank you!
Tree representation:
This feature introduces a new `column_tree_csr` struct that stores the column tree representation in CSR (compressed sparse row) format. The nodes are renumbered level-wise instead of being directly mapped to column ids. This serves two purposes: (i) sub-trees matching the input dtype schema can be skipped in between-column pruning, and (ii) sub-trees matching non-conforming dtypes in mixed-type columns can be similarly skipped (within-column pruning).
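To make the idea concrete, here is a minimal host-side sketch of a tree stored in CSR form with level-wise (breadth-first) renumbering. It is written in Python for brevity and is not the libcudf implementation: the names `ColumnTreeCSR`, `build_level_order_csr`, `row_offsets`, `col_indices`, and `node_props` are hypothetical, and storing only child edges in the adjacency is an assumption made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnTreeCSR:
    """Hypothetical stand-in for a CSR column tree (not the libcudf struct)."""
    row_offsets: List[int]  # children of node i are col_indices[row_offsets[i]:row_offsets[i + 1]]
    col_indices: List[int]  # concatenated child lists, using the new level-wise ids
    node_props: List[str]   # one property per node, stored in the new level-wise order


def build_level_order_csr(parent: List[int], props: List[str]) -> ColumnTreeCSR:
    """Build a CSR tree from a parent array, where parent[i] == -1 marks the root."""
    n = len(parent)
    children: List[List[int]] = [[] for _ in range(n)]
    root = -1
    for node, p in enumerate(parent):
        if p < 0:
            root = node
        else:
            children[p].append(node)

    # A breadth-first traversal visits nodes level by level; the visit order
    # becomes the new node numbering.
    order: List[int] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children[node])
    new_id = {old: new for new, old in enumerate(order)}

    # Emit one CSR row per node, in the new order.
    row_offsets = [0]
    col_indices: List[int] = []
    for old in order:
        col_indices.extend(new_id[c] for c in children[old])
        row_offsets.append(len(col_indices))

    return ColumnTreeCSR(row_offsets, col_indices, [props[o] for o in order])


# Example tree (original ids): 0 is the root, 1 and 3 are its children,
# 2 and 4 are children of 1.
csr = build_level_order_csr(
    parent=[-1, 0, 1, 0, 1],
    props=["struct", "struct", "int", "str", "str"],
)
print(csr.row_offsets)  # [0, 2, 4, 4, 4, 4]
print(csr.col_indices)  # [1, 2, 3, 4]
print(csr.node_props)   # ['struct', 'struct', 'str', 'int', 'str']
```

After renumbering, every level occupies a contiguous id range (here level 1 is ids 1-2 and level 2 is ids 3-4), and a node with an empty row is a leaf. A pruning pass in this sketch would simply avoid descending into the rows of a node it decides to skip.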
The key advantage of wrapping column properties (mixed-type and map-type support, column pruning, ignoring null literals, column validity, and array-of-arrays support) as 'non-zero' values in the `column_tree_csr` struct is maintainability and ease of adding new features in the future.
The steady addition of features to the JSON reader has resulted in some code paths that are error-prone (see #15750) and difficult to maintain. Support for mixed types, coercing nested types to string, array of arrays, null literals, and more has been added over the past few releases (see comment), and this has stretched the original design of token-to-tree and tree-to-column processing.