hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License
966 stars 242 forks source link

Clarifying Ordering And Keying #6929

Closed danking closed 5 years ago

danking commented 5 years ago

Background reading: "Ordering and keys in relational objects"

Background

There are three keying operations in hail: Table.key_by, MatrixTable.key_rows_by, and MatrixTable.key_cols_by.

There are two ordering operations in hail: Table.order_by, and MatrixTable.choose_cols.

Of the three keying operations, only Table.key_by and MatrixTable.key_rows_by enforce ordering. The ordering of the columns of a matrix table has no relationship to the keys of the columns of a MatrixTable. NB: when a table is created from the columns of a matrix table (MatrixTable.cols), the table is sorted by the keys (there exists an implicit Table.key_by). Moreover, we guarantee and document (in MatrixTable.cols) that when the matrix table has a zero-length column key, the table's ordering is given by matrix table columns' ordering.

According to "Ordering and keys in relational objects", all sorts (whether triggered by an order_by or a key_by) are unstable.

A common user operation is to export or collect a field of a relational object. Sometimes users do not want the keys of an expression exported or collected. In this situation, the user requires that the data is sorted in a sensible way (otherwise they cannot recover which item came from which key).

Hail internally guarantees (but does not guarantee to our users or document) that localizing operations (take, collect, and show) and export produce data in the ordering of the relational object. For example:

In [38]: t = hl.utils.range_table(3) 
    ...: t = t.order_by(-t.idx).show()                                                                                                                                                         
+-------+
|   idx |
+-------+
| int32 |
+-------+
|     2 |
|     1 |
|     0 |
+-------+

Ordering and Library Developers

On occasion, a user may have a table of unknown ordering and keying. For example, the implementor of Expression.collect (e.g. mt.GT.collect()). In this situation, it is desirable to be able to remove keys without modifying the order. In particular, the values should appear in the same order that they appear in the relational object (for a table, in the order of the rows; for a matrix table, ordered first by the row and then by the column).

For example, the multiplication table for 0 to 2:

In [39]: import hail as hl  
    ...: mt = hl.utils.range_matrix_table(3,3)  
    ...: mt = mt.annotate_entries(x = mt.row_idx * mt.col_idx)  
    ...: mt.x.collect() 
    ...:                                                                                                                                                                                       
Out[39]: [0, 0, 0, 0, 1, 2, 0, 2, 4]

If we choose to rekey the rows, the output is reordered:

In [41]: import hail as hl  
    ...: mt = hl.utils.range_matrix_table(3,3) 
    ...: mt = mt.annotate_entries(x = mt.row_idx * mt.col_idx)  
    ...: mt = mt.key_rows_by(rowkey=-mt.row_idx) 
    ...: mt.x.collect() 
    ...:                                                                                                                                                                                       
Out[41]: [0, 2, 4, 0, 1, 2, 0, 0, 0]

If we choose to reorder the cols:

In [43]: import hail as hl  
    ...: mt = hl.utils.range_matrix_table(3,3) 
    ...: mt = mt.annotate_entries(x = mt.row_idx * mt.col_idx)  
    ...: mt = mt.choose_cols([2,1,0]) 
    ...: mt.x.collect() 
    ...:                                                                                                                                                                                       
Out[43]: [0, 0, 0, 2, 1, 0, 4, 2, 0]

A library developer may want to remove keys while preserving order so as to implement the above methods. Because all sorts in Hail are unstable, this is a delicate feat. There are two cases: zero-length key, non-zero-length key.

When the key is of zero-length, the data may be sorted in some unknown and arbitrary order. Consider for example:

In [45]: import hail as hl  
    ...: mt = hl.utils.range_matrix_table(3,3) 
    ...: mt = mt.key_cols_by().choose_cols([1,2,0]) 
    ...: mt.cols().collect() 
    ...:                                                                                                                                                                                       
Out[45]: [Struct(col_idx=1), Struct(col_idx=2), Struct(col_idx=0)]

Or importing data with no key.

In this case it is crucial to not call order_by() or key_by() because both permit hail to arbitrarily reorder the entire dataset (we are unstably sorting by an empty key, ergo, all values are equal).

When the key is of non-zero-length, then data must be sorted by the key and the ordering of rows with equivalent keys is undefined. In this case, it is safe to order_by(*t.key).

User Expectations

There is a subtle difference in how we treat zero-length keyed (ZLK) objects and non-zero-length keyed objects. We try to preserve the data ordering of ZLK objects. In particular, users would be pretty surprised if import_table(..., key=[]) shuffled the rows. Additionally, users expect (and we document) that mt.key_cols_by().cols() does not shuffle the rows.

In contrast, we treat a non-zero-length keyed object as if any ordering beyond that defined by the key is irrelevant. Is this surprising to a user? Suppose they had a text file ordered by family id, sample id, they might reasonably expect that import_table(..., key=['family id']) does not reorder the rows within a family. Hail doesn't guarantee this even though we do make pains to not reorder the same data imported by import_table(..., key=[]).

Ordering and the Optimizer

Currently, the Hail optimizer does not remove a choose_cols that precedes a order_by(). Nor does it remove an order_by(x, ...) that precedes an order_by(). It does, however, remove a key_by(x, ...) that precedes an order_by() or a key_by().

In [47]: import hail as hl  
    ...: mt = hl.utils.range_matrix_table(3,3) 
    ...: mt = mt.key_cols_by().choose_cols([1,2,0]) 
    ...: mt.cols().order_by().show() 
    ...:                                                                                                                                                                                       
+---------+
| col_idx |
+---------+
|   int32 |
+---------+
|       1 |
|       2 |
|       0 |
+---------+

In [48]: t = hl.utils.range_table(3) 
    ...: t = t.order_by(-t.idx) 
    ...: t = t.order_by().show()                                                                                                                                                               
+-------+
|   idx |
+-------+
| int32 |
+-------+
|     2 |
|     1 |
|     0 |
+-------+

In [49]: t = hl.utils.range_table(3) 
    ...: t = t.key_by(x=-t.idx) 
    ...: t = t.order_by().show()                                                                                                                                                               
+-------+-------+
|   idx |     x |
+-------+-------+
| int32 | int32 |
+-------+-------+
|     0 |     0 |
|     1 |    -1 |
|     2 |    -2 |
+-------+-------+

In [50]: t = hl.utils.range_table(3) 
    ...: t = t.key_by(x=-t.idx) 
    ...: t = t.key_by().show()                                                                                                                                                                 
+-------+-------+
|   idx |     x |
+-------+-------+
| int32 | int32 |
+-------+-------+
|     0 |     0 |
|     1 |    -1 |
|     2 |    -2 |
+-------+-------+

Dealing With It

In practice, this following will remove the key on a table without changing the ordering imposed by previous order-changing operations. (NB: "latent" ordering inherited from a file [see aforementioned family id, sample id example] is not guaranteed to be preserved by this though, in practice, it often is)

def unkey(t):
    if len(t.key) != 0:
        t = t.order_by(t.key)
    return t
danking commented 5 years ago

There's no real issue here, so closing.