Closed tpoterba closed 5 years ago
While you're at it, can you remove entriesRVD warning if col key is empty?
The strictify is needed to ensure that the resulting entriesRVD is ordered by [row key, col key]. If the row key is empty, the only way to do this would be to explode the rows then shuffle to sort by col key.
I'm not sure what behavior we want for this. The empty row key is just the extreme case of having too coarse a row key, so that the rows having a given row key are too large to fit one partition and sort the entries by [row key, col key]. Until we can dynamically estimate the data sizes, we have to make some assumption about when a key's worth of data fits on a partition, and otherwise we have to shuffle.
If the col key was also empty, we could do this without a shuffle, because there is no ordering requirement on the entriesRVD. I could recognize and optimize that case as a partial solution.
I see the problem, yeah
we can't really ever shuffle the entries, though. We may need a solution that groups by key and explodes.
That's what it does now. If there's no row key, then group-by-key means one partition. The only non-shuffling fix I see is to handle separately the case where both keys are empty.
ah, I see. yes, let's do that.
if the row key is non empty and the col key is empty, though, this will still happen. We should warn or possibly fail in that case
What will still happen?
strictify will put it into one partition
You mean if row key is empty and col key is non-empty? Yes, I agree we should warn then.
after a few more minutes of thought, a warning is probably not sufficient here. We have to assume that losing parallelism means that the computation will never finish.
There aren't really any good options:
Unkey the cols seems wrong. It makes the contract of entries()
much more complicated. I guess a warning or error that makes the user decide between manually unkeying the columns or shuffling would be best? (I guess that means implementing a shuffling option for small data.)
seeing weird behavior:
In [2]: mt = hl.utils.range_matrix_table(3, 3)
In [3]: mt.annotate(x = hl.struct(r=mt.row_idx, c=mt.col_idx))
In [5]: mt = mt.annotate_entries(x = hl.struct(r=mt.row_idx, c=mt.col_idx))
In [6]: mt.x.collect()
Out[6]:
[Struct(r=0, c=0),
Struct(r=0, c=1),
Struct(r=0, c=2),
Struct(r=1, c=0),
Struct(r=1, c=1),
Struct(r=1, c=2),
Struct(r=2, c=0),
Struct(r=2, c=1),
Struct(r=2, c=2)]
In [7]: mt.key_rows_by().key_cols_by().x.collect()
Out[7]:
[Struct(r=0, c=0),
Struct(r=1, c=0),
Struct(r=2, c=0),
Struct(r=0, c=1),
Struct(r=1, c=1),
Struct(r=2, c=1),
Struct(r=0, c=2),
Struct(r=1, c=2),
Struct(r=2, c=2)]
However, my branch is behind #5799. I'll try again on that. The lowering implementation may have fixed things (or may have bugs!)
seeing weird behavior:
In [2]: mt = hl.utils.range_matrix_table(3, 3)
In [3]: mt.annotate(x = hl.struct(r=mt.row_idx, c=mt.col_idx))
In [5]: mt = mt.annotate_entries(x = hl.struct(r=mt.row_idx, c=mt.col_idx))
In [6]: mt.x.collect()
Out[6]:
[Struct(r=0, c=0),
Struct(r=0, c=1),
Struct(r=0, c=2),
Struct(r=1, c=0),
Struct(r=1, c=1),
Struct(r=1, c=2),
Struct(r=2, c=0),
Struct(r=2, c=1),
Struct(r=2, c=2)]
In [7]: mt.key_rows_by().key_cols_by().x.collect()
Out[7]:
[Struct(r=0, c=0),
Struct(r=1, c=0),
Struct(r=2, c=0),
Struct(r=0, c=1),
Struct(r=1, c=1),
Struct(r=2, c=1),
Struct(r=0, c=2),
Struct(r=1, c=2),
Struct(r=2, c=2)]
However, my branch is behind #5799. I'll try again on that. The lowering implementation may have fixed things (or may have bugs!)
OK, this is fixed with the lowering implementation, but we do have a bug if the row key is empty but the col key is present.
to replicate:
Seems to do with the strictify? I'm not totally sure though.