hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License
966 stars 242 forks source link

MatrixTable.entriesRVD() puts everything in 1 partition if there is no row key #4646

Closed tpoterba closed 5 years ago

tpoterba commented 5 years ago

to replicate:

mt.key_rows_by().entries().n_partitions()

Seems to do with the strictify? I'm not totally sure though.

tpoterba commented 5 years ago

While you're at it, can you remove entriesRVD warning if col key is empty?

patrick-schultz commented 5 years ago

The strictify is needed to ensure that the resulting entriesRVD is ordered by [row key, col key]. If the row key is empty, the only way to do this would be to explode the rows then shuffle to sort by col key.

I'm not sure what behavior we want for this. The empty row key is just the extreme case of having too coarse a row key, so that the rows having a given row key are too large to fit one partition and sort the entries by [row key, col key]. Until we can dynamically estimate the data sizes, we have to make some assumption about when a key's worth of data fits on a partition, and otherwise we have to shuffle.

If the col key was also empty, we could do this without a shuffle, because there is no ordering requirement on the entriesRVD. I could recognize and optimize that case as a partial solution.

tpoterba commented 5 years ago

I see the problem, yeah

tpoterba commented 5 years ago

we can't really ever shuffle the entries, though. We may need a solution that groups by key and explodes.

patrick-schultz commented 5 years ago

That's what it does now. If there's no row key, then group-by-key means one partition. The only non-shuffling fix I see is to handle separately the case where both keys are empty.

tpoterba commented 5 years ago

ah, I see. yes, let's do that.

tpoterba commented 5 years ago

if the row key is non empty and the col key is empty, though, this will still happen. We should warn or possibly fail in that case

patrick-schultz commented 5 years ago

What will still happen?

tpoterba commented 5 years ago

strictify will put it into one partition

patrick-schultz commented 5 years ago

You mean if row key is empty and col key is non-empty? Yes, I agree we should warn then.

tpoterba commented 5 years ago

after a few more minutes of thought, a warning is probably not sufficient here. We have to assume that losing parallelism means that the computation will never finish.

There aren't really any good options:

patrick-schultz commented 5 years ago

Unkey the cols seems wrong. It makes the contract of entries() much more complicated. I guess a warning or error that makes the user decide between manually unkeying the columns or shuffling would be best? (I guess that means implementing a shuffling option for small data.)

tpoterba commented 5 years ago

seeing weird behavior:

In [2]: mt = hl.utils.range_matrix_table(3, 3)

In [3]: mt.annotate(x = hl.struct(r=mt.row_idx, c=mt.col_idx))

In [5]: mt = mt.annotate_entries(x = hl.struct(r=mt.row_idx, c=mt.col_idx))

In [6]: mt.x.collect()
Out[6]:
[Struct(r=0, c=0),
 Struct(r=0, c=1),
 Struct(r=0, c=2),
 Struct(r=1, c=0),
 Struct(r=1, c=1),
 Struct(r=1, c=2),
 Struct(r=2, c=0),
 Struct(r=2, c=1),
 Struct(r=2, c=2)]

In [7]: mt.key_rows_by().key_cols_by().x.collect()
Out[7]:
[Struct(r=0, c=0),
 Struct(r=1, c=0),
 Struct(r=2, c=0),
 Struct(r=0, c=1),
 Struct(r=1, c=1),
 Struct(r=2, c=1),
 Struct(r=0, c=2),
 Struct(r=1, c=2),
 Struct(r=2, c=2)]

However, my branch is behind #5799. I'll try again on that. The lowering implementation may have fixed things (or may have bugs!)

tpoterba commented 5 years ago

seeing weird behavior:

In [2]: mt = hl.utils.range_matrix_table(3, 3)

In [3]: mt.annotate(x = hl.struct(r=mt.row_idx, c=mt.col_idx))

In [5]: mt = mt.annotate_entries(x = hl.struct(r=mt.row_idx, c=mt.col_idx))

In [6]: mt.x.collect()
Out[6]:
[Struct(r=0, c=0),
 Struct(r=0, c=1),
 Struct(r=0, c=2),
 Struct(r=1, c=0),
 Struct(r=1, c=1),
 Struct(r=1, c=2),
 Struct(r=2, c=0),
 Struct(r=2, c=1),
 Struct(r=2, c=2)]

In [7]: mt.key_rows_by().key_cols_by().x.collect()
Out[7]:
[Struct(r=0, c=0),
 Struct(r=1, c=0),
 Struct(r=2, c=0),
 Struct(r=0, c=1),
 Struct(r=1, c=1),
 Struct(r=2, c=1),
 Struct(r=0, c=2),
 Struct(r=1, c=2),
 Struct(r=2, c=2)]

However, my branch is behind #5799. I'll try again on that. The lowering implementation may have fixed things (or may have bugs!)

tpoterba commented 5 years ago

OK, this is fixed with the lowering implementation, but we do have a bug if the row key is empty but the col key is present.