mitdbg / aurum-datadiscovery

MIT License
74 stars 49 forks

Parallelism granularity #48

Open raulcf opened 8 years ago

raulcf commented 8 years ago

Right now the granularity is per table, to avoid redundant reads. By splitting the data in memory we could provide finer-grained parallelism, i.e. per column, while still avoiding redundant reads.
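A minimal sketch of the idea, assuming CSV-like files (the names `profile_column` and `process_file` are illustrative, not Aurum's actual API): each file is read exactly once, its columns are split in memory, and a pool then works at column granularity.

```python
# Hypothetical sketch of per-column parallelism: read each file once
# (avoiding redundant reads), split columns in memory, then fan out
# one task per column instead of one task per table.
import csv
from concurrent.futures import ThreadPoolExecutor

def profile_column(name_values):
    # Toy per-column task: count distinct values
    # (a stand-in for real column profiling work).
    name, values = name_values
    return name, len(set(values))

def process_file(path, pool):
    # Single pass over the file; columns are materialized in memory.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = {h: [] for h in header}
        for row in reader:
            for h, v in zip(header, row):
                columns[h].append(v)
    # Dispatch each column to the pool independently.
    return dict(pool.map(profile_column, columns.items()))
```

A process pool would be the natural choice for CPU-bound profiling; a thread pool is used here only to keep the sketch self-contained.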

raulcf commented 7 years ago

Some evidence: in data.gov, 30 out of 10K files contain ~50% of the data
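One way such a skew statistic could be computed (a sketch, not how the data.gov number was actually derived): sort file sizes in descending order and count how many of the largest files it takes to cover half of the total bytes.

```python
# Hypothetical helper: how many of the largest files cover a given
# fraction of the total data volume? Sizes are in bytes (made-up input).
def files_covering_fraction(sizes_bytes, fraction=0.5):
    total = sum(sizes_bytes)
    covered = 0
    # Walk files from largest to smallest, accumulating bytes.
    for count, size in enumerate(sorted(sizes_bytes, reverse=True), start=1):
        covered += size
        if covered >= fraction * total:
            return count
    return len(sizes_bytes)
```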

sgt101 commented 7 years ago

Interesting stat... what does it mean? Amount of data in MB? Or number of rows?

We could find these stats for BT's warehouses if that helps?


mansoure commented 7 years ago

This is a very interesting problem. @raulcf how far along are you with this issue? I think I can help here.