ilovesoup / hyracks

Automatically exported from code.google.com/p/hyracks
Apache License 2.0

BLOCKING: OOM in complex job on Y! cluster #78

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I hit an OOM when running the following query:

select
  n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
from
  customer c join
    ( select n_name, l_extendedprice, l_discount, s_nationkey, o_custkey from orders o join
      ( select n_name, l_extendedprice, l_discount, l_orderkey, s_nationkey from lineitem l join
        ( select n_name, s_suppkey, s_nationkey from supplier s join
          ( select n_name, n_nationkey
            from nation n join region r
            on n.n_regionkey = r.r_regionkey and r.r_name = 'ASIA'
          ) n1 on s.s_nationkey = n1.n_nationkey
        ) s1 on l.l_suppkey = s1.s_suppkey
      ) l1 on l1.l_orderkey = o.o_orderkey and o.o_orderdate >= '1994-01-01'
              and o.o_orderdate < '1995-01-01'
    ) o1 on c.c_nationkey = o1.s_nationkey and c.c_custkey = o1.o_custkey
group by n_name
order by revenue desc;

Both the join buffer and the grouping buffer are set to 32MB.
On the asterix cluster, there is no OOM when the join buffer and the grouping
buffer are 256MB.

However, on the Y! cluster, the following OOM occurs:

Exception in thread "Thread-1" java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at edu.uci.ics.hyracks.control.nc.runtime.RootHyracksContext.allocateFrame(RootHyracksContext.java:44)
        at edu.uci.ics.hyracks.control.nc.net.NetworkInputChannel.open(NetworkInputChannel.java:100)
        at edu.uci.ics.hyracks.dataflow.std.collectors.PartitionCollector.addPartitions(PartitionCollector.java:54)
        at edu.uci.ics.hyracks.control.nc.Joblet.reportPartitionAvailability(Joblet.java:236)
        at edu.uci.ics.hyracks.control.nc.work.ReportPartitionAvailabilityWork.doRun(ReportPartitionAvailabilityWork.java:54)
        at edu.uci.ics.hyracks.control.common.work.SynchronizableWork.run(SynchronizableWork.java:32)
        at edu.uci.ics.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:115)

The degree of parallelism is 680 on the Y! cluster, so there are many channels
(connections) between operators.
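
To make the suspicion concrete, here is a rough back-of-the-envelope sketch of how much heap the network input buffers alone could take. It assumes that each receiving task opens one input channel per sender and that each channel pre-allocates a fixed number of frames. The class and method names are hypothetical, and the 32KB frame size is an assumption that is not stated in this issue; only the parallelism of 680 and the per-channel settings of 5 and 1 come from this thread.

```java
public class ChannelMemoryEstimate {

    /**
     * Bytes of receive buffers held by one receiving task, assuming it opens
     * one channel per sender and each channel pre-allocates a fixed number of frames.
     */
    public static long bytesPerReceiver(int senders, int buffersPerChannel, int frameSizeBytes) {
        // Widen to long before multiplying to avoid int overflow.
        return (long) senders * buffersPerChannel * frameSizeBytes;
    }

    public static void main(String[] args) {
        int senders = 680;              // degree of parallelism reported in this issue
        int frameSizeBytes = 32 * 1024; // ASSUMPTION: 32KB frames; not stated in this issue

        // 5 is the original per-channel setting, 1 is the reduced setting tried later in this thread.
        for (int buffers : new int[] {5, 1}) {
            long bytes = bytesPerReceiver(senders, buffers, frameSizeBytes);
            System.out.printf("%d buffer(s) per channel: %.2f MB per receiving task%n",
                    buffers, bytes / (1024.0 * 1024.0));
        }
    }
}
```

Under these assumptions a single receiving task would hold about 106MB of input buffers with 5 frames per channel and about 21MB with 1 frame per channel, before counting the join and grouping buffers. The real figure depends on the actual frame size, the number of tasks per node, and the JVM heap size, none of which are given here.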

Original issue reported on code.google.com by buyingyi@gmail.com on 16 Jun 2012 at 8:58

GoogleCodeExporter commented 9 years ago
I set the buffer size of NetworkInputChannel and NetworkOutputChannel to 1
(originally it was 5). The OOM seems to be gone.

However, this query now hangs during execution. The attachment contains all the
logs, including a JVM dump, and the job-run file.

Original comment by buyingyi@gmail.com on 17 Jun 2012 at 8:15

GoogleCodeExporter commented 9 years ago
Can you try to start with the innermost query and grow it one join at a time 
till you see issues?

Original comment by vinay...@gmail.com on 17 Jun 2012 at 6:20

GoogleCodeExporter commented 9 years ago
This simplified join query does not work:

select O_TOTALPRICE
from orders join customer
on orders.O_CUSTKEY=customer.C_CUSTKEY;

The query hangs on the cluster: there is no exception, but also no progress.
Attached are the CC and NC logs and the job-run file from the adminconsole.

Original comment by buyingyi@gmail.com on 17 Jun 2012 at 10:53
