saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

Allow multiple CUs to write output files to a single DU. #173

Open pradeepmantha opened 10 years ago

pradeepmantha commented 10 years ago

Use case: in MapReduce, multiple map tasks generate data related to a single reduce task.

Currently all the output files of a map task are stored together in one DU, and the MapReduce framework later segregates the files belonging to each reduce task and passes new DUs as inputs to the reduce tasks.

This could be optimized if we allowed a map task to write its output files directly to the reduce DUs.

I envision this could be a useful feature for other use cases too.

There could be some concurrency problems with metadata updates. I just want to log this, as we might come up with a solution to make this possible.
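To make the metadata-concurrency concern concrete, here is a minimal, hypothetical sketch; the ToyDataUnit class and its methods are illustrations only, not part of the current BigJob/Pilot-API. It shows how per-DU metadata updates could be serialized if several map CUs were allowed to register output files in one reduce DU.

```python
# Hypothetical sketch: serialize per-DU metadata updates when many map CUs
# register their output files directly with a shared reduce DU.
import threading

class ToyDataUnit:
    """Toy stand-in for a DU: a list of file references plus a lock."""
    def __init__(self, du_id):
        self.du_id = du_id
        self.files = []                 # file references registered so far
        self._lock = threading.Lock()   # serializes metadata updates

    def add_files(self, cu_id, file_refs):
        # Each writing CU appends its files under the lock, so concurrent
        # registrations from many map tasks cannot corrupt the metadata.
        with self._lock:
            self.files.extend((cu_id, f) for f in file_refs)

# 8 reduce DUs, written to directly by (here) a handful of simulated map CUs.
reduce_dus = [ToyDataUnit("reduce-du-%d" % j) for j in range(8)]

def run_map_task(task_id):
    # Each map task produces one file per reduce partition and registers it
    # directly with the corresponding reduce DU (the proposed behaviour).
    for j, du in enumerate(reduce_dus):
        du.add_files("map-%d" % task_id, ["file%d-%d" % (task_id, j)])

threads = [threading.Thread(target=run_map_task, args=(i,)) for i in range(16)]
for t in threads: t.start()
for t in threads: t.join()
print(len(reduce_dus[0].files))  # 16: one file from every simulated map task
```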

pradeepmantha commented 10 years ago

The task stays in the 'Running' state until its output data has been moved, which is captured by the output DU. This should be decoupled: the task should be moved to the 'Done' state to improve concurrency, and the output DU should take care of moving the data.
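A minimal sketch of the decoupling being suggested here, assuming a hypothetical ToyOutputDU class and a thread pool for the transfer (none of this is existing BigJob behaviour): the CU is reported as 'Done' as soon as its computation finishes, while the DU tracks the data movement with its own state.

```python
# Hypothetical sketch only: the CU finishes and is marked 'Done' immediately,
# while the output DU owns the (possibly long) transfer and its own state.
import time
from concurrent.futures import ThreadPoolExecutor

class ToyOutputDU:
    def __init__(self):
        self.state = "Pending"
        self.staged = []

    def stage_out(self, files):
        self.state = "Transferring"
        time.sleep(1.0)               # stand-in for an scp/GlobusOnline transfer
        self.staged = list(files)
        self.state = "Ready"

executor = ThreadPoolExecutor(max_workers=4)

def run_compute_unit(task_id, output_du):
    time.sleep(0.1)                   # stand-in for the actual computation
    files = ["file%d-%d" % (task_id, j) for j in range(8)]
    executor.submit(output_du.stage_out, files)   # DU moves the data asynchronously
    return "Done"                     # CU no longer blocks on the transfer

du = ToyOutputDU()
print(run_compute_unit(1, du))        # prints "Done" while the transfer still runs
```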

marksantcroos commented 10 years ago

Hi Pradeep, apart from the semantic problems this would create, why is it an optimization to have all data in one DU?

pradeepmantha commented 10 years ago

Consider the example below: I have 1000 map tasks running, where each task generates a set of output files that later need to be grouped into 8 DUs. With the current framework I need to create 1000 empty intermediate output DUs to store the files of each task, and later create 8 more DUs and redistribute the files between them. Having the 1000 tasks write directly to the 8 DUs would eliminate the creation of the 1000 intermediate DUs and the associated wait time. This is useful for applications where high performance is required and even milliseconds matter. I think this is additional flexibility that can be provided, and it is up to the application whether to use it or not.

marksantcroos commented 10 years ago

Hi Pradeep,

On 19 Feb 2014, at 19:19 , pradeepmantha notifications@github.com wrote:

Consider the example below: I have 1000 map tasks running, where each task generates a set of output files that later need to be grouped into 8 DUs.

Ok, clear.

With the current framework I need to create 1000 empty intermediate output DUs to store the files of each task,

Ok.

and later create 8 more DUs and redistribute the files between them.

Arguably you don’t need those 8 DUs but could divide the 1000 DUs in 8 piles, right? (Not saying you should, but to get the semantics clear)

Having the 1000 tasks write directly to the 8 DUs would eliminate the creation of the 1000 intermediate DUs and the associated wait time. This is useful for applications where high performance is required and even milliseconds matter. I think this is additional flexibility that can be provided, and it is up to the application whether to use it or not.

I hear your concern about performance, which is important of course. When performance is at play, it is not necessarily required to change the semantics; changing the implementation might be enough.

Having "mixed" writing to a DU opens up a whole can of worms:

  • What if the CUs run on different DCIs with different transfer mechanisms?
  • What if you want to replicate a DU?
  • What will be the state model of a DU?

Note that I'm not against your "wish", but your proposal has far-reaching implications, so it needs to be considered carefully.

Gr,

Mark

pradeepmantha commented 10 years ago

Hi,

Arguably you don't need those 8 DUs but could divide the 1000 DUs in 8 piles, right?

  • No. Each DU of the 8-DU group needs one file from each DU of the 1000-DU group. The case is explained below with a detailed example.

i.e. Task 1 creates {file1-0, file1-1, file1-2, ..., file1-7}, Task 2 creates {file2-0, file2-1, ..., file2-7}, ..., Task 1000 creates {file1000-0, file1000-1, ..., file1000-7}.

Now the 8 DUs should look like:

DU-1 = {file1-0, file2-0, file3-0, ..., file1000-0}

...

DU-8 = {file1-7, file2-7, ..., file1000-7}

So each of the 8 DUs needs one file from each of the 1000 DUs, which has a large performance impact in the MapReduce framework. So I can't simply divide the 1000 DUs into 8 groups.
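To make the pattern concrete, here is a plain-Python illustration of the grouping (no Pilot-API calls involved); the file names follow the example above, with task i producing one file per reduce partition.

```python
# Plain-Python illustration of the grouping pattern described above:
# task i produces files "file{i}-0" .. "file{i}-7", and reduce DU j
# must end up holding the j-th file of every task.
NUM_TASKS, NUM_REDUCE_DUS = 1000, 8

# Output of the map phase: one list of 8 files per task
# (this is what currently lands in 1000 intermediate DUs, one per task).
task_outputs = {
    i: ["file%d-%d" % (i, j) for j in range(NUM_REDUCE_DUS)]
    for i in range(1, NUM_TASKS + 1)
}

# The regrouping ("shuffle"): reduce DU j collects the j-th file of every task,
# so each of the 8 DUs draws exactly one file from each of the 1000 tasks.
reduce_dus = {
    j: [task_outputs[i][j] for i in range(1, NUM_TASKS + 1)]
    for j in range(NUM_REDUCE_DUS)
}

assert reduce_dus[0][:3] == ["file1-0", "file2-0", "file3-0"]
assert len(reduce_dus[7]) == NUM_TASKS
```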

Having "mixed" writing to a DU opens up a whole can of worms:

  • What if the CUs run on different DCIs with different transfer mechanisms?
     - I am still trying to understand how the mixing can be avoided with the current PD design. Consider CUs running on an XSEDE cluster that supports the GlobusOnline PD and on FG Sierra (ssh/scp PD), with the target DU residing on Sierra.
     - With the current PD: Condor tasks push data to the Condor Pilot-Data and Sierra tasks push data to the Sierra Pilot-Data. Then, for the grouping, the 8 DUs have to be prepared by mixing the 1000 DUs from both the Condor and the Sierra Pilot-Data (which is not yet supported) and moved to the Sierra Pilot-Data.
     - With CUs writing data directly to the DU: Condor tasks push data directly to the DU on Sierra, based on the affinity of the target DU and the Pilot-Data.
     Both cases need some effort.
  • What if you want to replicate a DU?
     - Could be similar to the above.
  • What will be the state model of a DU?
     - It might affect the state of the CU, as the data transfer from CU to DU could be long and would need an intermediate state to avoid blocking CPU resources until the transfer is finished.

Note that I'm not against your "wish", but your proposal has far reaching implications, so it needs to be considered carefully.

  • Yes, I am mostly thinking from the perspective of Pilot-MapReduce, which is just one use case. But providing the above features, or hooks to enable these kinds of features, will help application performance. These could be helper functions, leaving the application to figure out what is best for it.

Thanks Pradeep

marksantcroos commented 10 years ago

Hi Pradeep,

Still had this open, gave it some more thought, triggered by the Two Tales paper.

Long ago pradeepmantha notifications@github.com wrote:

Arguably you don't need those 8 DUs but could divide the 1000 DUs in 8 piles, right?

  • No. Each DU of the 8-DU group needs one file from each DU of the 1000-DU group. The case is explained below with a detailed example.

i.e. Task 1 creates {file1-0, file1-1, file1-2, ..., file1-7}, Task 2 creates {file2-0, file2-1, ..., file2-7}, ..., Task 1000 creates {file1000-0, file1000-1, ..., file1000-7}.

Now the 8 DUs should look like:

DU-1 = {file1-0, file2-0, file3-0, ..., file1000-0}

...

DU-8 = {file1-7, file2-7, ..., file1000-7}

So each of the 8 DUs needs one file from each of the 1000 DUs, which has a large performance impact in the MapReduce framework. So I can't simply divide the 1000 DUs into 8 groups.

Ok, I didn’t understand the pattern correctly, thanks for clarifying.

But I'm still tempted to say that you should "just" do either:

A. Create 8000 DUs. Then you either use all of these DUs independently or you merge them into 8 DUs.
B. Create 1000 DUs. Then you split and merge them into 8 new DUs.

If we don't talk about performance at this stage, then from a semantics/expressiveness perspective I believe both of these do what you need.

Of course both A and B will have different performance characteristics, but I would consider that an optimisation problem once we settle on the semantics.

Having "mixed" writing to a DU opens up a whole can of worms:

  • What if the CUs run on different DCIs with different transfer mechanisms?
  • I am still trying to understand how the mixing can be avoided with the current PD design. Consider CUs running on an XSEDE cluster that supports the GlobusOnline PD and on FG Sierra (ssh/scp PD), with the target DU residing on Sierra.
  • With the current PD: Condor tasks push data to the Condor Pilot-Data and Sierra tasks push data to the Sierra Pilot-Data. Then, for the grouping, the 8 DUs have to be prepared by mixing the 1000 DUs from both the Condor and the Sierra Pilot-Data (which is not yet supported) and moved to the Sierra Pilot-Data.

With CUs writing data directly to the DU:

  • Condor tasks push data directly to the DU on Sierra, based on the affinity of the target DU and the Pilot-Data.

Both cases need some effort.

Regardless of which route we go there, I'm tempted to say that the initial output DUs should be written to by just one CU, for the simple reason that CUs on different DCIs might simply not be able to write to the same place. That is a real showstopper for mixed DU writing.

  • What if you want to replicate a DU?
  • Could be similar to the above.

That's a bit too ambiguous... If multiple CUs can write to a DU, I think it complicates the state model, because its state now depends on many CUs instead of just one, and that makes the decision whether (and when) to replicate more difficult.

  • What will be the state model of a DU?
  • It might affect the state of the CU, as the data transfer from CU to DU could be long and would need an intermediate state to avoid blocking CPU resources until the transfer is finished.

See my comment about states above.
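A small hypothetical sketch of that state-model complication: if a DU can have many writer CUs, it can only be considered complete (and thus safe to replicate) once every expected writer has reported in, so its state depends on N CUs instead of one. The MultiWriterDU class below is an illustration, not part of the Pilot-API.

```python
# Hypothetical sketch: a multi-writer DU cannot be marked complete (and
# cannot safely be replicated) until all expected writer CUs have finished.
import threading

class MultiWriterDU:
    def __init__(self, expected_writers):
        self.expected_writers = expected_writers
        self.finished_writers = set()
        self.state = "Incomplete"
        self._lock = threading.Lock()

    def writer_done(self, cu_id):
        with self._lock:
            self.finished_writers.add(cu_id)
            if len(self.finished_writers) == self.expected_writers:
                self.state = "Complete"     # only now is replication safe

du = MultiWriterDU(expected_writers=1000)
du.writer_done("map-1")
print(du.state)   # still "Incomplete": 999 writers outstanding
```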

Overall, I fully agree that the current granularity of PD makes use cases like yours complex. But that's also why you created another layer that abstracts much of that away.

Then the issue of having to manage 1, 8, 1000, or 8000 DUs becomes "just" a performance issue. To work on the performance aspects, I can see that we need to create some more possibilities to split/merge DUs to support your use case as well as we can!
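As a starting point, here is a minimal sketch of what such split/merge helpers could look like. The names merge_dus and split_du are hypothetical, not part of today's Pilot-API, and they operate on plain lists of file references purely for illustration.

```python
# Hypothetical split/merge helpers, operating on plain lists of file names.
def merge_dus(dus):
    """Route A: merge many (e.g. 8000 single-file) DUs into one DU."""
    return [f for du in dus for f in du]

def split_du(du, key):
    """Route B: split one DU into several, grouping its files by key(file)."""
    groups = {}
    for f in du:
        groups.setdefault(key(f), []).append(f)
    return groups

# Example with the naming scheme from this thread: "file<task>-<partition>".
per_task_dus = [["file%d-%d" % (i, j) for j in range(8)] for i in range(1, 1001)]
by_partition = split_du(merge_dus(per_task_dus), key=lambda f: f.rsplit("-", 1)[1])
assert len(by_partition) == 8 and len(by_partition["0"]) == 1000
```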

Gr,

Mark