saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

Allow to pass required files to a CU via input_data CUD attribute. #174

Open pradeepmantha opened 10 years ago

pradeepmantha commented 10 years ago

Having something like the following would be great. Currently we need to create separate DUs that contain exactly the files required for each CU. This could be optimized to avoid unnecessary DU creation.

    """ Parsing input data field of job description:
        {
         ...
         "input_data": [
                        {
                         input_data_unit.get_url():
                         ["file1", "file2"]
                        }
                       ]
        }

        or

        {
         ...
         "input_data": [
                        input_data_unit.get_url()
                       ]
        }
    """   

marksantcroos commented 10 years ago

Hi Pradeep,

Can you elaborate? I don’t really understand what you mean.

Thanks

Gr,

Mark

pradeepmantha commented 10 years ago

Consider the following example: I have a task that created 1000 files, and for each file I want to create a task that takes that file as input. With the current 'input_data' CUD attribute, I can only pass the entire contents of a DU. So I either need to create 1000 intermediate DUs, one for each file, and pass each DU as input to its task, or pass all 1000 files to each CU without creating intermediate DUs. Allowing the required input files to be specified as below would avoid creating intermediate DUs and just get the required files from the DU.

"input_data": [ { input_data_unit.get_url(): ["file1","file2"] } ]

Again this could be a flexibility that user/application can use.
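
To make the request concrete, here is a minimal sketch in plain Python of both sides of the proposal: building one CU description per file against a single shared DU, and normalizing the two accepted forms of "input_data" (a bare DU URL, or a mapping from DU URL to file names). Nothing here calls BigJob itself. The helper names `make_cud` and `normalize_input_data`, the DU URL string, the file names, and the `/bin/cat` command are all hypothetical placeholders.

```python
def make_cud(du_url, file_name):
    """Build a CU description that asks for one file out of a shared DU."""
    return {
        "executable": "/bin/cat",  # placeholder command
        "arguments": [file_name],
        "number_of_processes": 1,
        # Proposed mapping form: DU URL -> files this CU needs.
        "input_data": [{du_url: [file_name]}],
    }


def normalize_input_data(input_data):
    """Return (du_url, files) pairs; files is None when the whole DU is wanted."""
    normalized = []
    for entry in input_data:
        if isinstance(entry, str):
            # Bare URL: stage the entire DU (the existing behaviour).
            normalized.append((entry, None))
        elif isinstance(entry, dict):
            # Mapping form: stage only the listed files from each DU.
            for url, files in entry.items():
                normalized.append((url, list(files)))
        else:
            raise TypeError("unsupported input_data entry: %r" % (entry,))
    return normalized


du_url = "redis://localhost/bigjob:du-example"  # hypothetical DU URL
file_names = ["part-%04d.dat" % i for i in range(1000)]  # hypothetical names

# One CU per file, all pointing at the same DU.
cuds = [make_cud(du_url, name) for name in file_names]

first = normalize_input_data(cuds[0]["input_data"])
whole = normalize_input_data([du_url])
```

With this shape, the 1000 CUs share one DU and differ only in the file list, so no intermediate DUs are needed.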

marksantcroos commented 10 years ago

Hi Pradeep,

On 19 Feb 2014, at 19:26, pradeepmantha notifications@github.com wrote:

> Consider the following example: I have a task that created 1000 files, and for each file I want to create a task that takes that file as input.

Ok, clear.

> With the current 'input_data' CUD attribute, I can only pass the entire contents of a DU.

Correct, DUs are atomic units for good reasons.

> So I either need to create 1000 intermediate DUs, one for each file, and pass each DU as input to its task,

Agreed, what’s the problem with that? Isn’t that exactly what you want in this situation?

> or pass all 1000 files to each CU without creating intermediate DUs.

That’s obviously not what you want.

> Allowing the required input files to be specified as below would avoid creating intermediate DUs and just get the required files from the DU.
>
> "input_data": [ { input_data_unit.get_url(): ["file1","file2"] } ]

What does “just get the required files” actually mean here? What are the exact semantics of that?

In general, I believe I see what you want to do, but as far as I can tell this can be expressed perfectly with the current model, without breaking the actual semantics.

Moreover, for this specific pattern, it makes sense to add a layer on top of PD, which is exactly what you did, right?

Gr,

Mark
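
For comparison, the current model described above (one atomic DU per file, each passed whole to its CU) could be wrapped in a thin helper layer roughly like the following plain-Python sketch. It only builds description dictionaries and does not call BigJob or Pilot-Data. The helper name `per_file_workload`, the file URLs, and the `/bin/cat` command are hypothetical, and the "file_urls" key is assumed to be the DU description field that lists a DU's files.

```python
def per_file_workload(file_urls):
    """Pair one DU description with one CU description template per file."""
    workload = []
    for url in file_urls:
        file_name = url.rsplit("/", 1)[-1]
        # One atomic DU holding exactly one file ("file_urls" is an assumed key).
        du_description = {"file_urls": [url]}
        cu_template = {
            "executable": "/bin/cat",  # placeholder command
            "arguments": [file_name],
            "number_of_processes": 1,
            # To be filled with the DU's URL once the DU has been created.
            "input_data": [],
        }
        workload.append((du_description, cu_template))
    return workload


# Hypothetical file locations.
file_urls = ["ssh://localhost/tmp/out/part-%04d.dat" % i for i in range(3)]
workload = per_file_workload(file_urls)
```

The cost raised in this thread is visible here: the number of DU descriptions grows with the number of files.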

pradeepmantha commented 10 years ago

Hi,

On Wed, Feb 19, 2014 at 11:36 AM, Mark Santcroos notifications@github.com wrote:

> Hi Pradeep,
>
> On 19 Feb 2014, at 19:26, pradeepmantha notifications@github.com wrote:
>
>> Consider the following example: I have a task that created 1000 files, and for each file I want to create a task that takes that file as input.
>
> Ok, clear.
>
>> With the current 'input_data' CUD attribute, I can only pass the entire contents of a DU.
>
> Correct, DUs are atomic units for good reasons.
>
>> So I either need to create 1000 intermediate DUs, one for each file, and pass each DU as input to its task,
>
> Agreed, what's the problem with that? Isn't that exactly what you want in this situation?

It works, but I need to create 1000 intermediate DUs. I just want to avoid that for performance reasons.

>> or pass all 1000 files to each CU without creating intermediate DUs.
>
> That's obviously not what you want.
>
>> Allowing the required input files to be specified as below would avoid creating intermediate DUs and just get the required files from the DU.
>>
>> "input_data": [ { input_data_unit.get_url(): ["file1","file2"] } ]
>
> What does "just get the required files" actually mean here? What are the exact semantics of that?

The semantics are analogous to how the "output_data" CUD attribute currently behaves.

> In general, I believe I see what you want to do, but as far as I can tell this can be expressed perfectly with the current model, without breaking the actual semantics.

Yes, it's just a matter of implementation. I have actually implemented it in the Pradeep branch of BigJob.

> Moreover, for this specific pattern, it makes sense to add a layer on top of PD, which is exactly what you did, right?

Yes.

> Gr,
>
> Mark
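
One possible reading of "analogous to output_data" is that each entry in the file list is matched against the names of the files held by the DU, and only the matches are staged. The plain-Python sketch below illustrates that reading. It assumes entries may be shell-style patterns (such as "std*") as well as exact names, which this thread does not confirm. The helper name `select_files` and the file names are hypothetical.

```python
import fnmatch


def select_files(du_files, patterns):
    """Return the DU files whose names match any requested name or pattern."""
    selected = []
    for name in du_files:
        # fnmatchcase gives the same case-sensitive result on every platform.
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns):
            selected.append(name)
    return selected


# Hypothetical contents of a single DU.
du_files = ["file1", "file2", "file3", "stdout.txt", "stderr.txt"]

exact = select_files(du_files, ["file1", "file2"])
globbed = select_files(du_files, ["std*"])
```

Under this reading, a CU that lists two files receives only those two, while the DU itself stays a single unit.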
