pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License
20 stars 7 forks source link

revamped resources with job-dependent attributes #236

Closed nsheff closed 4 years ago

nsheff commented 4 years ago

Related to #170

The resources section is currently used to specify variables that depend on the input size of the sample, which are usually compute-related variables like memory, time, and core count. The way we've done this has never been very flexible or intuitive. we need a new system.

Here are three ideas for discussion:

A function

interface can specify any python function that returns a Dict, which would be added (to the compute namespace, I suppose -- see #235).

Example:

compute: 
  bulker_crate: databio/peppro
  function_added_attributes:
    file_path: xyz.py
    function_name: myfunc
    parameters: [looper.total_input_size, sample.setting] 

Built-in function

basically mimics current functionality, with a new way to describe it.

compute: 
  bulker_crate: databio/peppro
  size_dependent_variables: resources.tsv 

resources.tsv (relative to pipeline interface yaml) would be:

max_file_size   cores   mem time
0.001   1   8000    00-04:00:00
0.05    2   12000   00-08:00:00
0.5 4   16000   00-12:00:00
1   8   16000   00-24:00:00
10  16  32000   02-00:00:00
NaN 32  32000   04-00:00:00

This file format requires 1 column with name max_file_size, but otherwise can have any other columns.

Arbitrary command template

any shell command that returns a JSON

compute: 
  bulker_crate: databio/peppro
  fluid_attributes: >
    xyz.py --size {{ looper.total_input_size }} --sample {{ sample.setting }}
stolarczyk commented 4 years ago

having discussed the possibilities with @nsheff, "Built-in function" and "Arbitrary command template" will be added to looper. In that order.

stolarczyk commented 4 years ago

added id column to the TSV, changed max_file_size to file_size. Both are required for this to work. for example:

id  file_size   cores   mem time
1   0   1   8000    00-04:00:00
2   0.05    2   12000   00-08:00:00
3   0.5 4   16000   00-12:00:00
4   1   8   16000   00-24:00:00
5   10  16  32000   02-00:00:00
nsheff commented 4 years ago
  1. can the id be automatically generated? I see no reason for the user to identify these (even though we originally did it that way).

2.I find 'max_file_size' more intuitive. is there a reason you prefer file_size? or just because that's how it was?

nsheff commented 4 years ago

Can you also make the 'catch all' with file_size as NA instead of 0?

we earlier called this 'default' and set 'file_size" must be 0. but this is confusing.

stolarczyk commented 4 years ago
  1. can the id be automatically generated? I see no reason for the user to identify these (even though we originally did it that way).

yes

2.I find 'max_file_size' more intuitive. is there a reason you prefer file_size? or just because that's how it was?

no reason, none of them was more intuivitve for me, so I went with a shoter one. Will change to max..

stolarczyk commented 4 years ago

don't you think that names of the size_dependent_variables and fluid_attriubutes should follow the same scheme? The functionality of the latter is an extension of the first one. For example: size_dependent_variables and attribute_dependent_variables

jpsmith5 commented 4 years ago
nsheff commented 4 years ago

yes. the columns are not set in stone, they are whatever variables you want. you would just have to make sure to provide divvy templates that understand them.

there are no docs yet as it's a brand new feature :)

stolarczyk commented 4 years ago

a dummy command I used for testing. It's a python script, made executable (test_script.py):

#!/usr/bin/env python3

import json
from argparse import ArgumentParser

parser = ArgumentParser(description="Test script")

parser.add_argument("-n", "--name", help="project name", required=True)
parser.add_argument("-g", "--genome", type=str, help="sample genome", required=True)
parser.add_argument("-l", "--log-file", type=str, help="log file", required=True)
args = parser.parse_args()

# process inputs here and create a dict

y = json.dumps({
    "cores": "11",
    "mem": "11111",
    "time": "00-11:00:00",
    "logfile": args.log_file
})

print(y)

and then in the compute section under a specific pipeline in pipeline interface file compose a command using this script:

pipelines:
  bedstat:
    name: XXX
    path: pipeline/XXX.py
    schema: pep_schema.yaml
   command_template: >
      { pipeline.path } --bedfile { sample.output_file_path }
   compute:
      fluid_attributes: >
        test_script.py --name {project.name} --genome {sample.genome} --log-file {looper.logfile}

keep in mind that fluid_attributes key name will be changed!

nsheff commented 4 years ago

For example: size_dependent_variables and attribute_dependent_variables

the only issue I have is that it almost makes it seem like they work in parallel ways, but they don't...

I think it's a template, so something parallel to command_template makes more sense to me.

attribute_command_template ?

vreuter commented 4 years ago

I basically agree wholeheartedly with @jpsmith5 's thoughts above.

jpsmith5 commented 4 years ago

Okay, this is helpful example @stolarczyk. Regarding constructing this myself, I guess duh there's no docs yet, but what I, admittedly, haven't gone looking for is where is there a listing of all attributes available to me? Is there a page that defines all project.*, sample.*, looper.* et cetera attributes that I could use in the name to be changed fluid_attributes section/function call?

nsheff commented 4 years ago

s there a listing of all attributes available

https://github.com/pepkit/looper/issues/235

jpsmith5 commented 4 years ago

s there a listing of all attributes available

235

Perfect! Exactly what I was looking for.

nsheff commented 4 years ago

keep in mind that fluid_attributes key name will be changed!

the only issue I have is that it almost makes it seem like they work in parallel ways, but they don't... I think it's a template, so something parallel to command_template makes more sense to me. attribute_command_template ?

what was the final name choices for this?

stolarczyk commented 4 years ago

no changes were made yet. So it' still fluid_attributes and size_dependent_variables

stolarczyk commented 4 years ago

what about variables_command_template and size_dependent_variables?

nsheff commented 4 years ago

what about 'attributes' instead of 'variables'?

attributes_command_template and size_dependent_attributes then?

I might even go to:

dynamic_attributes_command_template -- is that too verbose? we can call these things dynamic attributes.

nsheff commented 4 years ago

Now I'm going back and think maybe variables is better than attributes, to distinguish from the way we use attributes in PEP.

stolarczyk commented 4 years ago

Now I'm going back and think maybe variables is better than attributes, to distinguish from the way we use attributes in PEP.

that's exactly why I proposed "variables"

nsheff commented 4 years ago

ok agreed, but

dynamic_attributes_command_template vs attributes_command_template

stolarczyk commented 4 years ago

I'm fine with the more descriptive one: dynamic_variables_command_template