Having discussed the possibilities with @nsheff, "Built-in function" and "Arbitrary command template" will be added to looper, in that order.
Added an `id` column to the TSV and changed `max_file_size` to `file_size`. Both are required for this to work. For example:
```
id  file_size  cores  mem    time
1   0          1      8000   00-04:00:00
2   0.05       2      12000  00-08:00:00
3   0.5        4      16000  00-12:00:00
4   1          8      16000  00-24:00:00
5   10         16     32000  02-00:00:00
```
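For reference, the lookup this table implies could work roughly like the sketch below (an illustration, not looper's actual code; I'm reading the table as "the row with the largest `file_size` not exceeding the input size wins", so the `0` row is the default):

```python
import csv

def select_resources(tsv_path, input_size_gb):
    """Pick the row with the largest file_size that is <= the input size."""
    with open(tsv_path) as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    eligible = [r for r in rows if float(r["file_size"]) <= input_size_gb]
    return max(eligible, key=lambda r: float(r["file_size"]))

# e.g., a 5 GB input selects the file_size=1 row: 8 cores, 16000 MB memory
```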
- I find `max_file_size` more intuitive. Is there a reason you prefer `file_size`, or is it just because that's how it was?
- Can you also make the 'catch-all' row use `NA` for `file_size` instead of `0`? We earlier called this row 'default' and required its `file_size` to be `0`, but that is confusing.
- Can the `id` be automatically generated? I see no reason for the user to identify these (even though we originally did it that way).
> can the `id` be automatically generated?

Yes.
> I find `max_file_size` more intuitive. Is there a reason you prefer `file_size`?

No reason; neither name was more intuitive to me, so I went with the shorter one. I will change it to `max_file_size`.
Don't you think the names `size_dependent_variables` and `fluid_attributes` should follow the same scheme? The functionality of the latter is an extension of the former. For example: `size_dependent_variables` and `attribute_dependent_variables`.
The `resources.tsv` concept seems pretty clear. To be sure I understand its full power: could I also include any additional cluster-related requests? Could there be columns for `partition` and `account`, for example? For instance, for a very large file I may want to request so much memory that it requires using a largemem partition.

+1 for the `attribute_dependent_variables` terminology, following the `size_dependent` naming convention.
Yes. The columns are not set in stone; they are whatever variables you want. You would just have to make sure to provide divvy templates that understand them.

There are no docs yet, as it's a brand-new feature :)
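For example, a divvy submission template along these lines could consume extra columns like `partition` and `account` (a sketch; the `{PARTITION}` and `{ACCOUNT}` variables here are hypothetical additions, only populated if your TSV provides those columns):

```
#!/bin/bash
#SBATCH --job-name='{JOBNAME}'
#SBATCH --output='{LOGFILE}'
#SBATCH --mem='{MEM}'
#SBATCH --cpus-per-task={CORES}
#SBATCH --time='{TIME}'
#SBATCH --partition={PARTITION}
#SBATCH --account={ACCOUNT}

{CODE}
```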
Here is a dummy command I used for testing. It's a Python script, made executable (`test_script.py`):
```python
#!/usr/bin/env python3
import json
from argparse import ArgumentParser

parser = ArgumentParser(description="Test script")
parser.add_argument("-n", "--name", help="project name", required=True)
parser.add_argument("-g", "--genome", type=str, help="sample genome", required=True)
parser.add_argument("-l", "--log-file", type=str, help="log file", required=True)
args = parser.parse_args()

# process inputs here and create a dict
y = json.dumps({
    "cores": "11",
    "mem": "11111",
    "time": "00-11:00:00",
    "logfile": args.log_file,
})
print(y)
```
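Invoked by hand (a hypothetical invocation, just to show the shape of the output), it prints the JSON that looper would consume:

```console
$ ./test_script.py --name demo --genome hg38 --log-file /tmp/demo.log
{"cores": "11", "mem": "11111", "time": "00-11:00:00", "logfile": "/tmp/demo.log"}
```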
And then, in the `compute` section under a specific pipeline in the pipeline interface file, compose a command using this script:
```yaml
pipelines:
  bedstat:
    name: XXX
    path: pipeline/XXX.py
    schema: pep_schema.yaml
    command_template: >
      {pipeline.path} --bedfile {sample.output_file_path}
    compute:
      fluid_attributes: >
        test_script.py --name {project.name} --genome {sample.genome} --log-file {looper.logfile}
```
Keep in mind that the `fluid_attributes` key name will be changed!
> For example: `size_dependent_variables` and `attribute_dependent_variables`

The only issue I have is that it almost makes it seem like they work in parallel ways, but they don't... I think it's a template, so something parallel to `command_template` makes more sense to me. `attribute_command_template`?
I basically agree wholeheartedly with @jpsmith5's thoughts above.
Okay, this is a helpful example, @stolarczyk. Regarding constructing this myself: I know there are no docs yet, but what I admittedly haven't gone looking for is a listing of all attributes available to me. Is there a page that defines all the `project.*`, `sample.*`, `looper.*`, etc. attributes that I could use in the (name to be changed) `fluid_attributes` section/function call?
> is there a listing of all attributes available

#235
Perfect! Exactly what I was looking for.
> keep in mind that the `fluid_attributes` key name will be changed!

> the only issue I have is that it almost makes it seem like they work in parallel ways, but they don't... I think it's a template, so something parallel to `command_template` makes more sense to me. `attribute_command_template`?

What was the final name choice for this?
No changes have been made yet, so it's still `fluid_attributes` and `size_dependent_variables`.

What about `variables_command_template` and `size_dependent_variables`?
What about 'attributes' instead of 'variables'? `attributes_command_template` and `size_dependent_attributes`, then?
I might even go to `dynamic_attributes_command_template`. Is that too verbose? We could call these things dynamic attributes.

Now I'm going back and thinking maybe `variables` is better than `attributes`, to distinguish from the way we use attributes in PEP.
> Now I'm going back and thinking maybe `variables` is better than `attributes`, to distinguish from the way we use attributes in PEP.

That's exactly why I proposed 'variables'.
OK, agreed. But `dynamic_attributes_command_template` vs. `attributes_command_template`?

I'm fine with the more descriptive one: `dynamic_variables_command_template`.
Related to #170
The `resources` section is currently used to specify variables that depend on the input size of the sample, which are usually compute-related variables like memory, time, and core count. The way we've done this has never been very flexible or intuitive; we need a new system. Here are three ideas for discussion:

**A function**

The interface can specify any Python function that returns a `Dict`, which would be added (to the `compute` namespace, I suppose -- see #235). Example:
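A minimal sketch of what such a function might look like (hypothetical; the function name, its argument, and the thresholds are all assumptions, not an actual looper API):

```python
def compute_resources(sample):
    """Return a dict of compute variables derived from sample attributes."""
    # input_file_size is assumed to be provided in GB
    size = float(getattr(sample, "input_file_size", 0))
    if size < 1:
        return {"cores": "4", "mem": "16000", "time": "00-12:00:00"}
    return {"cores": "16", "mem": "32000", "time": "02-00:00:00"}
```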
**Built-in function**

Basically mimics current functionality, with a new way to describe it. A `resources.tsv` file (relative to the pipeline interface YAML) would be provided; see the sketch below. This file format requires one column named `max_file_size`, but otherwise can have any other columns.
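Such a file might look like this (illustrative values only, echoing the table earlier in the thread, with an `NA` catch-all row as discussed):

```
max_file_size  cores  mem    time
0.05           1      8000   00-04:00:00
0.5            2      12000  00-08:00:00
1              4      16000  00-12:00:00
10             8      16000  00-24:00:00
NA             16     32000  02-00:00:00
```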
**Arbitrary command template**

Any shell command that returns a JSON.
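For instance (a sketch using the `dynamic_variables_command_template` name settled on above; any executable that prints JSON to stdout would work the same way), even a one-liner would do:

```yaml
compute:
  dynamic_variables_command_template: >
    echo '{"cores": "4", "mem": "8000", "time": "00-02:00:00"}'
```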