pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

referring to sample yaml in command template #272

Closed nsheff closed 2 months ago

nsheff commented 4 years ago

What name does the command template use to refer to the sample yaml? We don't list it in the docs... is it looper.sample_yaml?

stolarczyk commented 4 years ago

I don't think there is one.. should we add it?

nsheff commented 4 years ago

definitely.

There is a way to specify where it's saved... I guess we just need to have that available in the looper namespace...

stolarczyk commented 4 years ago

I was wrong, I made the path accessible, but via {sample.yaml_file}. I think it is even better this way, because looper namespace is submission-specific, not sample-specific. For instance, if we somehow moved sample yaml path info to the looper namespace and lumped samples, we would not have access to the individual sample yaml files in the command template. By keeping it in sample, we do have the access.

An example looper namespace (looper --verb 4 run --limit 2 --lumpn 2):

DEBU 09:38:06 | looper.conductor:conductor:431 > looper namespace:
AttMap
pep_config: /sfs/qumulo/qhome/mjs5kd/code/paqc/paqc.yaml
results_subdir: /project/shefflab/processed/paqc_michal/results_pipeline
submission_subdir: /project/shefflab/processed/paqc_michal/submission
output_dir: /project/shefflab/processed/paqc_michal
sample_output_folder: /project/shefflab/processed/paqc_michal/results_pipeline/lump1
job_name: PEPATAC_lump1
total_input_size: 5.297424949705601
log_file: /project/shefflab/processed/paqc_michal/submission/PEPATAC_lump1.log
command: /home/mjs5kd/code/pepatac/pipelines/pepatac.py --sample-name GSM4289908 --genome hg38 --input /project/shefflab/data/sra_fastq//SRR10988638.fastq.gz --single-or-paired SINGLE -O /project/shefflab/processed/paqc_michal/results_pipeline -P 32 -M 24000 --aaaa /project/shefflab/processed/paqc_michal/submission/GSM4289908.yaml      --prealignments human_rDNA                \n/home/mjs5kd/code/pepatac/pipelines/pepatac.py --sample-name GSM4196904 --genome hg38 --input /project/shefflab/data/sra_fastq//SRR10560444_1.fastq.gz --single-or-paired PAIRED -O /project/shefflab/processed/paqc_michal/results_pipeline -P 32 -M 24000 --aaaa /project/shefflab/processed/paqc_michal/submission/GSM4196904.yaml  --input2 /project/shefflab/data/sra_fastq//SRR10560444_2.fastq.gz      --prealignments human_rDNA 
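
The lumping argument can be illustrated with a small sketch. This is not looper's implementation: the `render_lumped_command` helper, the simplified template, and the shortened paths are assumptions made for illustration. It only shows that a per-sample attribute such as `{sample.yaml_file}` stays distinct when several samples share one submission, which a single submission-level value could not do.

```python
from types import SimpleNamespace

# Hypothetical, simplified template; only {sample.yaml_file} is taken from the thread.
TEMPLATE = "pepatac.py --sample-name {sample.sample_name} --aaaa {sample.yaml_file}"


def render_lumped_command(template, looper, samples):
    """Render the template once per sample and join the results into one job command."""
    return "\n".join(template.format(looper=looper, sample=s) for s in samples)


# One looper namespace per submission (job), shared by every lumped sample.
looper = SimpleNamespace(job_name="PEPATAC_lump1")

# One sample namespace per sample, each carrying its own YAML path (paths shortened).
samples = [
    SimpleNamespace(sample_name="GSM4289908", yaml_file="submission/GSM4289908.yaml"),
    SimpleNamespace(sample_name="GSM4196904", yaml_file="submission/GSM4196904.yaml"),
]

command = render_lumped_command(TEMPLATE, looper, samples)
print(command)
```

Each rendered line ends with its own sample's YAML path, matching the two commands in the debug output above.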

This also made me realize that looper.sample_output_folder should probably be renamed to looper.job_output_folder, since it's not necessarily equivalent to sample_name.

nsheff commented 4 years ago

that makes sense -- but the problem is, what if the sample table has a column named sample_yaml? Using looper to overwrite sample attributes is dangerous. It sounds like you need a looper.sample namespace.
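
A minimal sketch of the collision described here, using plain dictionaries rather than looper's own objects. The `looper_sample` key is a hypothetical name for the separate namespace suggested above, and the values are made up for illustration.

```python
# Attributes read from the user's sample table; column names are the user's choice.
sample_row = {"sample_name": "GSM4289908", "sample_yaml": "user_supplied_value"}

# Risky: writing a looper-generated value into the sample attributes
# silently replaces the user's column of the same name.
overwritten = dict(sample_row)
overwritten["sample_yaml"] = "submission/GSM4289908.yaml"

# Safer: keep looper-generated, sample-specific values in a namespace of their own.
namespaces = {
    "sample": dict(sample_row),  # user data stays untouched
    "looper_sample": {"sample_yaml": "submission/GSM4289908.yaml"},  # hypothetical name
}

print(overwritten["sample_yaml"])
print(namespaces["sample"]["sample_yaml"])
```

With separate namespaces both values remain reachable, so a template can pick either one explicitly.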

On sample_output_folder -- here we say it is derived from sample name:

http://looper.databio.org/en/latest/variable-namespaces/

sample_output_folder -- a sample-specific output folder (results_subdir/sample.sample_name)

So, it should rather move out of the looper namespace, since it's sample-specific, like sample_yaml

stolarczyk commented 4 years ago

> Using looper to overwrite sample attributes is dangerous

you're right, this needs to change then

> sample_output_folder -- a sample-specific output folder (results_subdir/sample.sample_name)
>
> So, it should rather move out of the looper namespace, since it's sample-specific, like sample_yaml

I think it's not sample-specific, but submission-specific. When there is no lumping applied we just name the directory after sample.sample_name. In general you always get one sample_output_folder (or rather job_output_folder) per submission, not per sample. Unlike sample_yaml which is always one per sample. And yes, the docs are not specific enough. It's true just for the "no lump" scenario.
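
The one-folder-per-submission rule can be sketched as follows. The `job_output_folder` function is a hypothetical helper, not part of looper. The `lump1` naming is taken from the debug output earlier in this thread, and the rest is an assumption.

```python
import posixpath


def job_output_folder(results_subdir, sample_names, lump_index=None):
    """Return the single output folder for one submission (hypothetical helper)."""
    if lump_index is None:
        # No lumping: one sample per submission, so the folder is named after the sample.
        if len(sample_names) != 1:
            raise ValueError("a submission with several samples needs a lump index")
        return posixpath.join(results_subdir, sample_names[0])
    # Lumping: several samples share one folder named after the lump.
    return posixpath.join(results_subdir, f"lump{lump_index}")


print(job_output_folder("results_pipeline", ["GSM4289908"]))
print(job_output_folder("results_pipeline", ["GSM4289908", "GSM4196904"], lump_index=1))
```

Either way there is exactly one folder per submission, which is why the name job_output_folder fits better than sample_output_folder.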

nsheff commented 4 years ago

> you're right, this needs to change then

On the other hand... if we make it so you can use sample_yaml as a fixed attribute, as the way to specify the sample yaml path, instead of using sample_yaml_path like described here...

http://looper.databio.org/en/latest/pipeline-interface-specification/#sample_yaml_path

or in other words, we only populate the sample yaml path if it is not already set... then it would be fine. But on second thought, it's probably better the way it is.

also, does that mean you can use: {pipeline.sample_yaml_path} ? and if so, does that vary by sample? I guess that would give you, literally, {sample.sample_name}.yaml, not the populated version?

donaldcampbelljr commented 2 months ago

Currently, only the path is populated via sample.sample_yaml_path if you specify it in var_templates and run the pre-submit function.

pipeline_name: count_lines
pipeline_type: sample
var_templates:
  pipeline: '{looper.piface_dir}/count_lines.sh'
  sample_yaml_path: "{looper.output_dir}/custom_sample_yamls/{sample.sample_name}.yaml"
command_template: >
  {pipeline.var_templates.pipeline} {sample.file}
pre_submit:
  python_functions:
    - looper.write_sample_yaml

If the pre-submit hook is used and no sample_yaml_path is specified in var_templates, looper will use the default path {looper.output_dir}/submission/{sample.sample_name}_sample.yaml

The resulting path is accessible as sample.sample_yaml_path in the command template.

The looper docs are inaccurate. They give the example:

  sample_yaml_path: "{looper.output_dir}/custom_sample_yamls"

However, this will cause an error, telling the user they must specify a path ending with .yaml.

Therefore, I tried and confirmed that something like this works fine:

  sample_yaml_path: "{looper.output_dir}/custom_sample_yamls/{sample.sample_name}.yaml"

To clarify, there is no sample.sample_yaml, only sample.sample_yaml_path. This path is accessible in the sample namespace and is set either to the default path or to a user-supplied path given under var_templates in the pipeline interface.
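
The behavior described in this comment can be summarized in a short sketch. It is an illustration under assumptions, not looper's code: `resolve_sample_yaml_path` is a hypothetical helper, the error message is invented, and namespaces are modeled with `SimpleNamespace`. Only the default template and the .yaml requirement come from the comment above.

```python
from types import SimpleNamespace

# Default path template quoted in the comment above.
DEFAULT_TEMPLATE = "{looper.output_dir}/submission/{sample.sample_name}_sample.yaml"


def resolve_sample_yaml_path(var_templates, looper, sample):
    """Pick the user-supplied template if present, else the default, then populate it."""
    template = var_templates.get("sample_yaml_path", DEFAULT_TEMPLATE)
    path = template.format(looper=looper, sample=sample)
    if not path.endswith(".yaml"):
        # Hypothetical message; the thread only says an error is raised.
        raise ValueError(f"sample_yaml_path must end with .yaml: {path}")
    return path


looper = SimpleNamespace(output_dir="/out")
sample = SimpleNamespace(sample_name="GSM4289908")

custom = {"sample_yaml_path": "{looper.output_dir}/custom_sample_yamls/{sample.sample_name}.yaml"}
print(resolve_sample_yaml_path({}, looper, sample))      # default path
print(resolve_sample_yaml_path(custom, looper, sample))  # user-supplied path
```

Passing the directory-only value from the docs example to this helper raises the error, mirroring the failure reported above.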

donaldcampbelljr commented 2 months ago

This will now be complete when the docs for Looper 1.9.0 are pushed to pepspec Master: https://github.com/pepkit/pepspec/pull/33