pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

Clarity on looper config file vs project config file (for pepatac) #432

Closed by rcorces 6 months ago

rcorces commented 6 months ago

We've been using pepatac with looper for a long time. I'm trying to update all of our infrastructure because we've been having package compatibility issues, so I'm moving from looper 1.3.2 to looper 1.5.1. This understandably breaks some things, and I'm having trouble understanding some of the looper documentation.

Specifically, I get this warning: https://github.com/pepkit/looper/blob/5c499a2d33451432fc5cdadf5a964c627c1b87c7/looper/looper.py#L1062-L1064

The "here is more information" part is presumably what I'm missing.

I've found this page https://looper.databio.org/en/latest/how_to_define_looper_config/#how-to-run-pipeline-using-looper-config-file but I haven't quite been able to figure out how I'm supposed to change my config files to comply with the new standard.

It isn't clear to me what information should go in the looper config file vs. the project config file, or what the logic is behind doing it this way (I'm sure there is logic, and I think knowing it would help me understand the difference).

This is what my current project config file looks like:

# This project config file describes your project. See looper docs for details.
name: HMC3_test_project # The name that summary files will be prefaced with

pep_version: 2.0.0
sample_table: HMC3_test_annotation.csv  # sheet listing all samples in the project

looper:  # relative paths are relative to this config file
  output_dir: "/corces/home/$USER/temp/pepatac_out_qc/"  # ABSOLUTE PATH to the parent, shared space where project results go
  pipeline_interfaces: ["/corces/home/shared/pipelines/pepatac_qc_config_pelayo/pelayo_project_pipeline_interface.yaml"]  # ABSOLUTE PATH to the directory where looper will find the pipeline repository

sample_modifiers:
  append:
    pipeline_interfaces: ["/corces/home/shared/pipelines/pepatac_qc_config_pelayo/pelayo_sample_pipeline_interface.yaml"]
  derive:
    attributes: [read1, read2]
    sources:
      read1: "/corces/home/shared/pipelines/pepatac_test_files/{sample_name}*R1*.fastq.gz"
      read2: "/corces/home/shared/pipelines/pepatac_test_files/{sample_name}*R2*.fastq.gz"
  imply:
    - if: 
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then: 
        genome: hg38
        macs_genome_size: hs
        prealignments: rCRSd human_alphasat human_alu human_rDNA human_repeats
        aligner: bowtie2         # Default. [options: bwa]
        deduplicator: samblaster # Default. [options: picard]
        trimmer: skewer          # Default. [options: pyadapt, trimmomatic]
        peak_type: fixed         # Default. [options: variable]
        extend: "250"            # Default. For fixed-width peaks, extend this distance up- and down-stream.
        frip_ref_peaks: None     # Default. Use an external reference set of peaks instead of the peaks called from this run
        blacklist: $GENOMES/hg38/blacklist/default/hg38_blacklist.bed.gz

Appreciate any help you can provide!

rcorces commented 6 months ago

To add more information in case it's helpful, I've tried the following:

looper config file:

pep_config: /corces/home/shared/pipelines/pepatac_test_files/HMC3_test_PEPconfig.yaml
output_dir: "/corces/home/$USER/temp/pepatac_out_qc/"
pipeline_interfaces:
  sample: ["/corces/home/shared/pipelines/pepatac_qc_config_pelayo/pelayo_sample_pipeline_interface.yaml"]
  project: ["/corces/home/shared/pipelines/pepatac_qc_config_pelayo/pelayo_project_pipeline_interface.yaml"]

project config file:

name: HMC3_test_project # The name that summary files will be prefaced with

pep_version: 2.0.0
sample_table: HMC3_test_annotation.csv  # sheet listing all samples in the project

sample_modifiers:
  append:
    pipeline_interfaces: ["/corces/home/shared/pipelines/pepatac_qc_config_pelayo/pelayo_sample_pipeline_interface.yaml"]
  derive:
    attributes: [read1, read2]
    sources:
      read1: "/corces/home/shared/pipelines/pepatac_test_files/{sample_name}*R1*.fastq.gz"
      read2: "/corces/home/shared/pipelines/pepatac_test_files/{sample_name}*R2*.fastq.gz"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        macs_genome_size: hs
        prealignments: rCRSd human_alphasat human_alu human_rDNA human_repeats
        aligner: bowtie2         # Default. [options: bwa]
        deduplicator: samblaster # Default. [options: picard]
        trimmer: skewer          # Default. [options: pyadapt, trimmomatic]
        peak_type: fixed         # Default. [options: variable]
        extend: "250"            # Default. For fixed-width peaks, extend this distance up- and down-stream.
        frip_ref_peaks: None     # Default. Use an external reference set of peaks instead of the peaks called from this run
        blacklist: $GENOMES/hg38/blacklist/default/hg38_blacklist.bed.gz

and then running looper with:

looper run --looper-config ./looper.config ./path/to/project_config.yaml

But I still get the deprecation warning.

rcorces commented 6 months ago

Ok. I think I've figured out the format issue that I was having. It looks like the correct format would be:

looper config file (I think the issue was with using [""] in the sample part of `pipeline_interfaces`):

pep_config: HMC3_test_project_config.yaml
output_dir: "/corces/home/$USER/temp/pepatac_out_qc/"
pipeline_interfaces:
  sample: /corces/home/shared/pipelines/pepatac_qc_config_pelayo/pelayo_sample_pipeline_interface.yaml

I'm still not sure I understand the advantage of having a separate looper config file, but I'll close this as solved.
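
For anyone making the same move, the split described in this thread can be sketched in Python on the already-parsed configs (plain dicts, for example as loaded from YAML). This is only an illustration under stated assumptions, not an official looper migration tool. `split_legacy_config` and `_unwrap` are hypothetical names, and the key names simply mirror the configs quoted above. Two points are assumptions the thread does not confirm: that the `pipeline_interfaces` entry under `sample_modifiers.append` should be removed from the project config, and that a single interface is best written as a bare string rather than a one-item list.

```python
from copy import deepcopy


def _unwrap(value):
    """Return a bare string for a one-item list, mirroring the config that
    worked in this thread (assumption: looper 1.5.1 prefers this form)."""
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value


def split_legacy_config(legacy: dict, pep_config_path: str) -> tuple[dict, dict]:
    """Split an old-style project config into (looper_config, pep_config).

    Hypothetical helper: key names follow the configs quoted in this thread.
    """
    pep = deepcopy(legacy)  # leave the caller's dict untouched

    # Run settings used to live under the `looper:` section of the project config.
    looper_section = pep.pop("looper", {})

    # Sample-level interfaces used to be attached via sample_modifiers.append.
    append = pep.get("sample_modifiers", {}).get("append", {})
    sample_ifaces = append.pop("pipeline_interfaces", None)
    if "sample_modifiers" in pep and not append:
        # Drop an `append` block that is now empty.
        pep["sample_modifiers"].pop("append", None)

    looper_config = {"pep_config": pep_config_path}
    if "output_dir" in looper_section:
        looper_config["output_dir"] = looper_section["output_dir"]

    interfaces = {}
    if sample_ifaces:
        interfaces["sample"] = _unwrap(sample_ifaces)
    if looper_section.get("pipeline_interfaces"):
        interfaces["project"] = _unwrap(looper_section["pipeline_interfaces"])
    if interfaces:
        looper_config["pipeline_interfaces"] = interfaces

    return looper_config, pep
```

Calling `split_legacy_config(old_config, "HMC3_test_project_config.yaml")` on a dict shaped like the first project config above returns a looper config with `pep_config`, `output_dir`, and `pipeline_interfaces` at the top level, plus a project config that keeps only the sample metadata (`name`, `pep_version`, `sample_table`, and the remaining `sample_modifiers`).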