matsengrp / phip-flow

A Nextflow pipeline to align, merge, and organize large PhIP-Seq datasets
MIT License

patch aggregate workflow #69

Closed jgallowa07 closed 10 months ago

jgallowa07 commented 10 months ago

@sminot

In response to #68, I've patched the input files to include the ".gz" extension, here.

As also discussed in that issue, I think that when a user does not provide --sample_grouping_col (e.g. if no sample replicates exist), the workflow should by default simply skip aggregating samples. Setting the default value of that parameter to "sample_id" doesn't seem to work either, AFAICT because sample_id is used as the index for locating samples when you shard the aggregate_organisms step. Can you think of an easy way to skip sample aggregation?
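To make the requested behavior concrete, here is a minimal Python sketch. It is not the pipeline's actual Nextflow code. The function name `aggregate_samples`, the dictionary layout, and the choice to sum counts across replicates are all assumptions made for illustration only; the sole point is that a missing grouping column leaves every sample as its own unit.

```python
from collections import defaultdict


def aggregate_samples(counts, sample_table, grouping_col=None):
    """Hypothetical helper, for illustration only.

    counts:       {sample_id: {peptide_id: count}}
    sample_table: {sample_id: {column_name: value}}
    grouping_col: column to group replicates by, or None to skip aggregation
    """
    if grouping_col is None:
        # No grouping column supplied: skip aggregation and keep
        # one entry per sample_id, unchanged.
        return {sid: dict(peps) for sid, peps in counts.items()}

    # Assumption: replicates are combined by summing their counts.
    grouped = defaultdict(lambda: defaultdict(int))
    for sid, peps in counts.items():
        group = sample_table[sid][grouping_col]
        for pep, n in peps.items():
            grouped[group][pep] += n
    return {group: dict(peps) for group, peps in grouped.items()}


counts = {"s1": {"p1": 2, "p2": 1}, "s2": {"p1": 3}}
sample_table = {
    "s1": {"technical_replicate_id": "r1"},
    "s2": {"technical_replicate_id": "r1"},
}

ungrouped = aggregate_samples(counts, sample_table)
grouped = aggregate_samples(counts, sample_table, "technical_replicate_id")
```

With no grouping column the result is still keyed by sample_id, so a downstream step that locates samples by that index would not need a special case.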

P.S. I'm not sure if you have a VirScan testing/dev set, but I'm just running

nextflow run main.nf \
        --summarize_by_organism true \
        --peptide_seq_col "Prot" \
        --peptide_org_col "Virus" \
        --sample_grouping_col "technical_replicate_id" \
        -profile docker \
        --results "$(date -I)"

which runs the default pan-CoV-example data. I think this should be fine?

All advice (/ pushes to this branch) welcomed and appreciated!

sminot commented 10 months ago

I just pushed a patch which I think handles this use case gracefully. Local tests appear to be working.

jgallowa07 commented 10 months ago

Beautiful, working on my end as well. Huge thanks, @sminot