matsengrp / phip-flow

A Nextflow pipeline to align, merge, and organize large PhIP-Seq datasets
MIT License

Multi-library Manifest #36

Closed jgallowa07 closed 2 years ago

jgallowa07 commented 2 years ago

Ultimately, we could replace both the sample table and the peptide table parameters with a single manifest file (CSV) that looks almost exactly like the sample table, except that it has a required "peptide_reference" column giving the relative filepath to the correct peptide table for each sample. With a little tweaking, the pipeline could then run library workflows in parallel: if the "peptide_reference" column contains N unique peptide tables across all samples, the workflow would be run N times in parallel and produce N different data sets.
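
As a rough illustration of the splitting step described above, here is a minimal Python sketch that groups manifest rows by their "peptide_reference" value. It is not how the pipeline (which is written in Nextflow) does or would do this. The function name, the `sample_id` and `fastq_filepath` columns, and the file paths are all hypothetical; only the "peptide_reference" column name comes from the proposal.

```python
import csv
import io
from collections import defaultdict


def split_manifest_by_library(manifest_file):
    """Group manifest rows by their 'peptide_reference' value.

    Returns a dict mapping each unique reference path to the list of
    sample rows that should be aligned against it.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(manifest_file):
        reference = row.get("peptide_reference")
        if not reference:
            # The proposal makes this column required for every sample.
            raise ValueError(f"row is missing a peptide_reference: {row}")
        groups[reference].append(row)
    return dict(groups)


# Hypothetical manifest: three samples spread over two libraries.
EXAMPLE_MANIFEST = """sample_id,fastq_filepath,peptide_reference
s1,reads/s1.fastq,libraries/lib_a.csv
s2,reads/s2.fastq,libraries/lib_b.csv
s3,reads/s3.fastq,libraries/lib_a.csv
"""

groups = split_manifest_by_library(io.StringIO(EXAMPLE_MANIFEST))
for reference, rows in groups.items():
    # Each group would correspond to one independent workflow run.
    print(reference, [row["sample_id"] for row in rows])
```

With N unique references in the column, `groups` has N entries, one per data set the workflow would produce.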

jgallowa07 commented 2 years ago

@sminot - I'd love to hear your thoughts on this.

Some background

The workflow currently requires the user to specify a single library reference (peptide_table). However, the alignment, collection, and merging steps of the pipeline are already set up to run separately for each of the N peptide references specified; this is how the pipeline was originally set up a long time ago. In short, it would be trivial to convert back to this multi-library functionality.

Is it a worthwhile feature?

I can't really imagine the majority of folks having many libraries. However, if they do, it would be cool if they could keep all their samples in a single manifest and simply specify a "peptide_reference" column with a relative path pointing to the library each sample should be aligned to. It would certainly keep things organized for folks who, as we do, generate many small custom libraries and want to keep sample data in a single file, and it would let them run workflows on all the libraries in parallel. Then again, this could just be excessive feature loading, and it may be a hassle to specify the reference path in a manifest column rather than as a single reference parameter in the config file. (And if you really have multiple libraries, you can just split the manifests and config files into separate sets to be run independently of one another.)

sminot commented 2 years ago

When we consider particular features, it's useful to start with some assumptions about how the pipeline will be used. The single library reference is really better for people who will be processing multiple datasets with the same library. The multiple library reference is only going to be useful for people who want to process data from many different libraries and combine the output data into a single object.

In my own understanding of the PhIP-seq technology, I can definitely see why a user may want to process data from multiple libraries. However, I'm not sure that I can see the utility of having the outputs from different libraries combined into the same output files. This could be my own lack of imagination, of course.

Overall, I'm having trouble seeing the utility of the multi-library functionality which couldn't be accomplished just as easily by running the workflow once for each library. The advantage of separate runs for each peptide library would also seem to be that the outputs can easily be kept distinct. I'm thinking about edge cases like having the same peptide ID across two different libraries which may not correspond to the same peptide sequence, but which may be collapsed into the same feature if multiple libraries were provided.
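
To make the peptide ID edge case concrete, here is a minimal Python sketch of a check that flags IDs mapped to different sequences in different libraries. Everything in it is hypothetical: the function name, the library names, and the in-memory layout (library name to a dict of peptide ID to sequence) are assumptions for illustration and do not reflect the pipeline's actual peptide table format.

```python
from collections import defaultdict


def find_conflicting_peptide_ids(libraries):
    """Find peptide IDs that map to more than one distinct sequence.

    `libraries` maps a library name to a dict of peptide_id -> sequence.
    Returns {peptide_id: {sequence: [library names]}} for conflicting IDs.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for library_name, peptides in libraries.items():
        for peptide_id, sequence in peptides.items():
            seen[peptide_id][sequence].append(library_name)
    # Keep only IDs that were seen with two or more different sequences.
    return {
        peptide_id: dict(by_sequence)
        for peptide_id, by_sequence in seen.items()
        if len(by_sequence) > 1
    }


# Hypothetical libraries: ID "1" is reused for two different sequences,
# while ID "2" carries the same sequence in both.
libraries = {
    "lib_a": {"1": "ACGT", "2": "GGCC"},
    "lib_b": {"1": "TTTT", "2": "GGCC"},
}

conflicts = find_conflicting_peptide_ids(libraries)
print(conflicts)
```

An ID that appears in `conflicts` is exactly the kind that could be collapsed into a single feature if outputs from several libraries were merged; separate runs per library sidestep the problem entirely.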

That's my two cents -- only one peptide library per run to avoid adding needless complexity.

jgallowa07 commented 2 years ago

Good points - I'll point out that including N peptide libraries would actually result in N separate datasets: it would essentially split the manifest and run the pipeline N times in parallel. But I guess a bash loop over multiple sets of sample and peptide tables could accomplish the same thing. Alright, agreed: not worth the extra required column or the time. Thanks, @sminot!