loosolab / TOBIAS_snakemake

Snakemake pipeline for running TOBIAS analysis
MIT License

Questions regarding formatting of configfile #10

Closed (sufyazi closed this issue 1 year ago)

sufyazi commented 1 year ago

Hi Mette,

Nice work on the snakemake pipeline. It's very convenient and removes the need for me to write one up for running a massive footprinting analysis for our project!

I have some questions regarding the formatting of the configfile:

  1. If I have multiple input bam files, I assume I should provide them as paths to these files in the input data section as a Python list? So formatted with a comma and a space like so [x, x, x, ...]? I also assume the merging of these files would be done in the pre-processing step? I had taken a quick look into the Snakefiles but it is still not very clear as I am new to snakemake.

  2. If I want to run the pipeline to do a single-condition footprinting, how do I specify that in the config? Just delete the extra line specifying the second condition?

  3. If I already have merged peaks for both conditions, can I just start the pipeline halfway through? I see that in the example config you commented out a line specifying path to merged peaks file (annotated), I assume I can just uncomment this? What if my merged peaks aren't annotated? Will this snakemake pipeline know where to pick up?

msbentsen commented 1 year ago

Hi, to answer your questions:


If I have multiple input bam files, I assume I should provide them as paths to these files in the input data section as a Python list?

Yes, you just give a list of bam-files like:

Tcell: [data/Tcell_day1.bam, data/Tcell_day2.bam]  #list of .bam-files

These files are then merged during preprocessing here: https://github.com/loosolab/TOBIAS_snakemake/blob/842b6c897d6bdb33e574199367c4d54e8c9e9592/snakefiles/preprocessing.snake#L49-L56


If I want to run the pipeline to do a single-condition footprinting, how do I specify that in the config?

Yes, you just delete the other condition, see for example here from a previous issue: https://github.com/loosolab/TOBIAS_snakemake/issues/7#issuecomment-1564339732


If I already have merged peaks for both conditions, can I just start the pipeline halfway through?

Indeed, you can just uncomment this and add your own file. The pipeline assumes that the .bed-file is annotated with additional columns (e.g. nearby gene), but the annotation is not strictly required for the run. The information is not used for anything other than carrying the columns over into the output TFBS .bed-files, so it is just a way to annotate peaks -> gene prior to the run. We use our own UROPA software for this step.
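
As a rough illustration (not part of the pipeline), the sketch below checks whether a peak file carries columns beyond the three standard BED coordinates (chrom, start, end). The function name `count_bed_columns` and the file name `merged_peaks.bed` are hypothetical, and the file is assumed to be tab-separated.

```python
import tempfile
from pathlib import Path


def count_bed_columns(path):
    """Return the column count of the first data line in a tab-separated BED file."""
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and common header/comment lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            return len(line.rstrip("\n").split("\t"))
    return 0  # no data lines found


# Hypothetical demo file holding coordinates only (no annotation columns).
demo = Path(tempfile.mkdtemp()) / "merged_peaks.bed"
demo.write_text("chr1\t100\t200\nchr1\t300\t400\n")

n_cols = count_bed_columns(demo)
print("annotated" if n_cols > 3 else "coordinates only")
```

A file that reports only three columns has no annotation to carry over into the output files.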

sufyazi commented 1 year ago

Hi Mette,

Thanks for the clarification. I have been thinking of parallelizing my workflow because I would need to do pairwise comparison between different condition samples with different control samples, but I recently learned that BinDetect can actually process more than 2 files.

So my next question is: if I were to do this using this snakemake pipeline, how would I go about doing that? Just input say:

Conditions: [data/conditionA.bam, data/conditionB.bam, data/conditionC.bam, data/conditionD.bam]
Controls: [data/controlA.bam, data/controlB.bam, data/controlC.bam, data/controlD.bam] 

like that? If yes, is there a way to finetune the pairwise matching or does this use snakemake wildcards under the hood so it's not easy to finetune it?

By finetuning I meant guiding the pairwise comparison so that it is only done in a certain way; in my case, I am only interested in the combinations condA - controlA, condA - controlB, condC - controlC, etc. (intergroup comparison), not condA - condB, condB - condC, or controlA - controlC, controlA - controlD (intragroup comparison). Just wondering if this is possible with the current implementation or if I need to think of another strategy.

msbentsen commented 1 year ago

In order to compare all conditions with all controls, you would need to set these as individual conditions in the config file, e.g.:

conditionA: [data/conditionA.bam]
conditionB: [data/conditionB.bam]   # or [data/conditionB_rep1.bam, data/conditionB_rep2.bam] in case of more files
(...)
controlA: [data/controlA.bam]
(...)

Unfortunately, you cannot choose which pairwise comparisons are created during the run, but you can always do your own filtering afterwards to only show the contrasts of interest.
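
The filtering idea can be sketched in a few lines of Python, assuming the run yields every pairwise combination of the configured conditions. The condition names and the `control` prefix rule are hypothetical and only mirror the config example; the sketch does not read any TOBIAS output, so matching the retained pairs to the actual result files or columns is left to the reader.

```python
from itertools import combinations

# Hypothetical condition names, mirroring the keys in the config example.
conditions = ["conditionA", "conditionB", "controlA", "controlB"]


def is_control(name: str) -> bool:
    """Assumption: control samples are identified by a 'control' name prefix."""
    return name.startswith("control")


# Every unordered pair, i.e. an all-against-all comparison.
all_pairs = list(combinations(conditions, 2))

# Keep only pairs that mix one condition with one control (intergroup).
intergroup = [(a, b) for a, b in all_pairs if is_control(a) != is_control(b)]

for a, b in intergroup:
    print(f"{a} vs {b}")
```

With four names, this keeps four of the six possible pairs and drops the two intragroup ones (conditionA vs conditionB, controlA vs controlB).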

sufyazi commented 1 year ago

Thank you. That makes sense! Closing this now.