rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Add multi sample capability. #71

Closed LeeBergstrand closed 10 months ago

LeeBergstrand commented 10 months ago

Adds preliminary multi-sample capability.

LeeBergstrand commented 10 months ago

@jmtsuji Hey, could you review this visually and/or on your server and see if you can find any bugs? I assume there will be some when I finally run it. I'm updating my test server environment tomorrow, but I wanted to give you a heads-up so you can review it and pick out any big errors.

--dry-run works and makes the right copies of rules to run.

LeeBergstrand commented 10 months ago

At the top of the Snakefile, I create a dictionary of sample objects, keyed by sample identifier, with one object per sample.

I also defined a list at the top, SAMPLE_NAMES, that holds the sample identifiers.
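Roughly, the two structures look like the sketch below. This is a minimal illustration, not the code in the branch: the `Sample` class, its field names, the `sample_rows` data, and the `SAMPLES` name are all assumptions made up for the example. Only `SAMPLE_NAMES` is the name used in the Snakefile.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical sample record; the field names are assumptions."""
    identifier: str
    long_read_path: str
    short_read_left_path: Optional[str] = None
    short_read_right_path: Optional[str] = None


# Made-up rows standing in for whatever sample sheet or config the workflow reads.
sample_rows = [
    ("bs4", "raw/bs4.fastq.gz", "raw/bs4_R1.fastq.gz", "raw/bs4_R2.fastq.gz"),
    ("bs5", "raw/bs5.fastq", None, None),
]

# Dictionary of sample objects keyed by sample identifier.
SAMPLES = {row[0]: Sample(*row) for row in sample_rows}

# List of identifiers, used later to expand per-sample paths.
SAMPLE_NAMES = list(SAMPLES.keys())
```

Keeping the identifiers in their own list means rules can expand over `SAMPLE_NAMES` without touching the sample objects themselves.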

I added a rule called set_up_sample_directories that takes this dictionary and uses it to create all the sample output directories and symlink the input fastq files into those directories (if they aren't gzipped, I gzip them here).
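The per-sample work in that rule could be sketched in plain Python as below. This is an assumption-laden illustration rather than the rule's actual body: the function name `set_up_sample_directory`, its parameters, and the choice to detect gzipped input by the `.gz` extension are all hypothetical.

```python
import gzip
import os
import shutil


def set_up_sample_directory(sample_id, fastq_path, output_root="."):
    """Create <output_root>/<sample_id>/ and place <sample_id>_long.fastq.gz in it."""
    sample_dir = os.path.join(output_root, sample_id)
    os.makedirs(sample_dir, exist_ok=True)
    target = os.path.join(sample_dir, f"{sample_id}_long.fastq.gz")

    if os.path.lexists(target):
        os.remove(target)  # Replace a stale link or file from an earlier run.

    if fastq_path.endswith(".gz"):
        # Already gzipped (judged by extension only): symlink instead of copying.
        os.symlink(os.path.abspath(fastq_path), target)
    else:
        # Not gzipped: write a gzipped copy into the sample directory.
        with open(fastq_path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Calling this once per entry in the sample dictionary gives every sample the same `{sample}/{sample}_long.fastq.gz` layout, whether the input arrived gzipped or not.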

For this rule and each of the checkpoint rules, I used expand() to build the list of output files from the sample identifiers.

expand("{sample}/{sample}_long.fastq.gz", sample=SAMPLE_NAMES) # Expands to a list of long fastq file paths, one per sample.

I added {sample}/ wildcards to all the paths.

For some of your existing expand() calls I added {{sample}}/. The doubled braces tell expand() not to fill in sample, so you still get an expansion, but each expanded path starts with {sample}/, which is later interpreted as a wildcard.
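The brace behaviour can be reproduced outside Snakemake with ordinary string formatting. The sketch below uses `expand_like`, a simplified stand-in I wrote for this illustration (it is not Snakemake's `expand()`), and the `example/{step}.txt` path and step values are made up.

```python
from itertools import product


def expand_like(pattern, **wildcards):
    """Simplified stand-in for expand(): format pattern over all value combinations."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


SAMPLE_NAMES = ["bs4", "bs5"]

# Single braces: sample is filled in, giving one concrete path per sample.
per_sample = expand_like("{sample}/{sample}_long.fastq.gz", sample=SAMPLE_NAMES)

# Doubled braces: {{sample}} is left alone and comes out as a literal {sample},
# so only step is expanded and sample stays available as a wildcard.
partially_expanded = expand_like("{{sample}}/example/{step}.txt", step=["a", "b"])
```

Here `per_sample` holds fully resolved paths, while every entry in `partially_expanded` still begins with `{sample}/`.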

In a rule's params:, wildcards cannot be accessed directly, so I either moved them into input or output or accessed the wildcard manually with a lambda function.

strain=lambda wildcards: wildcards.sample # Get sample name from wildcards.

I'm not sure if this is the proper way to access them, but we will see at runtime.
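As far as I understand it, Snakemake calls a function given under params: with the job's wildcards as its first argument, so the lambda can be exercised on its own. In the sketch below, `SimpleNamespace` is only a stand-in for the real wildcards object (an assumption that attribute access is all that matters here), and `job_wildcards` and `resolved` are names made up for the example.

```python
from types import SimpleNamespace

# The kind of callable placed under params: in a rule.
strain = lambda wildcards: wildcards.sample  # Get sample name from wildcards.

# Stand-in for the wildcards object built for one job of a {sample}/ rule.
job_wildcards = SimpleNamespace(sample="bs4")

# What the params value would resolve to for that job.
resolved = strain(job_wildcards)
```

With the stand-in above, `resolved` is the plain string for that job's sample.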

LeeBergstrand commented 10 months ago

I also moved some of your checks for whether to use short read polishing into a single variable, POLISH_WITH_SHORT_READS, at the top of the file. I think we could add this as a config variable down the road.

I expect that deciding whether to short read polish on a per-sample basis would be tricky and might need a significant rewrite of the switching logic.

We also need to check that I didn't screw up the logic.
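For the per-sample idea, one possible shape of the decision is sketched below. It is only a starting point under stated assumptions: the dictionary layout, the `short_left` / `short_right` keys, and the `polish_with_short_reads` helper are hypothetical, and the sketch leaves out the harder part, which is wiring the result into the rule switching logic.

```python
def polish_with_short_reads(sample, globally_enabled=True):
    """Return True if this sample should go through short read polishing."""
    has_short_reads = bool(sample.get("short_left")) and bool(sample.get("short_right"))
    return bool(globally_enabled) and has_short_reads


# Made-up samples: bs4 has paired short reads, bs5 has long reads only.
SAMPLES = {
    "bs4": {"long": "bs4.fastq.gz", "short_left": "bs4_R1.fastq.gz", "short_right": "bs4_R2.fastq.gz"},
    "bs5": {"long": "bs5.fastq.gz", "short_left": None, "short_right": None},
}

POLISH_WITH_SHORT_READS = True  # Global switch; could later come from the config.

samples_to_polish = [
    name for name, sample in SAMPLES.items()
    if polish_with_short_reads(sample, POLISH_WITH_SHORT_READS)
]
```

The global switch still wins: with `POLISH_WITH_SHORT_READS = False`, no sample is selected regardless of its inputs.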

jmtsuji commented 10 months ago

As one other note, I haven't been able to commit new rules yet for short read QC, although I have some draft code started. Sorry for the delay here! Might it work well for me to commit the short read QC rules directly to the multi-sample version of the workflow, given that the multi-sample version is already in reasonable shape? For example, next week I could make a fork of this branch and use it to add the short read QC rules. Thanks!

LeeBergstrand commented 10 months ago

> As one other note, I haven't been able to commit new rules yet for short read QC, although I have some draft code started. Sorry for the delay here! Might it work well for me to commit the short read QC rules directly to the multi-sample version of the workflow, given that the multi-sample version is already in reasonable shape? For example, next week I could make a fork of this branch and use it to add the short read QC rules. Thanks!

@jmtsuji This sounds good to me. Feel free to branch and merge into this branch.

LeeBergstrand commented 10 months ago

TODO: I also think we should add more {sample} wildcards to the output file names so you can differentiate files from each other. For example, name the different assembly.fasta files in different sample folders with their sample names (e.g. bs4_assembly.fasta).
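A quick illustration of why the prefix helps, as a sketch only: the `bs5` sample name is made up, and the comparison looks at base names, which is all that is left once files are copied out of their sample folders.

```python
import posixpath

SAMPLE_NAMES = ["bs4", "bs5"]

# Current layout: the same file name in every sample folder.
unprefixed = [f"{sample}/assembly.fasta" for sample in SAMPLE_NAMES]

# Proposed layout: the sample name is repeated in the file name.
prefixed = [f"{sample}/{sample}_assembly.fasta" for sample in SAMPLE_NAMES]

# Base names as they would appear after collecting the files into one place.
unprefixed_names = {posixpath.basename(path) for path in unprefixed}
prefixed_names = {posixpath.basename(path) for path in prefixed}
```

The unprefixed set collapses to a single name, so collected files would overwrite each other, while the prefixed set keeps one distinct name per sample.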

LeeBergstrand commented 10 months ago

@jmtsuji I pushed new updates to this branch to fix some preliminary bugs I encountered.

jmtsuji commented 10 months ago

> TODO: I also think we should add more {sample} wildcards to the output file names so you can differentiate files from each other. For example, name the different assembly.fasta files in different sample folders with their sample names (e.g. bs4_assembly.fasta).

Agreed -- this sounds like a nice idea. For now, I will focus on end-to-end testing, but let's keep this in mind when doing edits to the snakefile down the road.

Update: I see that you added more {sample} wildcards to the multi-pathing branch -- thanks!