harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Add Library ID param to BWA map rule #2

Closed · cademirch closed this pull request 3 years ago

cademirch commented 3 years ago

Hi all - I slightly reworked the params of the BWA map rule to specify the library ID in the read group via a CSV file relating sample names to library IDs. I wrote a small function in helperFun to read the CSV file and store its info as a dictionary.
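A minimal sketch of what such a helper might look like, assuming the CSV has `sample` and `library_id` columns (the function name and headers here are hypothetical; the actual implementation lives in helperFun):

```python
import csv

def read_library_ids(csv_path):
    """Read a CSV mapping sample names to library IDs into a dict.

    Sketch only: assumes columns named 'sample' and 'library_id'.
    The resulting dict can then fill the LB field of the @RG tag.
    """
    lib_ids = {}
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            lib_ids[row["sample"]] = row["library_id"]
    return lib_ids
```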

Tested these changes on our cluster with Slurm and verified that the read group tags included the library ID as specified by the CSV file. I've included an example CSV.

Let me know what you all think, as well as what other features I can work on.

tsackton commented 3 years ago

Cade,

This is a good start. However, after some discussion today, we decided it would be good to add a few more pieces to this.

  1. The most important step is to extract all the sample information (sample ID, run/fastq, library, reference genome) from the sample sheet. Currently the pipeline detects samples from the names of the fastq files in the fastq directory, and the reference genome from the name of the fasta file in the reference directory. Since your code only uses the metadata file for the library ID, the pipeline will break if there is a conflict between the fastq files in the directory and the sample sheet. To resolve this, we should change the code so that it reads the samples and the reference genome from the sample sheet as well.

  2. Secondarily, once we do that we need to check that the files exist in the right directories before proceeding. The most complete way to do this would be to write a fastq download function and a genome download function, using the SRA toolkit for the fastq download and simply constructing a URL for wget for the genome download. If these are written as Snakemake rules, the workflow will automatically fetch any files that don't exist (see the sketch after this list).

  3. Finally, since we currently get sample names from the fastq files, we assume that one sample = one fastq. That is often not correct for SRA data, where multiple runs may represent the same biosample. So the bwa mem step needs to be updated to produce a bam for each run, and then the bams need to be merged (if there is more than one per sample) before dedup. The most efficient approach would likely be to add sorting to the mapping step and then replace the pre-dedup sort with merge + index.
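As mentioned in point 2, a minimal sketch of a fastq download rule, assuming the SRA toolkit (prefetch/fasterq-dump) is on the PATH and run accessions are paired-end; the rule name and output paths are hypothetical:

```python
rule download_fastq:
    # Fetch paired-end fastqs for one SRA run if they are not already on disk.
    output:
        r1 = "fastq/{run}_1.fastq.gz",
        r2 = "fastq/{run}_2.fastq.gz",
    shell:
        """
        prefetch {wildcards.run}
        fasterq-dump {wildcards.run} --split-files -O fastq/
        gzip fastq/{wildcards.run}_1.fastq fastq/{wildcards.run}_2.fastq
        """
```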

I will bounce this pull request back for now and we can work on this roadmap in this fork until it is finalized.

Edited to add: if you keep pushing new commits to your fork it should automatically update the pull request.

cademirch commented 3 years ago

Hey Tim,

This all sounds good. I didn't want to mess with the flow of the workflow (how sample names are obtained) to begin with, but you're right that it makes the most sense to use the sample sheet to determine the sample name, library ID, etc. I will work on implementing that. It would be great to confirm the format of the sample sheet; is what I have in samples.csv correct?

tsackton commented 3 years ago

Hi Cade,

You can find an example of the sample sheet format here: https://github.com/sjswuitchik/compPopGen_ms/blob/master/SRA/cleaned-metadata/Amphilophus_amarillo_run_metadata.csv

But only the first four columns (sample, library id, reference genome, and SRA run id) actually matter for this, I think.
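With that format, a minimal sketch of how the workflow could pull samples, library IDs, runs, and the reference out of the sheet, assuming pandas; the column headers here are guesses at the linked file's headers, not confirmed names:

```python
import pandas as pd

# Hypothetical column names: 'BioSample', 'LibraryName', 'refGenome', 'Run'.
sheet = pd.read_csv("samples.csv", dtype=str)

SAMPLES = sheet["BioSample"].unique().tolist()                  # unique biosamples
RUNS = sheet.groupby("BioSample")["Run"].apply(list).to_dict()  # sample -> its SRR runs
LIB_IDS = dict(zip(sheet["Run"], sheet["LibraryName"]))         # run -> library id
REF = sheet["refGenome"].iloc[0]                                # assumes one reference per sheet
```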

cademirch commented 3 years ago

Perfect, thanks!

cademirch commented 3 years ago

Hey Tim,

I'll commit my changes soon, but I'm trying to figure out how a sample with multiple SRRs will be represented in the sample sheet. Will there be multiple entries for a sample, like so:

```
samp1, lib1, ref, SRR1
samp1, lib1, ref, SRR2
samp2, lib2, ref, SRR3
...
```

tsackton commented 3 years ago

Yes, each unique SRR gets its own line.

cademirch commented 3 years ago

Great. Thanks.

cademirch commented 3 years ago

Hey Tim,

Got the workflow running with samples that have multiple SRRs, as well as those that have just one. Tested with samples.csv (in commit).
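A minimal sketch of the merge-before-dedup step from point 3 above, assuming sorted per-run bams, samtools on the PATH, and a RUNS dict mapping each sample to its run accessions (as in the sample-sheet sketch earlier); paths and the rule name are hypothetical:

```python
rule merge_bams:
    # Merge the sorted per-run bams into one bam per sample and index it,
    # so dedup sees a single input. RUNS is assumed defined as sketched above.
    input:
        lambda wc: expand("bams/{sample}/{run}.sorted.bam",
                          sample=wc.sample, run=RUNS[wc.sample]),
    output:
        bam = "bams/{sample}.merged.bam",
        bai = "bams/{sample}.merged.bam.bai",
    shell:
        """
        samtools merge -f {output.bam} {input}
        samtools index {output.bam}
        """
```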

tsackton commented 3 years ago

Hi Cade,

This looks very nice. One comment/question: am I just missing it, or is the get_ref_link function not used anywhere? It looks like the NCBI datasets tool is doing all the work to download the reference.

If that is the case, can you clean up the pieces of code that are no longer needed and test again? Then I'll approve the merge.

Tim

cademirch commented 3 years ago

Hi Tim,

Removed get_ref_link; it was used for downloading the reference before I found NCBI datasets. Also tested again and all seems well. I'm working on writing a small test script to automate testing going forward, but I can make another PR for that later.
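For reference, a minimal sketch of a reference-download rule built on the NCBI datasets CLI, which replaced get_ref_link here; the rule name, paths, and unpacking steps are assumptions, not the merged implementation:

```python
rule download_reference:
    # Fetch a genome assembly by accession with the NCBI datasets CLI,
    # then unpack the archive and concatenate its fasta file(s).
    output:
        fasta = "reference/{accession}.fa",
    shell:
        """
        datasets download genome accession {wildcards.accession} \
            --filename reference/{wildcards.accession}.zip
        unzip -o reference/{wildcards.accession}.zip -d reference/{wildcards.accession}
        cat reference/{wildcards.accession}/ncbi_dataset/data/{wildcards.accession}/*.fna > {output.fasta}
        """
```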