sanjaynagi / AmpSeeker

A snakemake workflow for amplicon sequencing
https://sanjaynagi.github.io/AmpSeeker/

implement bcl to fastq conversion #15

Closed sanjaynagi closed 1 year ago

sanjaynagi commented 1 year ago

Need a rule at the start of the workflow to convert and demultiplex BCL files from the Illumina MiSeq output directory to fastq. The command should be something like:

bcl-convert --bcl-input-directory {illumina_out_dir} --output-directory resources/reads --sample-sheet {illumina_out_dir}/SampleSheet.csv
ChabbyTMD commented 1 year ago

Hi Sanjay, I'll get right on this

sanjaynagi commented 1 year ago

Any luck with this yet @ChabbyTMD ?

ChabbyTMD commented 1 year ago

Hi Sanjay,

I've hit a bit of a snag with this. bcl-convert produces fastq reads in the format:

`_S#_L00#_R#_001.fastq.gz`, while the alignment rules expect fastq reads in the format `_#.fastq.gz`, where `#` is either 1 or 2. The situation I'm facing is constructing a wildcard for the sample number `S#` in the rule `bcl_convert` below. My idea was to use the pandas expression `sample_number=samples.str.extract('(\d+)')` to extract the sample number from the corresponding sampleID string, e.g. sample99 would be 99. At least in theory. Upon executing this I'm only getting 0 for the sample numbers, and the workflow fails as it cannot identify the reads to rename them into the proper format needed by the other pipeline rules. Below is my rule file. I would appreciate your guidance on this.

```
rule bcl_convert:
    input:
        sample_csv = config["illumina_dir"] + "SampleSheet.csv",
    output:
        reads = expand(
            "resources/reads/{sample}_S{sample_number}_L001_R{n}_001.fastq.gz",
            n=[1, 2],
            sample=samples,
            sample_number=samples.str.extract('(\d+)'),
        ),
    log:
        "logs/bcl_convert.log"
    params:
        illumina_in_dir = config["illumina_dir"],
        output_reads = config["read_dir"],
    run:
        shell("bcl-convert --bcl-input-directory {params.illumina_in_dir} --force --output-directory {params.output_reads} --sample-sheet {input.sample_csv} 2> {log}")

rule rename_fastq:
    input:
        fastq_reads = expand(
            "resources/reads/{sample}_S{sample_number}_L001_R{n}_001.fastq.gz",
            n=[1, 2],
            sample=samples,
            sample_number=samples.str.extract('(\d+)'),
        ),
    output:
        reads = expand("resources/reads/{sample}_{n}.fastq.gz", n=[1, 2], sample=samples)
    log:
        "logs/rename_fastq.log"
    shell:
        "mv {input.fastq_reads} {output.reads} 2> {log}"
```
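For what it's worth, the pandas call may be part of the problem: `Series.str.extract` returns a DataFrame of *strings* (with NaN for IDs containing no digits), not a Series of integers, and as far as I know bcl-convert assigns the `S#` sample number from the sample's row order in SampleSheet.csv rather than from anything inside the sample ID. A minimal sketch (with hypothetical sample IDs) contrasting the two:

```python
import pandas as pd

# Hypothetical sample IDs standing in for the workflow's `samples` series
samples = pd.Series(["sample99", "sample7", "controlA"])

# str.extract returns a DataFrame of strings; IDs without digits become NaN
extracted = samples.str.extract(r"(\d+)")
print(extracted[0].tolist())  # ['99', '7', nan]

# bcl-convert numbers samples S1, S2, ... by row order in SampleSheet.csv,
# so sheet position, not the ID text, is the reliable source of the number
sample_numbers = list(range(1, len(samples) + 1))
print(sample_numbers)  # [1, 2, 3]
```

If every sample ID happened to encode its own sheet position, the `str.extract` approach could work, but deriving the number from row order avoids that assumption entirely.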
sanjaynagi commented 1 year ago

Good point Trevor, I had completely forgotten that we need to rename the files. Let me find my old commands for how I did this.

sanjaynagi commented 1 year ago

Try these inside a shell command. It may be necessary to install rename or add it to the conda environment.

shell:
  """
  rename 's/_S[[:digit:]]+_L001_R/_/' path/to/fastqs/* 2> {log}
  rename 's/_001//' path/to/fastqs/* 2>> {log}
  """

Had to look this up, but the first command gets us to something like sample95_1_001.fastq.gz, and the second command removes that final _001. This is tricky in general, and we have to hope the files are always named like this. It's a bit poor that we can't customise the output naming of bcl-convert, but never mind.

In terms of the snakemake input, I suggest we just designate the input as a whole directory, results/reads, then the output can be something like `expand("{sampleID}_{n}.fastq.gz", sampleID=samples, n=[1,2])`. It may mean that the output of the bcl-convert rule is `directory("results/reads/")`. Btw, that rule should not have the `run` statement, as that's for running Python code; just use `shell` instead.
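For concreteness, a sketch of what the bcl-convert rule might look like with a `directory()` output and `shell` instead of `run` (config keys and paths are assumptions, untested):

```
rule bcl_convert:
    input:
        sample_csv = config["illumina_dir"] + "SampleSheet.csv",
    output:
        reads = directory("results/reads/"),
    log:
        "logs/bcl_convert.log",
    params:
        illumina_in_dir = config["illumina_dir"],
    shell:
        "bcl-convert --bcl-input-directory {params.illumina_in_dir} --force "
        "--output-directory {output.reads} --sample-sheet {input.sample_csv} 2> {log}"
```

The rename rule would then take `"results/reads/"` as its input and declare the `expand(...)` fastq list as its output.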

Also, because the reads are now something we generate as part of the workflow, I think we should move the resources/reads folder to results/reads (throughout the workflow). This is useful because when you want to start the workflow from the beginning, you can just delete or move the whole results/ folder.

FYI, I've added information on how to contribute to AmpSeeker to the README, as it's been a while :) Thanks Trevor.