iqbal-lab-org / clockwork

CRyPTIC data processing pipelines
MIT License

QC pipeline with script only #103

Open khoahoc0508 opened 1 year ago

khoahoc0508 commented 1 year ago

Hello, I want to update to the new version, but I cannot see how to run QC when using only the scripts, without the tracking database. I want to use it the way I used the older version (FastQC and samtools QC). Please give me some guidance so I can QC my data before analysis.

Sincerely, Trung

martinghunt commented 1 year ago

The QC pipeline just runs FastQC and samtools stats/plot-bamstats. You could run the commands below, but they are nothing special, just wrappers around those programs:

clockwork samtools_qc reference.fasta reads.1.fastq reads.2.fastq output_dir

clockwork fastqc outdir reads.1.fastq reads.2.fastq

khoahoc0508 commented 1 year ago

Thank you, @martinghunt; it works flawlessly. Now I can switch entirely to the new version. Could you also advise on minimum quality requirements for paired-end input files? I am still confused about this.

Sincerely, Trung

martinghunt commented 1 year ago

How you decide a sample is bad and remove it is up to you :) There's no set method of doing so and it depends on what analysis you're doing.

You could remove samples up front, e.g. if (making up example numbers) <90% of the genome has coverage >20X, or if a low % of reads map, or if the reads are low quality (e.g. a high error rate reported by samtools stats).
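As a rough sketch of those up-front checks (not something clockwork provides), the helpers below pull numbers out of `samtools stats` and `samtools depth -a` output. The function names are hypothetical, the input file names in the comments are placeholders, and any cutoff you compare against is your own choice, as noted above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are made up for illustration.

# Print one "SN" summary value from a `samtools stats` output file.
# $1 = stats file, $2 = field name without the trailing colon.
sn_field() {
    awk -F'\t' -v key="$2:" '$1 == "SN" && $2 == key { print $3; exit }' "$1"
}

# Percentage of reads that mapped, computed from a `samtools stats` file
# as "reads mapped" / "raw total sequences".
percent_mapped() {
    local total mapped
    total=$(sn_field "$1" "raw total sequences")
    mapped=$(sn_field "$1" "reads mapped")
    awk -v m="$mapped" -v t="$total" \
        'BEGIN { if (t > 0) printf "%.2f\n", 100 * m / t; else print "0.00" }'
}

# Percentage of positions with depth strictly above $1.
# Reads `samtools depth -a` output (chrom, pos, depth) on stdin.
percent_covered() {
    awk -v min="$1" \
        '$3 > min { n++ } END { if (NR > 0) printf "%.2f\n", 100 * n / NR; else print "0.00" }'
}

# Example usage (placeholder file names):
#   samtools stats sample.bam > sample.stats
#   percent_mapped sample.stats
#   sn_field sample.stats "error rate"
#   samtools depth -a sample.bam | percent_covered 20
```

The printed percentages can then be compared against whatever thresholds suit the analysis, for instance the made-up 90% at >20X mentioned above.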

You could remove samples after variant calling, e.g. for TB if a sample has >10k variants, or if it has a lot of "heterozygous" calls (both of those things suggest contamination).
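A minimal sketch of those post-calling checks, assuming a single-sample VCF in which GT is the first FORMAT key. The function names are hypothetical, and how a particular pipeline encodes "heterozygous" calls may differ, so treat this as a starting point only.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a single-sample VCF on stdin, sample in column 10.

# Count all non-header records. If the VCF also contains reference calls,
# filter those out first, otherwise they are counted as variants here.
count_variants() {
    awk '!/^#/ { n++ } END { print n + 0 }'
}

# Count records whose genotype has two different called alleles,
# e.g. 0/1 or 1|2. Genotypes with a missing allele (".") are skipped.
count_het() {
    awk -F'\t' '!/^#/ {
        split($10, f, ":")                    # f[1] is the GT value
        n_alleles = split(f[1], a, "[/|]")    # unphased or phased separator
        if (n_alleles == 2 && a[1] != a[2] && a[1] != "." && a[2] != ".") n++
    } END { print n + 0 }'
}

# Example usage (placeholder file name; 10000 is the TB example figure above):
#   n=$(count_variants < sample.vcf)
#   [ "$n" -gt 10000 ] && echo "sample.vcf: many variants, possible contamination"
#   count_het < sample.vcf
```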

khoahoc0508 commented 1 year ago

Thanks very much, @martinghunt. These recommendations are helpful for me. I previously used clockwork when it was part of the sp3 platform developed by Oxford University, but that platform is now going down, so I am following its workflows step by step, and there are some things I cannot handle.

Sincerely, Trung