moiexpositoalonsolab / grenepipe

A flexible, scalable, and reproducible pipeline to automate variant calling from raw sequence reads, with lots of bells and whistles.
http://grene-net.org
GNU General Public License v3.0
93 stars 21 forks source link

FastQC on one set of reads only #10

Closed bensprung closed 3 years ago

bensprung commented 3 years ago

I'm not sure if this is a bug or if I'm misinterpreting something. I have paired-end reads. In the trimmomatic logs I can see we are indeed using both fq1 and fq2:

TrimmomaticPE: Started with arguments:
 -threads 1 /home/ben/Synced/research/mtDNA_copy_number/sequencing_2021-06-07/illumina/BGS5_S162_R1_001.fastq.gz /home/ben/Synced/research/mtDNA_copy_number/sequencing_2021-06-07/illumina/BGS5_S162_R2_001.fastq.gz trimmed/BGS5-1.1.fastq.gz trimmed/BGS5-1.1.unpaired.fastq.gz trimmed/BGS5-1.2.fastq.gz trimmed/BGS5-1.2.unpaired.fastq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Quality encoding detected as phred33
Input Read Pairs: 3550692 Both Surviving: 3519199 (99.11%) Forward Only Surviving: 18163 (0.51%) Reverse Only Surviving: 7047 (0.20%) Dropped: 6283 (0.18%)
TrimmomaticPE: Completed successfully

However the fastqc log only the first file appears:

(base) ben@OptiPlex990:~/grenepipe_BGS5_W303/logs/fastqc$ cat BGS5-1.log 
Input:       /home/ben/Synced/research/mtDNA_copy_number/sequencing_2021-06-07/illumina/BGS5_S162_R1_001.fastq.gz
Output zip:  qc/fastqc/BGS5-1_fastqc.zip
Output html: qc/fastqc/BGS5-1_fastqc.html

--

Started analysis of BGS5_S162_R1_001.fastq.gz
Approx 5% complete for BGS5_S162_R1_001.fastq.gz
Approx 10% complete for BGS5_S162_R1_001.fastq.gz
Approx 15% complete for BGS5_S162_R1_001.fastq.gz
Approx 20% complete for BGS5_S162_R1_001.fastq.gz
Approx 25% complete for BGS5_S162_R1_001.fastq.gz
Approx 30% complete for BGS5_S162_R1_001.fastq.gz
Approx 35% complete for BGS5_S162_R1_001.fastq.gz
Approx 40% complete for BGS5_S162_R1_001.fastq.gz
Approx 45% complete for BGS5_S162_R1_001.fastq.gz
Approx 50% complete for BGS5_S162_R1_001.fastq.gz
Approx 55% complete for BGS5_S162_R1_001.fastq.gz
Approx 60% complete for BGS5_S162_R1_001.fastq.gz
Approx 65% complete for BGS5_S162_R1_001.fastq.gz
Approx 70% complete for BGS5_S162_R1_001.fastq.gz
Approx 75% complete for BGS5_S162_R1_001.fastq.gz
Approx 80% complete for BGS5_S162_R1_001.fastq.gz
Approx 85% complete for BGS5_S162_R1_001.fastq.gz
Approx 90% complete for BGS5_S162_R1_001.fastq.gz
Approx 95% complete for BGS5_S162_R1_001.fastq.gz
Analysis complete for BGS5_S162_R1_001.fastq.gz

I noticed this because in MultiQC there seems to be one line for the sample as a whole, then another line for fq1, but no line for fq2:

image

lczech commented 3 years ago

Hi @bensprung,

you are right, for fastqc we are currently only using the R1 fastq file as input. This is mostly because so far (at least in our tests), the results for R2 are kind of the same, so they do not add much information, and also because I was a bit lazy when coding this - it's a bit tricky to change the current behavior and add R2 as well. This is because the settings for fastqc also allow to use the trimmed reads instead of the raw samples as input, and in turn, the trimmed reads can also be merged into a single fastq file. Hence, there is a combination of settings where the data has R1 and R2, but we'd still only want to use one file as input - and I just avoided coding that so far ;-)

If you think it's worth having fastqc run on R2 in the case that raw sample data is used or trimming without merging is used as input for fastqc, I can add that though. Let me know, it's tricky, but doable :-)

Cheers Lucas

lczech commented 3 years ago

Okay okay, I couldn't resist and coded this now. Please download the latest grenepipe, which now supports FastQC on both paired end reads, as well as on the trimmed (and potentially merged) files as well, depending on the FastQC settings in the config file.

Please let me know if that is what you needed and works for you.

So long Lucas

bensprung commented 3 years ago

Ha, thank you! It looks great. I was going to say, I didn't want to create extra work. But I do have a good (post-hoc now) case for doing this. I had somehow added trailing tabs to some of my samples.tsv files, causing only one set of reads to be processed, which I didn't notice for a while. I might have noticed it faster if the typical multiQC view had stats for both files, as it does now with this change. Very nice!

lczech commented 3 years ago

I was going to say, I didn't want to create extra work.

No worries - I want to create a tool that is actually useful, and my use case might differ from yours, so thanks for your input on what kind of tools and things are actually needed in practice :-)

I had somehow added trailing tabs to some of my samples.tsv files, causing only one set of reads to be processed, which I didn't notice for a while.

Oh, that is unexpected - could you explain what happened there in some more detail please? Do you have single or paired end end reads (one or two columns in the table)? How does a tab cause the trouble there? That should get fixed, as this seems like an easy to make mistake.

bensprung commented 3 years ago

Yep I had paired end reads. I (mistakenly) formatted my samples.tsv like this:

sample\tunit\tplatform\tfq1\tfq2
sample_name\t1\tILLUMINA\t/path/to/fq1\tpath/to/fq2\t

So there's an extra tab at the end of the first non-header row. It seemed to cause a one-off error in what field was interpreted as what, but the pipeline still ran with fq2 playing the role of fq1, and no fq2. I think!

bensprung commented 3 years ago

p.s. as long as I have you on the line, a couple things I have noticed about snpEff:

(1) the download-dir parameter in config.yaml must have a trailing slash to work, e.g. "/home/ben/my_download_dir/" works but "/home/ben/my_download_dir" does not. Not sure if that is the intended behavior (maybe just a warning about this in the comments would suffice?).

(2) if feasible, it might be good to support custom databases for snpEff. I am new to using it but so far as I can tell the yeast databases it supplies are kind of out of date, it has R64-1 while the latest is R64-3

lczech commented 3 years ago

Hm weird. I'm using pandas to read that table - no idea what it does with those extra tabs... Apparently not the right thing though. Might look into this at some point.

As for your PS: Good points! Would you mind opening issues for those as well?

Closing this one here for now - feel free to re-open as needed, and see you in the other issues then (no worries, I'll be on the line here for a while).