N characters introduced into *indels.csv

alan-tracey commented 3 months ago

Description of the bug

Hi, I’ve just run crisprseq using the targeted pipeline with a read1.fastq.gz only. I heavily quality-filtered the input reads, removing any reads containing N characters. In the output indels.csv, there are many cases of N characters being reported in the "pre_ins_nt", "ins_nt" and "post_ins_nt" columns. When I check these reads in the input fastq file, the bases reported as N are [ACGT] characters with Q value > 30. For the handful of reads I’ve looked at with these reported N characters, most of the called insertion (normal ACGT sequence) can be found in the input sequence, further suggesting these N calls could be erroneous. My data is confidential, so unfortunately I cannot share it. However, I notice that in the test dataset output there are Ns reported in some of the insertion outcomes which don't occur in the input reads, e.g. M00724:1:000000000-DC7GJ:1:1102:19229:3583 in hCas9-TRAC-a_R*.fastq.gz, which has AGA-N-CAT.
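For anyone wanting to quantify this on their own output, here is a minimal sketch (not part of crisprseq) that counts the rows with an N in the three insertion columns named above. It assumes the CSV has a header row with exactly those column names, and it treats empty cells and the strings NA/NaN/None as missing values, which is an assumption about how the file encodes gaps. The function names are hypothetical.

```python
import csv
from collections import Counter

# Column names as quoted in the report above.
INSERTION_COLUMNS = ("pre_ins_nt", "ins_nt", "post_ins_nt")

# Assumption: these strings mark a missing value rather than a sequence.
MISSING = {"", "NA", "NAN", "NONE"}


def count_n_calls(rows, columns=INSERTION_COLUMNS):
    """Count, per column, the rows whose sequence contains at least one N."""
    counts = Counter()
    for row in rows:
        for col in columns:
            value = (row.get(col) or "").strip().upper()
            if value not in MISSING and "N" in value:
                counts[col] += 1
    return counts


def count_n_calls_in_file(path):
    """Convenience wrapper: read a CSV with a header row and count N calls."""
    with open(path, newline="") as handle:
        return count_n_calls(csv.DictReader(handle))
```

Calling `count_n_calls_in_file` on an indels.csv returns a `Counter` keyed by column name, so a non-zero value for `ins_nt` means at least one reported insertion contains an N.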

Command used and terminal output

No response

Relevant files

No response

System information

No response

mirpedrol commented 2 months ago

Hello @alan-tracey, thanks for reporting this. I had a look at the hCas9-TRAC sample from the test data, and in this case the masked bases are due to low quality. Even if the original reads have good quality, we use pear (https://cme.h-its.org/exelixis/web/software/pear/doc.html) to join the R1 and R2 reads. pear computes a new quality score for the overlapping bases, and if a base differs between R1 and R2, its new quality will be lower. You can check the assembled fastq files in the output directory preprocessing/pear to see whether the same thing happens with your samples.

alan-tracey commented 2 months ago

Hi Júlia

In my case I don't think that explains it since I am not using paired end sequencing, rather I am using just R1 (R2 is only used to capture a barcode sequence and is then discarded).

Thanks, Alan

-- Alan Tracey, Bioinformatician, bit.bio (www.bit.bio)

mirpedrol commented 2 months ago

Could you check whether the Ns are actually added by seqtk? You can find the output files of this tool in preprocessing/seqtk. By default we run seqtk with the parameters -q 20 -L 80 -n N, which should mask bases with a quality lower than 20. Are you modifying these parameters, or running the pipeline with all the defaults?
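To make the expected behaviour concrete, here is a rough Python sketch of what those defaults are described as doing. It is an illustration only, not seqtk's actual code. It assumes Phred+33 quality encoding, and the function names are made up for this example.

```python
def mask_low_quality(seq, qual, min_quality=20, mask_char="N", phred_offset=33):
    """Replace each base whose Phred score is below min_quality with mask_char.

    Mirrors the intent of `-q 20 -n N`; assumes Phred+33 encoded qualities.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    return "".join(
        mask_char if ord(q) - phred_offset < min_quality else base
        for base, q in zip(seq, qual)
    )


def passes_length_filter(seq, min_length=80):
    """Mirror the intent of `-L 80`: keep only reads of at least min_length."""
    return len(seq) >= min_length


# 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10, '4' encodes Q19.
print(mask_low_quality("ACGT", "I5+4"))  # prints ACNN
```

Under this reading a base at exactly Q20 is kept and only bases below 20 become N, so an N at a position with Q > 30 in the input to seqtk would be unexpected.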

alan-tracey commented 2 months ago

Hi Júlia

I've checked and there are reads in preprocessing/seqtk that contain 'N's. I have run the pipeline with default settings and --overrepresented.

Thanks, Alan

mirpedrol commented 2 months ago

Is the quality of those Ns higher than 20?

mirpedrol commented 2 months ago

If you are using --overrepresented, the input reads to seqtk are under <outdir>/preprocessing/cutadapt. Could you double-check whether the reads that contain Ns after seqtk also contain these Ns after cutadapt, and whether the Ns are absent from the raw input fastq files? Thanks for helping with this debugging :)
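One way to run this comparison over a whole sample rather than read by read is sketched below. It is not part of the pipeline. It assumes plain four-line FASTQ records, Phred+33 qualities, and that a read keeps the same length and coordinates in both files, which should hold when the second step only masks bases. All function names are hypothetical.

```python
import gzip


def read_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        qual = next(it, "").strip()
        yield header[1:].split()[0], seq, qual


def introduced_n_positions(before, after, phred_offset=33):
    """Find bases that are N in `after` but were a real base in `before`.

    Returns {read_id: [(0-based position, base_before, phred_before), ...]}.
    Holds every `before` record in memory, so it suits modest file sizes.
    """
    before_by_id = {rid: (seq, qual) for rid, seq, qual in before}
    report = {}
    for rid, seq_after, _ in after:
        if rid not in before_by_id:
            continue
        seq_before, qual_before = before_by_id[rid]
        hits = [
            (i, b, ord(q) - phred_offset)
            for i, (b, a, q) in enumerate(zip(seq_before, seq_after, qual_before))
            if a == "N" and b != "N"
        ]
        if hits:
            report[rid] = hits
    return report


def compare_files(before_path, after_path):
    """Compare two gzipped FASTQ files, e.g. cutadapt output vs seqtk output."""
    with gzip.open(before_path, "rt") as b, gzip.open(after_path, "rt") as a:
        return introduced_n_positions(read_fastq(b), read_fastq(a))
```

Calling `compare_files` with the cutadapt file first and the seqtk file second lists, for each affected read, the masked positions together with the quality they had before masking. If any of those qualities are 20 or higher, the masking is not explained by the `-q 20` threshold.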

alan-tracey commented 2 months ago

It looks like there are N bases with quality <20 (here comparing seqtk vs cutadapt as you suggested):

zgrep -A4 "M07996:142:000000000-LKRPM:1:1101:15613:1876" S04B2_CIITA.seqtk-seq.fastq.gz
@M07996:142:000000000-LKRPM:1:1101:15613:1876 1:N:0:GCCTTCGGGA+CCCACGATTT
GGTGACTGAGCATTGTCTTCCCTCCCAGGCAGCTCACAGTGTGCCACCNNGGANTTGGGGCCCCTAGAAGGTGGCTTACCTGGAGCTTCTTAACAGCGATGCTGACCCCGTGTGCCTCTACCACTTCTATNACCNNNTGGN
+
@.*** <1==G1..<GH/
@M07996:142:000000000-LKRPM:1:1101:17082:1937 1:N:0:GCCTTCGGTA+CCAACGATTT

zgrep -A4 "M07996:142:000000000-LKRPM:1:1101:15613:1876" S04B2_CIITA.trim.fastq.gz
@M07996:142:000000000-LKRPM:1:1101:15613:1876 1:N:0:GCCTTCGGGA+CCCACGATTT
GGTGACTGAGCATTGTCTTCCCTCCCAGGCAGCTCACAGTGTGCCACCATGGAGTTGGGGCCCCTAGAAGGTGGCTTACCTGGAGCTTCTTAACAGCGATGCTGACCCCGTGTGCCTCTACCACTTCTATGACCAGATGGA
+
@.*** <1==G1..<GH/
@M07996:142:000000000-LKRPM:1:1101:14827:1933 1:N:0:GCCTTCGGTA+CCAACGATTT

nf-core / crisprseq