pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments
Other
268 stars 92 forks source link

CRISPRessoWGS return empty fastq files and summary results #163

Closed clytiaxu closed 2 years ago

clytiaxu commented 2 years ago

Hi there, I am using the CRISPRessoWGS function to evaluate the genome-wide off-target events in CRISPR/Cas9. Aligned and sorted bam file of whole-genome sequencing data and potential off-target regions were used as input for CRISPRessoWGS.

The code I used was: CRISPRessoWGS -p 40 --min_reads_to_use_region 3 -b 1-1.sorted.bam -r hg38.fa -f test.bed --name test --skip_failed

However, the indel or substitution events in the result file were always "NA". And I checked the ANALYZED_REGIONS filefold, and found that most of targeted regions had correct bam files, but the corresponding fastq.gz files were all empty. I guessed that some problems happened during the transfromation of bam to fastq, but I am not familiar with python scripts and cannot find where the bug was. Was there any problem with my code? Or some other reasons caused the erro? It would be great help if you can give some advices. Many thanks.

kclem commented 2 years ago

Hi @clytiaxu,

Thanks for using CRISPResso!

CRISPRessoWGS only considers reads that completely overlap the regions in test.bed.

The fastq.gz files could be empty for one of three reasons: 1) The regions in test.bed specify regions in a different format than the input bam file. E.g. if your bam file has reads aligned to a genome with 'chr1', 'chr2', 'chr3'... but your region file specifies '1', '2', '3', etc. 2) There are no reads that overlap the region in test.bed. I.e. there are simply no reads aligned to those regions. 3) Of the reads that overlap the region in test.bed, none of the reads completely overlap the region. This is usually due to the region being longer than the sequencing read length.

Can you try the following: 1) For a region that has no reads check to see if there are any reads there. For example, if my region from test.bed is chr1 444 555 AAAAAA you can use samtools to check to see if reads are aligned there in your input sample.bam file using the command: samtools view -c sample.bam chr1:444-555. The -c flag will return the count of reads there. If the count is 0, make sure that you are specifying the correct region coordinates and in the correct format. 2) If the test.bed region is too long, reads will be discarded because they don't completely cover the region. Try shrinking your region sizes to 10bp and running with that small region file. Then try increasing your region files to something around 20-50bp. As you increase your region size, fewer reads will completely cover the region, so you'll have to balance region length with the number of reads you have in your analysis. I usually find that 20-50bp is a good region size.

Let me know if those help!

clytiaxu commented 2 years ago

Dear Kendell, Thanks very much for your detailed reply! I tested my data following your suggestions, and found that the erro was caused by the region size in my test.bed. I set the region size as 200bp (and my reads were 150bp paired-end), which caused no reads could span this region. The CRISPRessoWGS now works fine after I changed the region size to 20~30bp. Thanks again for your help! Clytia Xu 2021-10-22

@.***

From: Kendell Clement Date: 2021-10-22 09:41 To: pinellolab/CRISPResso2 CC: clytiaxu; Mention Subject: Re: [pinellolab/CRISPResso2] CRISPRessoWGS return empty fastq files and summary results (Issue #163) Hi @clytiaxu, Thanks for using CRISPResso! CRISPRessoWGS only considers reads that completely overlap the regions in test.bed. The fastq.gz files could be empty for one of three reasons: The regions in test.bed specify regions in a different format than the input bam file. E.g. if your bam file has reads aligned to a genome with 'chr1', 'chr2', 'chr3'... but your region file specifies '1', '2', '3', etc. There are no reads that overlap the region in test.bed. I.e. there are simply no reads aligned to those regions. Of the reads that overlap the region in test.bed, none of the reads completely overlap the region. This is usually due to the region being longer than the sequencing read length. Can you try the following: For a region that has no reads check to see if there are any reads there. For example, if my region from test.bed is chr1 444 555 AAAAAA you can use samtools to check to see if reads are aligned there in your input sample.bam file using the command: samtools view -c sample.bam chr1:444-555. The -c flag will return the count of reads there. If the count is 0, make sure that you are specifying the correct region coordinates and in the correct format. If the test.bed region is too long, reads will be discarded because they don't completely cover the region. Try shrinking your region sizes to 10bp and running with that small region file. Then try increasing your region files to something around 20-50bp. As you increase your region size, fewer reads will completely cover the region, so you'll have to balance region length with the number of reads you have in your analysis. I usually find that 20-50bp is a good region size. Let me know if those help! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

kclem commented 2 years ago

Good to hear!

Happy analyzing!