pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments
Other
277 stars 95 forks source link

CRISPRessoPooled failed to Demultiplex reads #115

Closed KellenX closed 3 years ago

KellenX commented 3 years ago

Describe the bug Hello, we are using CRISPRessoPooled(amplicon mode), to map reads to multiple amplicons and analyze base editing. However, no reads was mapped to any of my five reference amplicon sequences, therefore CRISPResso ended with "Skipping amplicon [*] because no reads align to it"

However, when I tried to separate reads into five pairs of FASTQ files with homemade script then run CRISPResso on each pair, CRISPResso worked perfectly.

With --keep_intermediate, AMPLICONS.fa captured all my five reference amplicon sequences, but MAPPING_STATISTICS.txt says: """ READS IN INPUTS:51049 READS AFTER PREPROCESSING:50004 READS ALIGNED:0 """

Five .fastq.gz files were generated but were all empty (size = 0b)

Expected behavior Pooled reads get sorted into five FASTQ files.

To reproduce Python 2.7.15 [CRISPResso version 2.1.2] installed on July 9, 2021 with "$ conda create -n crispresso2_env -c bioconda crispresso2 python=2.7"

The shell command: "$ CRISPRessoPooled -r1 VT01-mix5_R1_001.fastq.gz -r2 VT01-mix5_R2_001.fastq.gz -f VT01-mix5.txt -n VT01-mix5 -o crispressoPooled_BE_VT01-mix5 --base_editor_output --conversion_nuc_from C --conversion_nuc_to T -w 10 -wc -10"

In VT01-mix5.txt: """ Site2 CCTGGCTGAGCTAACTGTGACAGCATGTGGTAATTTTCCAGCCCGCTGGCCCTGTAAAGGAAACTGGAACACAAAGCATAGACTGCGGGGCGGGCCAGCCTGAATAGCTGCAAACAAGTGCAGAATATCTGATGATGTCATACGCACAGTTTGAC GAACACAAAGCATAGACTGC NA NA Site4 CTGGGTGGAAGGAAGGGAGGAAGGGCGAGGCAGAGGGTCCAAAGCAGGATGACAGGCAGGGGCACCGCGGCGCCCCGGTGGCACTGCGGCTGGAGGTGGGGGTTAAAGCGGAGACTCTGGTGCTGTGTGACTACAGTGGGGGCCCTGCCCTCTCTGAGCCCCCGCCTCCAGGCCTGTGTGTGTGTCTCCGTTCGGGTTGA GGCACTGCGGCTGGAGGTGG NA NA HBG115 ccccttccccacactatctcaatgcaaatatctgtctgaaacggtccctggctaaactccacccatgggttggccagccttgccttgaccaatagccttgacaaggcaaacttgaccaatagtcttagagtatccagtgaggccaggggccggcggctggctagggatgaagaat cttgaccaatagccttgaca NA NA EMX GCGGCTGCGACCATGTTCCAGCCCGCGGCCAAGCGCGGCTTTACCATAGAGTCCTTGGTGGCCAAGGACGGCGGCACCGGCGGGGGCACTGGCGGCGGGGGCGCGGGCTCCCATCTCCTGGCGGCGGCCGCCTCCGAGGAACCGCTCCGGCCCACGGCGCTCAACTACCCTCACCCCAG AAGGACGGCGGCACCGGCGG NA NA FGF6 CTGCTGACATGAAACCAAAGCCTCCATCGGGCACTCGGGTTGAGAGCAGAGGGACCCAGGCTGAGCCGCGGCCGGTAGAGACCATGGCTCGGGGACGCTCTCTAGCTCGCCCGCTTGCTCCGTCCCTAGTTGATGATATTTGATTTGACCTATCTTTATAGTTGCTTCAGCAGCCCGGCCCTCCCCTCCACCCATCATCGCCCTGACGTCAACCCGCCCAGCTCACTTATCCAGGGCTGTAAC CCCGCTTGCTCCGTCCCTAG NA NA """

I'd like to send the input files VT01-mix5_R1_001.fastq.gz, VT01-mix5_R2_001.fastq.gz and intermediate files to you by email if you need them.

Debug output Paste the entire output when you run CRISPResso with the flag --debug.

""" $ CRISPRessoPooled --base_editor_output --conversion_nuc_from C --conversion_nuc_to T -w 10 -wc -10 -r1 VT01-mix5_R1_001.fastq.gz -r2 VT01-mix5_R2_001.fastq.gz -f VT01-mix5.txt -n VT01-mix5 -o crispressoPooled_BE_VT01-mix5 --keep_intermediate --debug

                         ~~~CRISPRessoPooled~~~
  -Analysis of CRISPR/Cas9 outcomes from POOLED deep sequencing data-

           _                                                 _
          '  )                                              '  )
          .-'           _______________________             .-'
         (____         | __  __  __     __ __  |           (____
      C)|     \        ||__)/  \/  \|  |_ |  \ |        C)|     \
        \     /        ||   \__/\__/|__|__|__/ |          \     /
         \___/         |_______________________|           \___/

                       [CRISPResso version 2.1.2]

[Note that starting in version 2.1.0 insertion quantification has been changed to only include insertions completely contained by the quantification window. To use the legacy quantification method (i.e. include insertions directly adjacent to the quantification window) please use the parameter --use_legacy_insertion_quantification] [For support contact kclement@mgh.harvard.edu]

INFO @ Thu, 22 Jul 2021 16:17:27: Checking dependencies...

INFO @ Thu, 22 Jul 2021 16:17:27: All the required dependencies are present!

INFO @ Thu, 22 Jul 2021 16:17:27: Only the Amplicon description file was provided. The analysis will be perfomed using only the provided amplicons sequences.

INFO @ Thu, 22 Jul 2021 16:17:27: Creating Folder /cache/home/Data20210512/crispressoPooled_BE_VT01-mix5/CRISPRessoPooled_on_VT01-mix5

WARNING @ Thu, 22 Jul 2021 16:17:27: Folder /cache/home/Data20210512/crispressoPooled_BE_VT01-mix5/CRISPRessoPooled_on_VT01-mix5 already exists.

INFO @ Thu, 22 Jul 2021 16:17:27: Processing input

INFO @ Thu, 22 Jul 2021 16:17:27: Merging paired sequences with Flash...

INFO @ Thu, 22 Jul 2021 16:17:28: Done!

INFO @ Thu, 22 Jul 2021 16:17:29: Creating a custom index file with all the amplicons...

INFO @ Thu, 22 Jul 2021 16:17:29: Align reads to the amplicons...

INFO @ Thu, 22 Jul 2021 16:17:29: Alignment command: bowtie2 -x /cache/home/Data20210512/crispressoPooled_BE_VT01-mix5/CRISPRessoPooled_on_VT01-mix5/CUSTOM_BOWTIE2_INDEX -p 1 --end-to-end -N 0 --np 0 --mp 3,2 --score-min L,-5,-1.2 -U /cache/home/Data20210512/crispressoPooled_BE_VT01-mix5/CRISPRessoPooled_on_VT01-mix5/out.extendedFrags.fastq.gz 2>>/cache/home/Data20210512/crispressoPooled_BE_VT01-mix5/CRISPRessoPooled_on_VT01-mix5/CRISPRessoPooled_RUNNING_LOG.txt | samtools view -bS - > /cache/home/Data20210512/crispressoPooled_BE_VT01-mix5/CRISPRessoPooled_on_VT01-mix5/CRISPResso_AMPLICONS_ALIGNED.bam

INFO @ Thu, 22 Jul 2021 16:17:29: Demultiplex reads and run CRISPResso on each amplicon...

INFO @ Thu, 22 Jul 2021 16:17:29:

Processing:Site2

gzip: stdin: unexpected end of file WARNING @ Thu, 22 Jul 2021 16:17:29: Skipping amplicon [Site2] because no reads align to it

INFO @ Thu, 22 Jul 2021 16:17:29:

Processing:Site4

gzip: stdin: unexpected end of file WARNING @ Thu, 22 Jul 2021 16:17:29: Skipping amplicon [Site4] because no reads align to it

INFO @ Thu, 22 Jul 2021 16:17:29:

Processing:HBG115

gzip: stdin: unexpected end of file WARNING @ Thu, 22 Jul 2021 16:17:29: Skipping amplicon [HBG115] because no reads align to it

INFO @ Thu, 22 Jul 2021 16:17:29:

Processing:EMX

gzip: stdin: unexpected end of file WARNING @ Thu, 22 Jul 2021 16:17:29: Skipping amplicon [EMX] because no reads align to it

INFO @ Thu, 22 Jul 2021 16:17:29:

Processing:FGF6

gzip: stdin: unexpected end of file WARNING @ Thu, 22 Jul 2021 16:17:29: Skipping amplicon [FGF6] because no reads align to it

INFO @ Thu, 22 Jul 2021 16:17:29: Running CRISPResso with 1 processes

INFO @ Thu, 22 Jul 2021 16:17:29: Finished all amplicons

WARNING @ Thu, 22 Jul 2021 16:17:30: Skipping the folder CRISPResso_on_Site2: not enough reads, incomplete, or empty folder.

WARNING @ Thu, 22 Jul 2021 16:17:30: Skipping the folder CRISPResso_on_Site4: not enough reads, incomplete, or empty folder.

WARNING @ Thu, 22 Jul 2021 16:17:30: Skipping the folder CRISPResso_on_HBG115: not enough reads, incomplete, or empty folder.

WARNING @ Thu, 22 Jul 2021 16:17:30: Skipping the folder CRISPResso_on_EMX: not enough reads, incomplete, or empty folder.

WARNING @ Thu, 22 Jul 2021 16:17:30: Skipping the folder CRISPResso_on_FGF6: not enough reads, incomplete, or empty folder.

WARNING @ Thu, 22 Jul 2021 16:17:30: Less than half (0/51049) of reads aligned to amplicons. Finding most frequent unaligned reads.

WARNING @ Thu, 22 Jul 2021 16:17:30: Perhaps one or more of the given amplicon sequences were incomplete or incorrect. Below is a list of the most frequent unaligned reads (in the first 10000 unaligned reads). Check this list to see if an amplicon is among these reads.

INFO @ Thu, 22 Jul 2021 16:17:30: All Done!

                                   _
                                  '  )
                                  .-'
                                 (____
                              C)|     \
                                \     /
                                 \___/

"""

kclem commented 3 years ago

Hi @KellenX,

Thanks for using CRISPResso!

Have you tried relaxing the --min_reads_to_use_region parameter? The default requires 1000 reads to align to an amplicon before it is analyzed by CRISPResso. If you think that reads are aligning to these regions, try setting this value to a lower number (e.g. --min_reads_to_use_region 1) and see if it processes those reads.

However, more concerning is the error gzip: stdin: unexpected end of file and the fact that your region files are empty (0 bytes). Are your input fastq files gzipped? Or are they plain text files that end in the suffix .gz?

Can you run CRISPResso Pooled on the test data? Download the test inputs here: https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Cas9.amplicons.txt https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Both.Cas9.fastq

And if you run CRISPRessoPooled -r1 Both.Cas9.fastq -f Cas9.amplicons.txt -p 2 --keep_intermediate --min_reads_to_use_region 100 do you get output?

KellenX commented 3 years ago

Hi @kclem Thank you for you guidance. However, your sample code didn't work on the platform I'm using.

I downloaded these three test inputs: https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Cas9.amplicons.txt https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Both.Cas9.fastq https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Cas9.regions.txt

Then ran your command CRISPRessoPooled -r1 Both.Cas9.fastq -f Cas9.amplicons.txt -p 2 --keep_intermediate --min_reads_to_use_region 100

This is what I got: $ !CRISPRessoPooled CRISPRessoPooled -r1 Both.Cas9.fastq -f Cas9.amplicons.txt -p 2 --keep_intermediate --min_reads_to_use_region 100

                         ~~~CRISPRessoPooled~~~
  -Analysis of CRISPR/Cas9 outcomes from POOLED deep sequencing data-

           _                                                 _
          '  )                                              '  )
          .-'           _______________________             .-'
         (____         | __  __  __     __ __  |           (____
      C)|     \        ||__)/  \/  \|  |_ |  \ |        C)|     \
        \     /        ||   \__/\__/|__|__|__/ |          \     /
         \___/         |_______________________|           \___/

                       [CRISPResso version 2.1.2]

[Note that starting in version 2.1.0 insertion quantification has been changed to only include insertions completely contained by the quantification window. To use the legacy quantification method (i.e. include insertions directly adjacent to the quantification window) please use the parameter --use_legacy_insertion_quantification] [For support contact kclement@mgh.harvard.edu]

INFO @ Fri, 23 Jul 2021 16:11:26: Checking dependencies...

INFO @ Fri, 23 Jul 2021 16:11:26: All the required dependencies are present!

INFO @ Fri, 23 Jul 2021 16:11:26: Only the Amplicon description file was provided. The analysis will be perfomed using only the provided amplicons sequences.

INFO @ Fri, 23 Jul 2021 16:11:26: Creating Folder CRISPRessoPooled_on_Both.Cas9

INFO @ Fri, 23 Jul 2021 16:11:26: Done!

INFO @ Fri, 23 Jul 2021 16:11:26: Processing input

INFO @ Fri, 23 Jul 2021 16:11:26: Creating a custom index file with all the amplicons...

INFO @ Fri, 23 Jul 2021 16:11:27: Align reads to the amplicons...

INFO @ Fri, 23 Jul 2021 16:11:27: Alignment command: bowtie2 -x CRISPRessoPooled_on_Both.Cas9/CUSTOM_BOWTIE2_INDEX -p 2 --end-to-end -N 0 --np 0 --mp 3,2 --score-min L,-5,-1.2 -U CRISPRessoPooled_on_Both.Cas9/Both.Cas9.fastq 2>>CRISPRessoPooled_on_Both.Cas9/CRISPRessoPooled_RUNNING_LOG.txt | samtools view -bS - > CRISPRessoPooled_on_Both.Cas9/CRISPResso_AMPLICONS_ALIGNED.bam

INFO @ Fri, 23 Jul 2021 16:11:28: Demultiplex reads and run CRISPResso on each amplicon...

INFO @ Fri, 23 Jul 2021 16:11:28:

Processing:FANC

gzip: stdin: unexpected end of file WARNING @ Fri, 23 Jul 2021 16:11:28: Skipping amplicon [FANC] because no reads align to it

INFO @ Fri, 23 Jul 2021 16:11:28:

Processing:HEK3

gzip: stdin: unexpected end of file WARNING @ Fri, 23 Jul 2021 16:11:28: Skipping amplicon [HEK3] because no reads align to it

INFO @ Fri, 23 Jul 2021 16:11:28: Running CRISPResso with 2 processes

INFO @ Fri, 23 Jul 2021 16:11:28: Finished all amplicons

WARNING @ Fri, 23 Jul 2021 16:11:28: Skipping the folder CRISPResso_on_FANC: not enough reads, incomplete, or empty folder.

WARNING @ Fri, 23 Jul 2021 16:11:28: Skipping the folder CRISPResso_on_HEK3: not enough reads, incomplete, or empty folder.

WARNING @ Fri, 23 Jul 2021 16:11:29: Less than half (0/500) of reads aligned to amplicons. Finding most frequent unaligned reads.

WARNING @ Fri, 23 Jul 2021 16:11:29: Perhaps one or more of the given amplicon sequences were incomplete or incorrect. Below is a list of the most frequent unaligned reads (in the first 10000 unaligned reads). Check this list to see if an amplicon is among these reads.

INFO @ Fri, 23 Jul 2021 16:11:29: All Done!

                                   _
                                  '  )
                                  .-'
                                 (____
                              C)|     \
                                \     /
                                 \___/
kclem commented 3 years ago

Hm.. those files work for me on a couple of systems and should work for you as well.

How have you installed CRISPResso, or how are you running it?

KellenX commented 3 years ago

I used conda 4.10.3 to install CRISPResso2 on my university computing cluster Operating System: CentOS Linux 7 (Core) CPE OS Name: cpe:/o:centos:centos:7 Kernel: Linux 3.10.0-1160.25.1.el7.x86_64

(base) Me@Cluster:~$ conda create -n crispresso2_env -c bioconda crispresso2 python=2.7 (base) Me@Cluster:~$ conda activate crispresso2_env (crispresso2_env) Me@Cluster:~$ CRISPResso -h

This installation allowed me to run CRISPResso on one pair of fastq.gz inputs at a time. Can you please provide me a list of inputs to test running CRISPRessoBatch?

kclem commented 3 years ago

For batch: Download these files: https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/FANC.local.batch https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/FANC.Cas9.fastq https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/FANC.Untreated.fastq

Then you should be able to run:

CRISPRessoBatch -bs FANC.local.batch -a CGGATGTTCCAATCAGTACGCAGAGAGTCGCCGTCTCCAAGGTGAAAGCGGAAGTAGGGCCTTCGCGCACCTCATGGAATCCCTTCTGCAGCACCTGGATCGCTTTTCCGAGCTTCTGGCGGTCTCAAGCACTACCTACGTCAGCACCTGGGACCCCGCCACCGTGCGCCGGGCCTTGCAGTGGGCGCGCTACCTGCGCCACATCCATCGGCGCTTTGGTCGG -g GGAATCCCTTCTGCAGCACC -p 2 --debug --base_editor

KellenX commented 3 years ago

I just tested CRISPRessoBatch and it worked perfectly! It seems that this mysterious bug impaired CRISPRessoPooled only. I'm willing to help with finding it out.

kclem commented 3 years ago

tldr; in your conda env run:

conda install tbb=2020.2

and then run your CRISPRessoPooled command. I hope that works.

Full answer: For the CRISPRessoPooled bit, I think I may know what is going on. I don't have a centos machine, but I tried the following:

docker run -it centos:centos7 /bin/bash
yum install -y wget unzip gcc openssl
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /root/miniconda3
/root/miniconda3/condabin/conda init
source /root/.bashrc
conda create -y -n crispresso2_env -c bioconda crispresso2 python=2.7
conda activate crispresso2_env
conda install -y -c conda-forge openssl=1.0.2 #for samtools
wget https://github.com/pinellolab/CRISPResso2/archive/refs/heads/master.zip
unzip master.zip
cd CRISPResso2-master/
python setup.py install
wget https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Cas9.amplicons.txt
wget https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Both.Cas9.fastq
wget https://raw.githubusercontent.com/pinellolab/CRISPResso2/master/tests/Cas9.regions.txt
CRISPRessoPooled -r1 Both.Cas9.fastq -f Cas9.amplicons.txt -p 2 --keep_intermediate --min_reads_to_use_region 100 

Then, in the CRISPRessoPooled_RUNNING_LOG.txt I see: /root/miniconda3/envs/crispresso2_env/bin/bowtie2-build-s: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory

This can be solved by running conda install tbb=2020.2

Now, everything works correctly on my end. Does this solve your problem as well?

KellenX commented 3 years ago

Thank you for this reply! This problem is eventually solved by downgrading tbb.

Before, CRISPRessoPooled failed on my end with

Name Version Build Channel tbb 2021.3.0 h4bd325d_0 conda-forge

By executing conda install tbb=2020.2 I get

Name Version Build Channel tbb 2020.2 h4bd325d_4 conda-forge

Everything works well now.