pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments
Other
278 stars 95 forks source link

CRISPRessoPooled mixed-mode does not work on test data #276

Closed AndreaBarghetti closed 1 year ago

AndreaBarghetti commented 1 year ago

Hi,

I have been trying, unsuccessfully, to run CRISPResso2 pooled with mixed-mode, as according to the documentation.

I noticed that there are tests for CRISPRessoPooled in "amplicon mode":

echo Running CRISPRessoPooled CRISPRessoPooled -r1 Both.Cas9.fastq -f Cas9.amplicons.txt -p 2 --keep_intermediate --min_reads_to_use_region 100 --debug &> CRISPRessoPooled_on_Both.Cas9.log but I couldn't see any tests for "the genome mode" and "mixed-mode"

I tried to run CRISPRessoPooled in mixed mode with the test datasets:

docker run -d -v ${PWD}:/DATA -w /DATA -i pinellolab/crispresso2 CRISPRessoPooled -r1 Both.Cas9.fastq -x smallGenome/smallGenome -f Cas9.amplicons.txt -p 2 --keep_intermediate --min_reads_to_use_region 100 --debug -o ./output

It runs without errors, but the output is not what one would expect:

CRISPRessoPooled_RUNNING_LOG.txt:

CRISPResso version 2.2.11a [Command used]: /opt/conda/bin/CRISPRessoPooled -r1 tests_crispresso2/Both.Cas9.fastq -x tests_crispresso2/smallGenome/smallGenome -f tests_crispresso2/Cas9.amplicons.txt -p 8 --keep_intermediate --min_reads_to_useregion 100 --debug -o ./output [Execution log]: Processing input Unable to find matches for header values. Using the default header values and order. Header variable names in order: ['amplicon_name', 'amplicon_seq', 'guide_seq', 'expected_hdr_amplicon_seq', 'coding_seq'] Mapping amplicons to the reference genome... The amplicon [FANC] was mapped to: chr11:1047-1270 The amplicon [HEK3] was mapped to: chr9:966-1198 The uncompressed reference fasta file for tests_crispresso2/smallGenome/smallGenome is already present! Skipping generation. Aligning reads to the provided genome index... Aligning with command: bowtie2 -x tests_crispresso2/smallGenome/smallGenome -p 8 --end-to-end -N 0 --np 0 --mp 3,2 --score-min L,-5,-1.2000000000000002 -U /DATA/output/CRISPRessoPooled_on_Both.Cas9/Both.Cas9.fastq 2>>/DATA/output/CRISPRessoPooled_on_Both.Cas9/CRISPRessoPooled_RUNNING_LOG.txt| samtools view -bS - | samtools sort -@ 8 - -o /DATA/output/CRISPRessoPooled_on_Both.Cas9/Both.Cas9_GENOME_ALIGNED.bam 500 reads; of these: 500 (100.00%) were unpaired; of these: 2 (0.40%) aligned 0 times 480 (96.00%) aligned exactly 1 time 18 (3.60%) aligned >1 times 99.60% overall alignment rate Preparing to demultiplex reads aligned to the genome... Wrote demultiplexing commands to /DATA/output/CRISPRessoPooled_on_Both.Cas9/DEMUX_COMMANDS.txt Demultiplexing reads by location (2 genomic regions)... Processing amplicon: FANC The amplicon FANC doesn't have any reads mapped to it! Please check your amplicon sequence. Processing amplicon: HEK3 The amplicon HEK3 doesn't have any reads mapped to it! Please check your amplicon sequence. Reporting problematic regions... Skipping the folder CRISPResso_on_FANC: not enough reads, incomplete, or empty folder. Skipping the folder CRISPResso_onHEK3: not enough reads, incomplete, or empty folder. Less than half (0/500) of reads aligned to amplicons. Finding most frequent unaligned reads. Perhaps one or more of the given amplicon sequences were incomplete or incorrect. Below is a list of the most frequent unaligned reads (in the first 10000 unaligned reads). Check this list to see if an amplicon is among these reads. GAGTCTTAGACACAACATTGAAGATGGAAGCGTTCGGTAGCAGATTTGGGTCTAGGGGAATATGGTCCTTCTTGAGTGGATAACAGCTGCTGGG ATTACGGATGTTCCAATCAGTTCCGGGGATTAGCGAACTTAGAGCACACGTCTGAACTCCAGTCACCGATGTATATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAACACAAAGCATAGATTGCGGGACGGGTAAGCATGACTAGTTGCAAACAAGTGTAGAATATATGATGATGTCATACGAACAGTTTGACAGATGGGGCTGG All Done!

CRISPResso_amplicon_aligned_locations.txt

amplicon_name chr_id bpstart bpend strand reference_seq FANC chr11 1047 1270 - CCGACCAAAGCGCCGATGGATGTGGCGCAGGTAGCGCGCCCACTGCAAGGCCCGGCGCACGGTGGCGGGGTCCCAGGTGCTGACGTAGGTAGTGCTTGAGACCGCCAGAAGCTCGGAAAAGCGATCCAGGTGCTGCAGAAGGGATTCCATGAGGTGCGCGAAGGCCCTACTTCCGCTTTCACCTTGGAGACGGCGACTCTCTGCGTACTGATTGGAACATCCG HEK3 chr9 966 1198 + GGAAACGCCCATGCAATTAGTCTATTTCTGCTGCAAGTAAGCATGCATTTGTAGGCTTGATGCTTTTTTTCTGCTTCTCCAGCCCTGGCCTGGGTCAATCCTTGGGGCCCAGACTGAGCACGTGATGGCAGAGGAAAGGAAGCCCTGCTTCCTCCAGAGGGCGTCGCAGGACAGCTTTTCCTAGACAGGGGCTAGTATGTGCAGCTCCTGCACCGGGATACTGGTTGACAAG

SAMPLES_QUANTIFICATION_SUMMARY.txt

Name Unmodified% Modified% Reads_total Reads_aligned Unmodified Modified Discarded Insertions Deletions Substitutions Only Insertions Only Deletions Only Substitutions Insertions and Deletions Insertions and Substitutions Deletions and Substitutions Insertions Deletions and Substitutions FANC NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA HEK3 NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Is the mixed-mode still supported?

Thanks!

Colelyman commented 1 year ago

Hi @AndreaBarghetti,

Thanks for using CRISPResso and submitting this issue! The mixed-mode of CRISPRessoPooled is currently supported, and there is definitely something weird going on here. I am looking into it and will keep this thread updated with what I find.

Have you happened to try the mixed-mode on any different datasets? If so, do you see similar results?

Thanks, Cole

AndreaBarghetti commented 1 year ago

Thanks for working on it.

I tried on my own dataset but got the same issue.

I have only tried it using the docker image however

Colelyman commented 1 year ago

The issue with this particular case is that the reads don't exactly align to the start/end of the amplicons (see attached image of the aligned reads on top compared to the aligned amplicon on bottom). You can remedy this by adding the --demultiplex_only_at_amplicons which resolves the issue by including reads that overlap with the amplicon instead of demultiplexing by only exact position alignment. In a future release we will have the --demultiplex_only_at_amplicons be the default when in mixed-mode to mitigate this issue.

Note that in the genome mode or the amplicon mode the exact start/end check isn't performed, which is why the same data runs as expected in these other modes.

Please let us know if you still see unexpected results on your data when using the --demultiplex_only_at_amplicons parameter!

Screen Shot 2023-01-27 at 10 03 56 AM

AndreaBarghetti commented 1 year ago

Have you tested it on the test dataset ?

Doesn't seem to work for me.

I tried to run CRISPResso2 on the test dataset, with --demultiplex_only_at_amplicons, as you suggested:

docker run -d -v ${PWD}:/DATA -w /DATA -i pinellolab/crispresso2 CRISPRessoPooled -r1 Both.Cas9.fastq -x smallGenome/smallGenome -f Cas9.amplicons.txt -p 8 --keep_intermediate --name AMPLICONS_AND_GENOME_SRR1046762 --min_reads_to_use_region 100 --debug --demultiplex_only_at_amplicons -o ./output3

here's the CRISPRessoPooled_RUNNING_LOG:

CRISPResso version 2.2.11a [Command used]: /opt/conda/bin/CRISPRessoPooled -r1 Both.Cas9.fastq -x smallGenome/smallGenome -f Cas9.amplicons.txt -p 8 --keep_intermediate --name AMPLICONS_AND_GENOME_SRR1046762 --min_reads_to_use_region 100 --debug --demultiplex_only_at_amplicons -o ./output3

[Execution log]: Processing input Unable to find matches for header values. Using the default header values and order. Header variable names in order: ['amplicon_name', 'amplicon_seq', 'guide_seq', 'expected_hdr_amplicon_seq', 'coding_seq'] Mapping amplicons to the reference genome... The amplicon [FANC] was mapped to: chr11:1047-1270 The amplicon [HEK3] was mapped to: chr9:966-1198 The uncompressed reference fasta file for smallGenome/smallGenome is already present! Skipping generation. Aligning reads to the provided genome index... Aligning with command: bowtie2 -x smallGenome/smallGenome -p 8 --end-to-end -N 0 --np 0 --mp 3,2 --score-min L,-5,-1.2000000000000002 -U /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/Both.Cas9.fastq 2>>/DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/CRISPRessoPooled_RUNNING_LOG.txt| samtools view -bS - | samtools sort -@ 8 - -o /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/AMPLICONS_AND_GENOME_SRR1046762_GENOME_ALIGNED.bam 500 reads; of these: 500 (100.00%) were unpaired; of these: 2 (0.40%) aligned 0 times 480 (96.00%) aligned exactly 1 time 18 (3.60%) aligned >1 times 99.60% overall alignment rate Preparing to demultiplex reads aligned to positions overlapping amplicons in the genome... Wrote demultiplexing commands to /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/DEMUX_COMMANDS.txt Demultiplexing reads by location (2 genomic regions)... Processing amplicon: FANC

The amplicon [FANC] has enough reads (248) mapped to it! Running CRISPResso!

Running CRISPResso:CRISPResso -r1 /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/MAPPED_REGIONS/REGION_chr11_1047_1270.fastq.gz -o /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762 --name FANC --trimmomatic_command trimmomatic --prime_editing_pegRNA_scaffold_min_match_length 1 --n_processes 1 --conversion_nuc_from C --flexiguide_homology 80 --default_min_aln_score 60 --aln_seed_count 5 --quantification_window_size 1 --exclude_bp_from_left 15 --keep_intermediate --conversion_nuc_to T --aln_seed_min 2 --needleman_wunsch_gap_extend -2 --min_paired_end_reads_overlap 10 --exclude_bp_from_right 15 --bowtie2_index smallGenome/smallGenome --quantification_window_center -3 --max_paired_end_reads_overlap 100 --min_average_read_quality 0 --needleman_wunsch_gap_open -20 --min_bp_quality_or_N 0 --min_frequency_alleles_around_cut_to_plot 0.2 --prime_editing_pegRNA_extension_quantification_window_size 5 --max_rows_alleles_around_cut_to_plot 50 --flash_command flash --aln_seed_len 10 --min_single_bp_quality 0 --needleman_wunsch_aln_matrix_loc EDNAFULL --needleman_wunsch_gap_incentive 1 --debug --guide_seq GGAATCCCTTCTGCAGCACC --plot_window_size 20 Processing amplicon: HEK3

The amplicon [HEK3] has enough reads (250) mapped to it! Running CRISPResso!

Running CRISPResso:CRISPResso -r1 /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/MAPPED_REGIONS/REGION_chr9_966_1198.fastq.gz -o /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762 --name HEK3 --trimmomatic_command trimmomatic --prime_editing_pegRNA_scaffold_min_match_length 1 --n_processes 1 --conversion_nuc_from C --flexiguide_homology 80 --default_min_aln_score 60 --aln_seed_count 5 --quantification_window_size 1 --exclude_bp_from_left 15 --keep_intermediate --conversion_nuc_to T --aln_seed_min 2 --needleman_wunsch_gap_extend -2 --min_paired_end_reads_overlap 10 --exclude_bp_from_right 15 --bowtie2_index smallGenome/smallGenome --quantification_window_center -3 --max_paired_end_reads_overlap 100 --min_average_read_quality 0 --needleman_wunsch_gap_open -20 --min_bp_quality_or_N 0 --min_frequency_alleles_around_cut_to_plot 0.2 --prime_editing_pegRNA_extension_quantification_window_size 5 --max_rows_alleles_around_cut_to_plot 50 --flash_command flash --aln_seed_len 10 --min_single_bp_quality 0 --needleman_wunsch_aln_matrix_loc EDNAFULL --needleman_wunsch_gap_incentive 1 --debug --guide_seq GGCCCAGACTGAGCACGTGA --plot_window_size 20

ERROR: CRISPResso amplicon #0 failed. For more information, try running the command: "CRISPResso -r1 /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/MAPPED_REGIONS/REGION_chr11_1047_1270.fastq.gz -o /DATA/output3/CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762 --name FANC --trimmomatic_command trimmomatic --prime_editing_pegRNA_scaffold_min_match_length 1 --n_processes 1 --conversion_nuc_from C --flexiguide_homology 80 --default_min_aln_score 60 --aln_seed_count 5 --quantification_window_size 1 --exclude_bp_from_left 15 --keep_intermediate --conversion_nuc_to T --aln_seed_min 2 --needleman_wunsch_gap_extend -2 --min_paired_end_reads_overlap 10 --exclude_bp_from_right 15 --bowtie2_index smallGenome/smallGenome --quantification_window_center -3 --max_paired_end_reads_overlap 100 --min_average_read_quality 0 --needleman_wunsch_gap_open -20 --min_bp_quality_or_N 0 --min_frequency_alleles_around_cut_to_plot 0.2 --prime_editing_pegRNA_extension_quantification_window_size 5 --max_rows_alleles_around_cut_to_plot 50 --flash_command flash --aln_seed_len 10 --min_single_bp_quality 0 --needleman_wunsch_aln_matrix_loc EDNAFULL --needleman_wunsch_gap_incentive 1 --debug --guide_seq GGAATCCCTTCTGCAGCACC --plot_window_size 20"

Colelyman commented 1 year ago

My apologies @AndreaBarghetti I forgot to mention that this PR https://github.com/edilytics/CRISPResso2/pull/11 needs to be merged in first, which will be released soon.

Colelyman commented 1 year ago

The fixes were released last week, would you mind seeing if it works with your data now @AndreaBarghetti?

AndreaBarghetti commented 1 year ago

I'll try ASAP and let you know. Thanks !

AndreaBarghetti commented 1 year ago

The fixes were released last week, would you mind seeing if it works with your data now @AndreaBarghetti?

I tested it on my own dataset and it all worked fine!

Thanks a lot for solving the issue so quickly, Amazing job!

Colelyman commented 1 year ago

Great to hear! Thanks for using CRISPResso2 and happy analyzing!