pinellolab / CRISPResso2

Analysis of deep sequencing data for rapid and intuitive interpretation of genome editing experiments
Other
278 stars 95 forks source link

amplicon for Targeted Sequencing #475

Open tangxj98 opened 3 months ago

tangxj98 commented 3 months ago

I am working on a CRISPR data treated by two sgRNAs and sequenced with targeted sequencing. The distance between the two sgRNA is 3000bp which is much longer than amplicon. The targeted area is the whole gene (380Kbp). Which mode should I use?

I tried to run with CRISPRessoWGS but got errors. My WGS mode command: CRISPRessoWGS -b S106_D.sorted.bam -f focused.region.txt -r refs/human/GRCh38/processed/2015_04_04/seqs_for_alignment_pipelines.ucsc_ids/bwa_index/GCA_000001405.15_GRCh38_no_alt_short_headers_nonACTG_to_N.fa --name CRISPR_S106_D -g TTAAACTTGTGTTTGCTGCG,GAGGACGTCCCTGTCGATGT

The error message asked me to try the amplicon mode:

ERROR: CRISPResso region #0 failed. For more information, try running the command: " CRISPResso -r1 CRISPRessoWGS_on_CRISPR_S106_D/ANALYZED_REGIONS/REGION_0.fastq.gz -a CTGGCTCCTTCTGTTGTTTCTCTTGGCTCCAGGACCCCCGCAGCAAACACAAGTTTAAGATCCACACGTACTCCAGCCCCAC -o CRISPRessoWGS_on_CRISPR_S106_D --name exon4-1 --max_rows_alleles_around_cut_to_plot 50 --prime_editing_pegRNA_scaffold_min_match_length 1 --needleman_wunsch_gap_extend -2 --needleman_wunsch_aln_matrix_loc EDNAFULL --conversion_nuc_to T --n_processes 1 --aln_seed_len 10 --trimmomatic_command trimmomatic --min_frequency_alleles_around_cut_to_plot 0.2 --min_bp_quality_or_N 0 --aln_seed_min 2 --flash_command flash --default_min_aln_score 60 --flexiguide_homology 80 --min_average_read_quality 0 --prime_editing_pegRNA_extension_quantification_window_size 5 --min_paired_end_reads_overlap 10 --needleman_wunsch_gap_incentive 1 --plot_window_size 20 --quantification_window_center -3 --quantification_window_size 1 --min_single_bp_quality 0 --max_paired_end_reads_overlap 100 --needleman_wunsch_gap_open -20 --conversion_nuc_from C --exclude_bp_from_left 15 --aln_seed_count 5 --exclude_bp_from_right 15 --guide_seq TTAAACTTGTGTTTGCTGCG,GAGGACGTCCCTGTCGATGT &> lo

I tried and the error message is: ERROR: The guide sequence 1 (GAGGACGTCCCTGTCGATGT) provided is not present in the amplicon sequences!

I don't know how the amplicon sequence (CTGGCTCCTTCTGTTGTTTCTCTTGGCTCCAGGACCCCCGCAGCAAACACAAGTTTAAGATCCACACGTACTCCAGCCCCAC) was made up by the code. Could you please give some suggestion about how to work with my targeted sequencing data?

Thank you very much!

kclem commented 2 months ago

Hi @tangxj98,

The amplicon sequence (CTGGCTCCTTCTGTTGTTTCTCTTGGCTCCAGGACCCCCGCAGCAAACACAAGTTTAAGATCCACACGTACTCCAGCCCCAC) is the sequence of the first region in your region file (focused.region.txt) as contained in your genome file (GCA_000001405.15_GRCh38_no_alt_short_headers_nonACTG_to_N.fa).

What is the design of your targeted sequencing experiment? Do you have two primers outside of the guides amplifying the entire interior region? If you are expecting deletion of the interior gene, the short PCR products will be artificially inflated compared to the longer non-deletion amplicons.

I would suggest designing multiple primers to generate amplicons of approximately the same size like this:

    --P1-->       <--P2--                                            --P3-->     <--P4--
              |g1                 {380Kbp gene}                              |g2

This strategy would allow you to measure small indels (P1/P2 and P3/P4) as well as the long deletion (P1/P4).

tangxj98 commented 2 months ago

Hi @kclem ,

Thanks for the reply. The design of the targeted sequencing actually contains tiles of primers that covers the whole genes. Below is how the customized targeted sequencing panel look like in genome browser.

image

The focused.region.txt I used was a bed file containing two 100bp regions spanning over the two double strand break points of the sgRNAs. Is this the right way to make the focused.region.txt? Sorry I am a newbie with CRISPR. I sincerely appreciate your suggestion.

kclem commented 2 months ago

By specifying two 100bp regions, you'll be able to estimate the indel rate at the two sites, but you won't be able to measure the rate at which the interior gene is deleted (because the reads from cells where the gene is deleted won't align to your genome).

Note that reads must cover the entire 100bp region to be included in the analysis, so I would also suggest shrinking the region to ~30bp depending on your read length.

Again, the main problem with using WGS for this analysis is that the reads that support the large gene deletion aren't aligned to the genome reference and won't be considered for analysis.

If you want to quantify indels at each target site as well as the large deletion, I would suggest creating 3 reference sequences: 1) the 100bp reference at the left cut site, 2) 100bp reference at the right cut site and 3) reference where the interior gene is deleted (probably 50bp from each arm, so 100bp in total). You can use CRISPRessoPooled to align reads to these references and analyze cutting frequencies.

I hope that helps!

tangxj98 commented 2 months ago

Hi @kclem,

I used the CRISPRessoPooled to run the job. CRISPRessoPooled -r1 S106_Par.FCAFM53VM5_L1_R1_IGTTCGCCAGT-GATAAGTCGA.fastq.gz -r2 S106_Par.FCAFM53VM5_L1_R2_IGTTCGCCAGT-GATAAGTCGA.fastq.gz -f AMPLICONS_FILE.txt --name AMPLICONS_S106_Par --max_paired_end_reads_overlap 160 --keep_intermediate >& AMPLICONS_S106_Par.log &

However, it is very weird that the intermediate bam had no alignment reads. From the log: Alignment command: bowtie2 -x CRISPRessoPooled_on_AMPLICONS_S106_Par/CUSTOM_BOWTIE2_INDEX -p 1 --end-to-end -N 0 --np 0 --mp 3,2 --score-min L,-5,-1.2000000000000002 -U CRISPRessoPooled_on_AMPLICONS_S106_Par/out.extendedFrags.fastq.gz 2>>CRISPRessoPooled_on_AMPLICONS_S106_Par/CRISPRessoPooled_RUNNING_LOG.txt | samtools view -bS - > CRISPRessoPooled_on_AMPLICONS_S106_Par/CRISPResso_AMPLICONS_ALIGNED.bam 13616521 reads; of these: 13616521 (100.00%) were unpaired; of these: 13616521 (100.00%) aligned 0 times 0 (0.00%) aligned exactly 1 time 0 (0.00%) aligned >1 times 0.00% overall alignment rate

The AMPLICON file: $ cat ../AMPLICONS_FILE.txt aroundExon4_1 TGGAGTACGTGTGGATCTTAAACTTGTGTTTGCTGCGGGGGTCCTGGAGCCAAGAG TTAAACTTGTGTTTGCTGCG aroundExon5_9 GCCACCTACCGAGGACAATGAGGACGTCCCTGTCGATGTGGGCCTGGATGTAGATG GAGGACGTCCCTGTCGATGT deletion CTACCGAGGACAATGAGGACGTCCCTGTCGATGCGGGGGTCCTGGAGCCAAGAGAA NA

What makes it weird is that, I can grep the whole "TGGAGTACGTGTGGATCTTAAACTTGTGTTTGCTGCGGGGGTCCTGGAGCCAAGAG" from the intermediate out.extendedFrags.fastq.gz and found many exact matches. How came the bowtie alignment turned out having nothing mapped? Could you please provide some insights here? Or shall I use a mixed or genome mode to run it?

Many thanks!

kclem commented 2 months ago

How long are your reads? Perhaps bowtie has a hard time aligning reads to references shorter than the read length. Yes, you could try running in mixed mode or setting a larger amplicon length in your AMPLICONS_FILE.txt.

You should also check that your AMPLICONS_FILE.txt is correct. In your CRISPRessoPooled command you reference AMPLICONS_FILE.txt, but you cat ../AMPLICONS_FILE.txt.

tangxj98 commented 2 months ago

My read is 150bp. I will try to make a larger amplicon that's longer than the read length.

The previous AMPLICONS_FILE.txt looks like: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

aroundExon4_1 | TGGAGTACGTGTGGATCTTAAACTTGTGTTTGCTGCGGGGGTCCTGGAGCCAAGAG | TTAAACTTGTGTTTGCTGCG -- | -- | -- aroundExon5_9 | GCCACCTACCGAGGACAATGAGGACGTCCCTGTCGATGTGGGCCTGGATGTAGATG | GAGGACGTCCCTGTCGATGT deletion | CTACCGAGGACAATGAGGACGTCCCTGTCGATGCGGGGGTCCTGGAGCCAAGAGAA | NA

tangxj98 commented 2 months ago

Hi @kclem

I generated an amplicons_file.txt with 210 bp long using the same format as the above, which did help. I also upgraded to CRISPResso2 v2.3.1, solving the numpy error. However, I still had a problem died on amplicon #2. I ran the command prompted by the program.

CRISPResso -r1 CRISPRessoPooled_on_AMPLICONS_S106_Par/AMPL_deletion.fastq.gz -a CACCCCTCCGCCACCCTCCCCTTCTCAATCACACAGGTCAAGCAAGGTCAGGAGCCAGTGGAGCCCCAGGGCCACCTACCGAGGACAATGAGGACGTCCCTGTCGTGCGGGGGTCCTGGAGCCAAGAGAAACAACAGAAGGAGCCAGCGTCAGTGGAGCATGCCAACGGTTCATCCTACATCGAGAGGCACTGGAGAACTCACCAAGAAC -o CRISPRessoPooled_on_AMPLICONS_S106_Par --name deletion --trimmomatic_command None --quantification_window_size 1 --fastp_options_string " --disable_adapter_trimming --disable_trim_poly_g --disable_quality_filtering --disable_length_filtering" --needleman_wunsch_gap_incentive 1 --n_processes 1 --flexiguide_homology 80 --min_paired_end_reads_overlap 10 --prime_editing_gap_open_penalty -50 --min_average_read_quality 0 --aln_seed_count 5 --quantification_window_center -3 --prime_editing_pegRNA_scaffold_min_match_length 1 --flash_command None --conversion_nuc_from C --default_min_aln_score 60 --conversion_nuc_to T --flexiguide_seq None --needleman_wunsch_gap_extend -2 --needleman_wunsch_gap_open -20 --keep_intermediate --min_single_bp_quality 0 --exclude_bp_from_right 15 --aln_seed_min 2 --prime_editing_gap_extend_penalty 0 --plot_window_size 20 --exclude_bp_from_left 15 --max_rows_alleles_around_cut_to_plot 50 --min_bp_quality_or_N 0 --aln_seed_len 10 --verbosity 3 --max_paired_end_reads_overlap 160 --needleman_wunsch_aln_matrix_loc EDNAFULL --fastp_command fastp --min_frequency_alleles_around_cut_to_plot 0.2 --prime_editing_pegRNA_extension_quantification_window_size 5 --config_file None --amplicon_seq CACCCCTCCGCCACCCTCCCCTTCTCAATCACACAGGTCAAGCAAGGTCAGGAGCCAGTGGAGCCCCAGGGCCACCTACCGAGGACAATGAGGACGTCCCTGTCGTGCGGGGGTCCTGGAGCCAAGAGAAACAACAGAAGGAGCCAGCGTCAGTGGAGCATGCCAACGGTTCATCCTACATCGAGAGGCACTGGAGAACTCACCAAGAAC

The error message was: if N_TOTAL == 0: raise CRISPRessoShared.NoReadsAlignedException('Error: No alignments were found')

I checked the intermediate file: $ samtools view -F4 CRISPResso_AMPLICONS_ALIGNED.bam | cut -f3 | sort | uniq -c 1672 AMPL_aroundExon4_1 8419 AMPL_aroundExon5_9 8798 AMPL_deletion

Amplicon #2 obviously have a lot of mapped reads. Why did I get the error of "No alignments were found?"

kclem commented 2 months ago

I'm not sure exactly what the problem is. You could try:

1) reduce the default min alignment score to 0 --default_min_aln_score 0 - if your reads cover less than 60% of the reference they will be discarded. Setting this parameter to 0 might help.

2) Run with CRISPRessoWGS using the CRISPResso_AMPLICONS_ALIGNED.bam as input and defining a regions file based on the coordinates in your artificial chromosomes (AMPL_aroundExon4_1, AMPL_aroundExon5_9, and AMPL_deletion)

3) Run with --debug to see if there is anything else going on.

tangxj98 commented 2 months ago

Hi @kclem Using --default_min_aln_score 0 does eliminate the problem! I finished run without any further errors. Thank you very much for the help!

tangxj98 commented 2 months ago

Additional question in CRISPRessoAggregate. It seems that a lot of plots failed due to the same two errors:

  1. Plot error plot_nucleotide_quilt() missing 1 required positional argument: 'custom_colors', skipping plot
  2. ERROR: make_multi_report() missing 2 required positional arguments: 'crispresso_tool' and 'logger'

This doesn't seem a parameter that the user can set. Is there any quick fix for these errors?

Colelyman commented 2 months ago

Thanks @tangxj98, this bug has been fixed and the fix will be available in the next release. If you want to have the fix now, you can clone the repo (git clone https://github.com/pinellolab/CRISPResso2.git) and then move into the directory (cd CRISPResso2) and install it (pip install .). This should get you the version with these fixes!

tangxj98 commented 2 months ago

Thanks @tangxj98, this bug has been fixed and the fix will be available in the next release. If you want to have the fix now, you can clone the repo (git clone https://github.com/pinellolab/CRISPResso2.git) and then move into the directory (cd CRISPResso2) and install it (pip install .). This should get you the version with these fixes!

Thank you, @Colelyman and @kclem . I installed the v2.3.2, the problem in Aggregate is gone. Thanks!

Additional problems (sorry!) :

  1. I still have problem in CRISPRessoPooledWGSCompare. The error message is: ERROR: All values in table must be nonnegative. I checked in the code but not found the error message. Google search showed that it might be a problem in p-value calculation. This only happens in the analysis for the amplicon for deletion.

  2. Among the samples I processed, there is a wild type sample, which should not have any reads mapping to the deletion. But I did see ~8000 reads mapped to the WT samples. I checked the bam files and they looked pretty bad mapping quality and not really what I want. My deletion amplicon was made up with 105 bp on both arm as suggested by @kclem since my sequence reads are 150bp long. Is there any parameter that I can restrict / filter the mapping?

  3. I also tried to used CRISPRessoWGS on a bwa-mem aligned bam files. I got the following bugs:

INFO @ Thu, 12 Sep 2024 14:55:57 (0.0% done): Extracting reads in:chr16:24035416-24035624 and creating .bam file: CRISPRessoWGS_on_AC1_Par/ANALYZED_REGIONS/REGION_1.bam

~/crispresso2_env/lib/python3.8/site-packages/CRISPResso2/CRISPRessoWGSCORE.py:238: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead. new_df.loc[i] = extract_reads(df.iloc[i].copy()) INFO @ Thu, 12 Sep 2024 14:55:58 (0.0% done): Extracting reads in:chr16:24032041-24035624 and creating .bam file: CRISPRessoWGS_on_AC1_Par/ANALYZED_REGIONS/REGION_2.bam

~/crispresso2_env/lib/python3.8/site-packages/CRISPResso2/CRISPRessoWGSCORE.py:238: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead. new_df.loc[i] = extract_reads(df.iloc[i].copy()) CRITICAL @ Thu, 12 Sep 2024 14:56:05 (0.0% done):

ERROR: infer_objects() got an unexpected keyword argument 'copy'

I think this is a problem from pandas. Could you please let me know the pandas version working with your v2.3.2? My installed version is pandas 1.5.3 (py38h417a72b_0).

Colelyman commented 2 months ago

Thanks @tangxj98, would you mind making separate issues for 1 and 3? Also in those issues could you provide the output with the --debug flag?

As for point 2, could you provide an example of the sequences of the amplicons (including the wild type amplicon) and one of the aligned reads? If you still have the --default_min_aln_score 0 set, this is probably why you are seeing poor alignments. I would recommend checking out the --amplicon_min_alignment_score which allows you to set different minimum alignment scores for your amplicons. This could allow you to make the alignment score needed for the wild type allele to be more strict than the other alleles.

Hope this helps! Cole

kclem commented 2 months ago

Also, if you are aligning reads to 150bp on either side of the deletion, you may also get some WT reads aligning to one of the arms and not necessarily spanning the deletion. You may want to do another selection step where you only include reads that span the deletion.

Colelyman commented 3 weeks ago

Hi @tangxj98,

Just wanted to follow up to see if you are still running into these issues, if so please let us know!

Thanks, Cole