Alignment issues - Githubissues

jsromanowski commented 8 months ago

Describe the bug I am trying to align CRISPR-edited genomic amplicons to my amplicon reference template. When I align these amplicons to its reference using minimap2, I get 10,000+ reads aligning to the reference and I can see the alignment on IGV. With CRISPresso2, I am barely getting 100 reads aligning. If the alignment efficiency increases, this program could be nearly perfect for my purposes. What is causing this and how can I overcome this obstacle?

To reproduce CRISPResso --fastq_r1 D1_NHEJ_FLsorted.fastq.gz --amplicon_seq TCTAGATCATAATCAGCCATACCACATTTGTAGAGGTTTTACTTGCTTTAAAAAACCTCCCACACCTCCCCCTGAACCTGAAACATAAAATGAATGCAATTGTTGTTGTTAACTTGTTTATTGCAGCTTATAATGGTTACAAATAAAGCAATAGCATCACAAATTTCACAAATAAAGCATTTTTTTCACTGCATTCTAGTTgtggtttgtccaaactcatcaatgtatcttgCCGCTGAGCCGGCCGCTAAAATAAACAACATTATCAGTTAGCCTTTTACATAACATAGAATAAAAATTTCTCATGTTTCCCGTTCGCCCAAGGTAAACTTGCTACGCAACGAAATGTAGTGAGTGAAGCCCGCCAAATGTCCGTCGTGCCCATTCAGAACAAAACGCACAAGTACACATCACACGTGTAGGCTTCAGTAGCATAATCCGTTCACCAGAACAGCGCTACTT

Debug output

                                                                        ~~~CRISPResso 2~~~
                                                 -Analysis of genome editing outcomes from deep sequencing data-

                                                                                 _
                                                                                '  )
                                                                                .-'
                                                                               (____
                                                                            C)|     \
                                                                              \     /
                                                                               \___/

                                                                   [CRISPResso version 2.2.14]

[Note that starting in version 2.3.0 FLASh and Trimmomatic will be replaced by fastp for read merging and trimming. Accordingly, the --flash_command and --trimmomatic_command parameters will be replaced with --fastp_command. Also, --trimmomatic_options_string will be replaced with --fastp_options_string.

Also in version 2.3.0, when running CRISPRessoPooled in mixed-mode (amplicon file and genome are provided) the default behavior will be as if the --demultiplex_only_at_amplicons parameter is provided. This change means that reads and amplicons do not need to align to the exact locations.] [For support contact k.clement@utah.edu or support@edilytics.com]

WARNING @ Thu, 29 Feb 2024 14:38:41: Folder CRISPResso_on_D1_NHEJ_FLsorted already exists.

INFO @ Thu, 29 Feb 2024 14:38:41: Computing quantification windows

INFO @ Thu, 29 Feb 2024 14:38:42: Aligning sequences...

INFO @ Thu, 29 Feb 2024 14:38:42: Processing reads; N_TOT_READS: 0 N_COMPUTED_ALN: 0 N_CACHED_ALN: 0 N_COMPUTED_NOTALN: 0 N_CACHED_NOTALN: 0

INFO @ Thu, 29 Feb 2024 14:39:50: Processing reads; N_TOT_READS: 10000 N_COMPUTED_ALN: 52 N_CACHED_ALN: 0 N_COMPUTED_NOTALN: 9915 N_CACHED_NOTALN: 33

INFO @ Thu, 29 Feb 2024 14:40:05: Processing reads; N_TOT_READS: 20000 N_COMPUTED_ALN: 70 N_CACHED_ALN: 39 N_COMPUTED_NOTALN: 11650 N_CACHED_NOTALN: 8241

INFO @ Thu, 29 Feb 2024 14:40:05: Finished reads; N_TOT_READS: 23524 N_COMPUTED_ALN: 70 N_CACHED_ALN: 70 N_COMPUTED_NOTALN: 11650 N_CACHED_NOTALN: 11734

INFO @ Thu, 29 Feb 2024 14:40:05: Done!

INFO @ Thu, 29 Feb 2024 14:40:05: Quantifying indels/substitutions...

INFO @ Thu, 29 Feb 2024 14:40:05: Done!

INFO @ Thu, 29 Feb 2024 14:40:05: Calculating allele frequencies...

INFO @ Thu, 29 Feb 2024 14:40:05: Done!

INFO @ Thu, 29 Feb 2024 14:40:05: Saving processed data...

INFO @ Thu, 29 Feb 2024 14:40:05: Making Plots...

INFO @ Thu, 29 Feb 2024 14:40:06: Begin processing plots for amplicon Reference

/home/joeromanowski/miniconda3/envs/crispresso2_env/lib/python3.10/site-packages/CRISPResso2/CRISPRessoCORE.py:3578: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing errors and catch exceptions explicitly instead modification_percentage_summary_df = pd.DataFrame(mod_pcts, columns=colnames).apply(pd.to_numeric, errors='ignore') /home/joeromanowski/miniconda3/envs/crispresso2_env/lib/python3.10/site-packages/CRISPResso2/CRISPRessoPlot.py:188: FutureWarning: Series.getitem treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use ser.iloc[pos] ins_pct = float(mod_pct_df_indexed.loc[sampleName,'Insertions_Left'][pos_ind-2]) INFO @ Thu, 29 Feb 2024 14:40:13: Done!

INFO @ Thu, 29 Feb 2024 14:40:13: Done!

INFO @ Thu, 29 Feb 2024 14:40:13: Removing Intermediate files...

INFO @ Thu, 29 Feb 2024 14:40:13: Analysis Complete!

Thanks,

Joe

kclem commented 8 months ago

Hi @jsromanowski,

I hope you find CRISPResso to be intuitive and helpful in your research.

Are you analyzing amplicon sequencing libraries? Your provided amplicon is 462bp long. Are your reads 462bp long? CRISPResso will discard reads with < 60% of bases aligning, so if your reads are 277bp or less (277/462 = .60) they won't be aligned.

To have your reads aligned try: 1) use the correct amplicon sequence corresponding to the read length 2) set the 60% threshold lower by running e.g. with --default_min_aln_score 0

If you're not using amplicon sequencing and some other library prep (e.g. hybrid selection) you may also try passing the aligned bam to CRISPResso via CRISPRessoWGS.

I hope that helps!

kclem commented 8 months ago

You can also try running with the experimental --auto command which will automatically detect the amplicon sequence based on the most frequent read in your sample.

jsromanowski commented 8 months ago

Hello @kclem, CRISPResso2 certainly seems promising, however I am a bit confused on my results.

To answer your earlier question - my amplicon is ~1500 bp long, however I noticed many smaller artifacts in my sequencing file that do not contain my CRISPR target region of interest, and therefore muddled my interpretation of editing efficiency.

To overcome this, I used minimap2 to align my reads to a shorter reference encompassing my CRISPR target locus and used samtools to create a bam file of this alignment (462 bp). I then converted this bam file containing my aligned sequences to a .fastq.gz for CRISPResso2 analysis, however I am noticing two things 1) there are more total reads showing on CRISPRessso2 than when I visualize my .bam file with IGV and 2) when I set my -asam parameter above 20 or 30 I lose a significant amount of reads (roughly 10-20-fold), but these lost reads are shown on IGV with only small indels likely from nanopore sequencing noise.

How can I change my CRISPResso2 inputs/parameters so my editing results are reflected accurately?

Any help to overcome these challenges would be great,

Joe

kclem commented 8 months ago

Hi @jsromanowski,

I'd use your minimap2-aligned bam file as input to CRISPRessoWGS. For the region file, I'd specify 20bp up- and downstream from your CRISPR target locus. This will select reads that cover the 40bp, and provide information about these reads. Note that the current version will discard reads if the entire 40bp region is deleted, but our next release will include a flag to include the read but to show the entire region as deleted. You can find that branch here: https://github.com/pinellolab/CRISPResso2/tree/wgs-whole-region-deleted

Alternately, you can pass the aligned bam to CRISPResso as --bam_input. This should report all of the indels for all of the aligned reads.

Feel free to reach out at k.clement@utah.edu if you would like me to take a look.

Thanks,

Kendell

jsromanowski commented 8 months ago

Hi Kendell,

The --default_min_aln_score suggestion certainly helped! I'll give the CRISPRessoWGS and the -bam_input a shot, as well.

Thank you very much you've been a great help and very responsive!

-Joe

pinellolab / CRISPResso2

Alignment issues #383