lucapinello / CRISPResso

Software pipeline for the analysis of CRISPR-Cas9 genome editing outcomes from sequencing data
Other
131 stars 55 forks source link

Error with FLASH for CRISPResso #24

Closed mtoetzl closed 6 years ago

mtoetzl commented 6 years ago

I tried several times running my already split paired-end reads with CRISPResso, unfortunately this is the result I get every time.

[Command used]: CRISPResso /Users/mtoetzl/anaconda/bin/CRISPResso -r1 3_HeLa_SG1_293817w_CA3_R1.fastq -r2 3_HeLa_SG1_293817w_CA3_R2.fastq -a CAAGGCTGAAATTGAGAATGAAGACTATAGTTATACAAAAGATGGAATAGGACTAGATTTGGAAAATTCTTTTAGTAACATTCTGTTATTTGTTCCTGAGTACTTAGACTTCATGCAGAATGGTAACTACTTTCTGATTTTTGTGAAGTCATGGAGCTTGAACACCTCTGGTCTGCGGATTACCACCTTGAGCTCCAATTTGTACAAAAGAGATATAACATCTGCAAAAGTCATGAATGCCACTGCTGCACTGGAGTTCCTCAAAGACATGAA -g GGTGGTAATCCGCAGACCAGAGG

[Execution log]: Estimating average read length... Merging paired sequences with Flash... [FLASH] ERROR: Maximum overlap (-97) cannot be less than the minimum overlap (4). Please make sure you have provided the read length and fragment length correctly. Or, alternatively, specify the minimum and maximum overlap manually with the --min-overlap and --max-overlap options. [FLASH] FLASH did not complete successfully; exiting with failure status (1) Merging error, please check your input.

ERROR: Flash failed to run, please check the log file.

lucapinello commented 6 years ago

Hi mtoetzl, Are the reads overlapping (we suggest at least 5bp)? If not unfortunately you cannot use CRISPResso since the reads when merged should fully cover the amplicon.

You have an amplicon that is long 273 bp so you need to have reads at least of 140bp.

mtoetzl commented 6 years ago

Hi lucapinello,

yes, the reads are overlapping by approximately 10bp

lucapinello commented 6 years ago

Can you share with me the fastq files? So I can reproduce the error.

mtoetzl commented 6 years ago

of course, thank you (i had to compress the two files)

Archive.zip

lucapinello commented 6 years ago

The fastq files are malformed/truncated the first line is a quality line and not a sequence id:

head 3_HeLa_SG1_293817w_CA3_R1.fastq AGBGH0FF1FEFHHHHHHHGHFGGGFF2EFA2DHHHFBGHCFHHHEDGFGHHCCBCA1B0/>/BD2BGF1//E@@/EFHGBGGFHG2B01B1B>DG2BFHHFEGHHHHHBBGF

Can you send me the exact files you were using? I need the entire files.

mtoetzl commented 6 years ago

this is the original raw data file that I used. I split the two paired end reads with a python algorithm provided by the MGH DNA core 3_HeLa_SG1_293817w_CA3.fastq.zip

lucapinello commented 6 years ago

You don't need to do that. For interleaved files like the ones provided by the MGH Core you can use the option: --split_paired_end as described in the manual. So the command is:

CRISPResso -r1 3_HeLa_SG1_293817w_CA3.fastq -a CAAGGCTGAAATTGAGAATGAAGACTATAGTTATACAAAAGATGGAATAGGACTAGATTTGGAAAATTCTTTTAGTAACATTCTGTTATTTGTTCCTGAGTACTTAGACTTCATGCAGAATGGTAACTACTTTCTGATTTTTGTGAAGTCATGGAGCTTGAACACCTCTGGTCTGCGGATTACCACCTTGAGCTCCAATTTGTACAAAAGAGATATAACATCTGCAAAAGTCATGAATGCCACTGCTGCACTGGAGTTCCTCAAAGACATGAA -g GGTGGTAATCCGCAGACCAGAGG --split_paired_end

I am attaching the output.

CRISPResso_on_3_HeLa_SG1_293817w_CA3.zip

Can you please contact the MGH Core and let them know about this? Thanks.

mtoetzl commented 6 years ago

thanks a lot for your help. you mean letting them know about the CRISPResso command that saves you the hassle of splitting the reads?

lucapinello commented 6 years ago

Yep, since many users had exactly your problem and their way to split reads probably has some problems.

I wrote in the documentation:

--split_paired_end: Splits a single fastq file contating paired end reads in two files before running CRISPResso (default: False). If you got your data from the MGH sequencing core in Boston ( https://dnacore.mgh.harvard.edu/new-cgi-bin/site/pages/crispr_sequencing_main.jsp), you need this option!

I am adding this also here so other users can find this solution.

On Thu, Aug 24, 2017 at 9:53 PM, mtoetzl notifications@github.com wrote:

thanks a lot for your help. you mean letting them know about the CRISPResso command that saves you the hassle of splitting the reads?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/lucapinello/CRISPResso/issues/24#issuecomment-324739340, or mute the thread https://github.com/notifications/unsubscribe-auth/ABB_6qf9wlQsrF3ZYACPsIKA0o7V_M6vks5sbdSigaJpZM4PAkZL .

mtoetzl commented 6 years ago

Hi lucapinello,

after analyzing my CRISRPseq data with CRISPResso, I noticed that only a minor fraction of my cell population does have modified sequences (15-30%) after the CRISPR KO. So I revisited the CRISPRvariants file which is provided by the MGH DNA core facility CRISPRseq service and it seems that the real fraction of edited sequence by my KOs is around 90%. Since the CRISPRvariants output file is not really presentable I was using CRISPResso to do a nice visualization of the results. So now I am wondering whether I am using the pipeline correctly, because the difference from the CRISPResso output and the text file from the MGH core is very striking (30% editing vs 90% editing)

lucapinello commented 6 years ago

I am not familiar with their pipeline, but the discrepancy it may be related to the default strict parameters we have to call edited sites.

We use by default a window of 1bp around the cleavage site and exclude the ends of the reads that typically contains artifact.

If you want to disable those filters (I strongly discourage to do that) you can use the option -w 0 --exclude_bp_from_left 0 and --exclude_bp_from_right 0

In addition you can also look to the allele fq file where you can see all the alleles called as edited by CRISPResso.

Regards,

Luca

On Sat, Sep 2, 2017 at 2:26 PM, mtoetzl notifications@github.com wrote:

Hi lucapinello,

after analyzing my CRISRPseq data with CRISPResso, I noticed that only a minor fraction of my cell population does have modified sequences (15-30%) after the CRISPR KO. So I revisited the CRISPRvariants file which is provided by the MGH DNA core facility CRISPRseq service and it seems that the real fraction of edited sequence by my KOs is around 90%. Since the CRISPRvariants output file is not really presentable I was using CRISPResso to do a nice visualization of the results. So now I am wondering whether I am using the pipeline correctly, because the difference from the CRISPResso output and the text file from the MGH core is very striking (30% editing vs 90% editing)

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/lucapinello/CRISPResso/issues/24#issuecomment-326761755, or mute the thread https://github.com/notifications/unsubscribe-auth/ABB_6ulXDTd_prMjNezXTAdz7nM0IdJbks5seZ3TgaJpZM4PAkZL .

mtoetzl commented 6 years ago

thanks for your advice and assistance - I have one more question regarding the output of the analysis: if I compare Unmodified_NHEJ_pie_chart to Alleles_around_cut_site_for_GGTGGTAATCCGCAGACCAGAGG I noticed that the fractions of unmodified alleles in the pie chart do not completely correspond with the percentage of the alleles. So far I didn't manage to come up with an explanation for that. 2.Unmodified_NHEJ_pie_chart.pdf 9.Alleles_around_cut_site_for_GGTGGTAATCCGCAGACCAGAGG.pdf

lucapinello commented 6 years ago

Yes this is normal since in the Alleles_around_cut_site_for_GGTGGTAATCCGCAGACCAGAGG.pdf we report by default alleles with frequency >=0.2% and up to 50 rows . You can see the full list in the file: Alleles_frequency_table_around_cut_site_for_GGTGGTAATCCGCAGACCAGAGG.txt.

You can also change this behaviour with those two parameters:

--min_frequency_alleles_around_cut_to_plot MIN_FREQUENCY_ALLELES_AROUND_CUT_TO_PLOT Minimum % reads required to report an allele in the alleles table plot. (default: 0.2) --max_rows_alleles_around_cut_to_plot MAX_ROWS_ALLELES_AROUND_CUT_TO_PLOT Maximum number of rows to report in the alleles table plot. (default: 50)