lucapinello / CRISPResso

Software pipeline for the analysis of CRISPR-Cas9 genome editing outcomes from sequencing data
Other
131 stars 55 forks source link

Running with non-overlapping reads #2

Closed mdshw5 closed 8 years ago

mdshw5 commented 8 years ago

Is it possible to run CRISPresso with reads that originate from a large amplicon (a few kilobases) that has been fragmented and then sequenced? These reads would not map to only the ends of the target amplicon...

lucapinello commented 8 years ago

Hi Matt, unfortunately there is no utility in CRISPResso that covers directly this case.

On the other hand one strategy you may want to use in this case is the following:

1) Map your reads with BWA-mem or Bowtie2 to the large amplicon to create a bam file 2) Create a description file for CRISPRessoWGS containing non overlapping windows of 40-50bp to cover the large amplicon (this file is almost a bed file and you can see the format in the help) 3) Run CRISPRessoWGS using the files created in 1) and 2) and with the same genome used in 1)

After this you will get a nice summary table with NHEJ %, unmodified% and HDR% for each of those windows and you can for example easily plot the profile in the whole amplicon. Also you will get individual folder with the full CRISPResso analysis for each window.

Please let me know if this is clear enough, if not happy to help you with this.

mdshw5 commented 8 years ago

Thanks for the response - it's very clear.

lucapinello commented 8 years ago

Great!

miikaelf commented 7 years ago

Hi Luca, Thank you for the great work and for the explanation above! I have the same problem as mdhw5 - have a long amplicon that has been fragmented and sequenced. I mapped the reads to the reference amplicon sequence with Bowtie2, and converted and sorted the resulting SAM file with samtools. As per Bowtie2 output, the majority of the reads are definitely mapped. However, when I run CRISPRessoWGS on the mapped reads, description file and the amplicon reference sequence, I get the following output:

The index for the reference fasta file is already present! Skipping generation. The region [R1] has not enough reads (0) mapped to it! Skipping the running of CRISPResso! ....

the latter continues for all regions, and at the end it produces no output, as it says no regions are mapped. I've verified the input files - they should be correct. Dependencies work as far as I can tell.

Any thoughts on this error?

See attached files. Run command is:

CRISPRessoWGS -b Jak2_sorted_correct.bam -f Jak2_description2.txt -r Jak2_reference.fasta

Jak2_description2.txt

miikaelf commented 7 years ago

Files zipped

Crispresso_files.tar.gz

lucapinello commented 7 years ago

I think the problem is that you are using a custom index and a not properly formatted fasta file.

In your description file for the regions I see chr19 but in the fasta file you sent me there is no feature called chr19. I think you need to fix this.

Alternatively if this amplicon is a piece of known reference genome, you may want instead to align the reads with bowtie2 to the reference genome and provide: 1) the "chunks" as coordinates in that reference genome. 2) the fasta file used to build the bowtie2 index

Hope this is helpful.

Best,

Luca

miikaelf commented 7 years ago

Many thanks for the input! For the fasta file - how would you format it to include chr19 as a feature? Apologies if this should be assumed knowledge - but I've not found any articles on this. When I tried it with the sequence downloaded from the genomic context from NCBI (which should be correctly formatted), but it produced the same output. I played around changing the names in the description and fasta file, and same result. See file attached.

For the alternative method (alignign the reads with bowtie to referece genome and then providing the reference genome as a fasta file - on the website), I've aligned it to the mouse mm10 build using bowtie2 in Galaxy. However, for the fasta file, you specify downloading the "genome.fa.gz" file as reference genome from USCS. For human genome, this file exist (hg38.fa.gz), but there is no such equivalent for the mouse genome, for any of the builds (see here: http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/). For the mouse, which file should we use as the reference?

Again, apologies if this should be assumed knowledge - but I think others may come across the same problem when doing CRISPRessoWGS on mouse. Jak2_reference.fasta.tar.gz

lucapinello commented 7 years ago

1) For FASTA file you need to add on top of the file something like

SEQUENCE_ID AATATTATATA... (your sequence)

Take a look here for the format http://www.bioinformatics.nl/tools/crab_fasta.html and in particular to the indentification code

In your case it should be >chr19, but this depends on the way you built the bowtie2 index

2) CRISPRessoWGS was developed with a reference genome in mind so I will first try with that.

Now to obtain a mm10 fasta sequence you can download the utility TwoBitTofa from here:

http://hgdownload.soe.ucsc.edu/admin/exe/ (several folder for different systems)

and make it executable with

chmod +x twoBitTofa

Then download the mm10 genome in 2bit format:

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit

Finally type the command:

./twoBitToFa mm10.2bit mm10.fa

More info on 2bit files here: https://genome.ucsc.edu/goldenPath/help/twoBit.html

Hope this is helpful.

Best,

Luca

I see, one quick solution is to download the 2bit file and convert it to fasta with the command

On Tue, Jun 27, 2017 at 12:33 PM, miikaelf notifications@github.com wrote:

Many thanks for the input! For the fasta file - how would you format it to include chr19 as a feature? Apologies if this should be assumed knowledge - but I've not found any articles on this. When I tried it with the sequence downloaded from the genomic context from NCBI (which should be correctly formatted), but it produced the same output. I played around changing the names in the description and fasta file, and same result. See file attached.

For the alternative method (alignign the reads with bowtie to referece genome and then providing the reference genome as a fasta file - on the website), I've aligned it to the mouse mm10 build using bowtie2 in Galaxy. However, for the fasta file, you specify downloading the "genome.fa.gz" file as reference genome from USCS. For human genome, this file exist (hg38.fa.gz), but there is no such equivalent for the mouse genome, for any of the builds (see here: http://hgdownload.soe.ucsc.edu /goldenPath/mm10/bigZips/). For the mouse, which file should we use as the reference?

Again, apologies if this should be assumed knowledge - but I think others may come across the same problem when doing CRISPRessoWGS on mouse. Jak2_reference.fasta.tar.gz https://github.com/lucapinello/CRISPResso/files/1106098/Jak2_reference.fasta.tar.gz

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/lucapinello/CRISPResso/issues/2#issuecomment-311413492, or mute the thread https://github.com/notifications/unsubscribe-auth/ABB_6uct3SbQFsMpCfyJ0rE8Dad-OmlSks5sIS7cgaJpZM4JHoOS .