maximilianh / crisporPaper

Code for the CRISPOR article, all data and code to create figures and analysis
41 stars 20 forks source link

question about bwa search for off-target #2

Open samuelecancellieri opened 5 years ago

samuelecancellieri commented 5 years ago

Hi, I am a phd student and i'm currently working on developing a software for crispr/cas in-silico off-target prediction. I am testing some software before the start and i'm trying also short aligners as BWA, but i'm encountering some problems. When i search using BWA, for example with a single guide (e.g. GAGTCCGAGCAGAAGAAGAANGG), seems that the software can't found all the possible off-targets that other software found (i.e cas-offinder), I check the settings and all seems to work fine, but i can't avoid this strange behaviour.

I hope the question doesn't bother you, but how you deal with this behaviour in crispor?

Thanks a lot and sorry for the long question.

maximilianh commented 5 years ago

Hi, I would like to know more details about this problem. Notably, how many offtargets were missed by BWA, where they are and how different they were from the target.

I am aware of one case where this issue can happen, two offtargets that overlap, BWA will only find one of them. For all practical purposes that’s fine, as if you test one, you’ll test the other one. But it’s well possible that I haven’t compared enough off targets and have missed serious cases.

Do you have other examples of missed off targets ?

Thanks Max

On Wed 17 Apr 2019 at 02:02, Samuele Cancellieri notifications@github.com wrote:

Hi, I am a phd student and i'm currently working on developing a software for crispr/cas in-silico off-target prediction. I am testing some software before the start and i'm trying also short aligners as BWA, but i'm encountering some problems. When i search using BWA, for example with a single guide (e.g. GAGTCCGAGCAGAAGAAGAANGG), seems that the software can't found all the possible off-targets that other software found (i.e cas-offinder), I check the settings and all seems to work fine, but i can't avoid this strange behaviour.

I hope the question doesn't bother you, but how you deal with this behaviour in crispor?

Thanks a lot and sorry for the long question.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporPaper/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TYEWD54hLdOjnNJ8F0jOBasxiHcwks5vhuM-gaJpZM4c0jNJ .

samuelecancellieri commented 5 years ago

hi, thanks for the very fast reply.

i'm trying using the EMX1 guide with NGG PAM (GAGTCCGAGCAGAAGAAGAANGG) on the hg19 reference genome. i try a search using cas-offinder set with 4 mismatches and no bulges, and i found 298 off-targets and 1 on-target.

i call BWA with these commands: bwa aln -o 0 -m 1200000000000 -n 4 -k 4 -N -l 20 hg19.fa emx1.fa > 4mm.bwa.sai bwa samse -n 600000000000000000 hg19.fa 4mm.bwa.sai emx1.fa | perl xa2multi.pl > 4mm.bwa.sam samtools view -Sb 4mm.bwa.sam > 4mm.bwa.bam bedtools bamtobed -i 4mm.bwa.bam > 4mm.bwa.bed

the emx1.fa is this file:

EMX1_NGG GAGTCCGAGCAGAAGAAGAANGG

and i found only 26 off-targets and 1 on-target.

also some off-targets are found only by BWA and not found by cas-offinder.

i attach all the result files so you can check, maybe i'm doing something wrong.

BWA_results.zip CAS-OFFINDER_emx1.4mm.NGG.zip

thank you a lot. Samuele.

maximilianh commented 5 years ago

How many offtargets does crispor find when you set the mit score cutoff to 0?

On Wed 17 Apr 2019 at 06:08, Samuele Cancellieri notifications@github.com wrote:

hi, thanks for the very fast reply.

i'm trying using the EMX1 guide with NGG PAM (GAGTCCGAGCAGAAGAAGAANGG) on the hg19 reference genome. i try a search using cas-offinder set with 4 mismatches and no bulges, and i found 298 off-targets and 1 on-target.

i call BWA with these commands: bwa aln -o 0 -m 1200000000000 -n 4 -k 4 -N -l 20 hg19.fa emx1.fa > 4mm.bwa.sai bwa samse -n 600000000000000000 hg19.fa 4mm.bwa.sai emx1.fa | perl xa2multi.pl > 4mm.bwa.sam samtools view -Sb 4mm.bwa.sam > 4mm.bwa.bam bedtools bamtobed -i 4mm.bwa.bam > 4mm.bwa.bed

the emx1.fa is this file:

EMX1_NGG GAGTCCGAGCAGAAGAAGAANGG

and i found only 26 off-targets and 1 on-target.

also some off-targets are found only by BWA and not found by cas-offinder.

i attach all the result files so you can check, maybe i'm doing something wrong.

BWA_results.zip https://github.com/maximilianh/crisporPaper/files/3089731/BWA_results.zip CAS-OFFINDER_emx1.4mm.NGG.zip https://github.com/maximilianh/crisporPaper/files/3089733/CAS-OFFINDER_emx1.4mm.NGG.zip

thank you a lot. Samuele.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporPaper/issues/2#issuecomment-484077208, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TVA9lPUUYjbkpXxKEqc67vcDLMEGks5vhxzrgaJpZM4c0jNJ .

samuelecancellieri commented 5 years ago

How many offtargets does crispor find when you set the mit score cutoff to 0? On Wed 17 Apr 2019 at 06:08, Samuele Cancellieri @.***> wrote: hi, thanks for the very fast reply. i'm trying using the EMX1 guide with NGG PAM (GAGTCCGAGCAGAAGAAGAANGG) on the hg19 reference genome. i try a search using cas-offinder set with 4 mismatches and no bulges, and i found 298 off-targets and 1 on-target. i call BWA with these commands: bwa aln -o 0 -m 1200000000000 -n 4 -k 4 -N -l 20 hg19.fa emx1.fa > 4mm.bwa.sai bwa samse -n 600000000000000000 hg19.fa 4mm.bwa.sai emx1.fa | perl xa2multi.pl > 4mm.bwa.sam samtools view -Sb 4mm.bwa.sam > 4mm.bwa.bam bedtools bamtobed -i 4mm.bwa.bam > 4mm.bwa.bed the emx1.fa is this file: EMX1_NGG GAGTCCGAGCAGAAGAAGAANGG and i found only 26 off-targets and 1 on-target. also some off-targets are found only by BWA and not found by cas-offinder. i attach all the result files so you can check, maybe i'm doing something wrong. BWA_results.zip https://github.com/maximilianh/crisporPaper/files/3089731/BWA_results.zip CAS-OFFINDER_emx1.4mm.NGG.zip https://github.com/maximilianh/crisporPaper/files/3089733/CAS-OFFINDER_emx1.4mm.NGG.zip thank you a lot. Samuele. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TVA9lPUUYjbkpXxKEqc67vcDLMEGks5vhxzrgaJpZM4c0jNJ .

i have not tried crispor at the moment, you think the problem is the score? because i try only using BWA without any script or modifications to the code.

maximilianh commented 5 years ago

Yes, you're making a mistake here. BWA does not understand the NGG at the end, it doesn't understand "N". remove the NGG, re-run BWA, use the BED file to remove all hits with the suffix NGG and it should be almost identical to casofffinder (with the exception of overlapping off-targets).

Let me know if something doesn't work as expected.

samuelecancellieri commented 5 years ago

Yes, you're making a mistake here. BWA does not understand the NGG at the end, it doesn't understand "N". remove the NGG, re-run BWA, use the BED file to remove all hits with the suffix NGG and it should be almost identical to casofffinder (with the exception of overlapping off-targets). Let me know if something doesn't work as expected.

I understand, so i need to call BWA to aln the guide on the genome, then filter out all the sequence with the correct PAMs. i will try.

thanks for the reply and the help.

Samuele

maximilianh commented 5 years ago

I imagine you thought that it'd work because bowtie or another aligner understands wildcards. BWA doesn't and I had some explanation but I totally forgot, sorry. The filtering is not super slow.

If you happen to know any other aligner that could be used here, I'm curious. bowtie definitely doesn't work. There must be tons of others that do work, it's just that I never spent enough time playing with all these aligners out there...

On Wed, Apr 17, 2019 at 6:30 AM Samuele Cancellieri < notifications@github.com> wrote:

Yes, you're making a mistake here. BWA does not understand the NGG at the end, it doesn't understand "N". remove the NGG, re-run BWA, use the BED file to remove all hits with the suffix NGG and it should be almost identical to casofffinder (with the exception of overlapping off-targets). Let me know if something doesn't work as expected. … <#m3458468983826338902>

I understand, so i need to call BWA to aln the guide on the genome, then filter out all the sequence with the correct PAMs. i will try.

thanks for the reply and the help.

Samuele

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporPaper/issues/2#issuecomment-484090573, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TbPs1t1OjCP8cJXp2hB07DmjEFzvks5vhyIIgaJpZM4c0jNJ .

samuelecancellieri commented 5 years ago

I imagine you thought that it'd work because bowtie or another aligner understands wildcards. BWA doesn't and I had some explanation but I totally forgot, sorry. The filtering is not super slow. If you happen to know any other aligner that could be used here, I'm curious. bowtie definitely doesn't work. There must be tons of others that do work, it's just that I never spent enough time playing with all these aligners out there... On Wed, Apr 17, 2019 at 6:30 AM Samuele Cancellieri < @.***> wrote: Yes, you're making a mistake here. BWA does not understand the NGG at the end, it doesn't understand "N". remove the NGG, re-run BWA, use the BED file to remove all hits with the suffix NGG and it should be almost identical to casofffinder (with the exception of overlapping off-targets). Let me know if something doesn't work as expected. … <#m3458468983826338902> I understand, so i need to call BWA to aln the guide on the genome, then filter out all the sequence with the correct PAMs. i will try. thanks for the reply and the help. Samuele — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#2 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/AAS-TbPs1t1OjCP8cJXp2hB07DmjEFzvks5vhyIIgaJpZM4c0jNJ .

hi, i'm not very confident with alignment software because i used them only few times during my master degree. But now, i'm currently developing a software that is created for crispr off-target prediction, so if you want to test it when will be ready, it will be a pleasure for me. I have another question, i try to use BWA as you suggest, so removing the PAM from the fasta file, and using only the guide, but now i have a file with all the possibile alignment of the guide to the genome, and the output file does not contain the sequence extracted from the genome, so how i can filter the sequence for the ones ending with NGG??

image

this is the output file from BWA.

thank you a lot, i hope i'm not bothering you too much.

Samuele

maximilianh commented 5 years ago

If you don't know how to extract sequence from the genome, have a look at the crispor.py source code. You can find the commands there. It's a very common problem in genomics.

samuelecancellieri commented 5 years ago

thank you. i will check the code.