mckennalab / FlashFry

FlashFry: The rapid CRISPR target site characterization tool
Other
64 stars 10 forks source link

unexpected result #31

Closed Huanle closed 2 years ago

Huanle commented 2 years ago

Hi @aaronmck ,

I ran into an issue that I do not expect and can not explain. I attached an image containing the relevant result. As you can see there is a 0-mismatch off-target in the result. Interestingly, it happens to be within my target gene.

So I quickly checked the sequence using bedtools getfasta. (It seems coordinates in flashfry results are 1-based) I got these:

>chr3:52211336-52211359
ACACCGTACGGTCCTATGCTGCT
>chr3:52211337-52211360
CACCGTACGGTCCTATGCTGCTG
>chr3:52211338-52211361
ACCGTACGGTCCTATGCTGCTGG
>chr3:52211339-52211362
CCGTACGGTCCTATGCTGCTGGC

This indicates that the 0-mismatch off-target is supposed to be the target. Am I right? If so, then this result from flashfry output is incorrect.

Can you please help me to sort this out? Thanks a lot!

image

aaronmck commented 2 years ago

Thanks for the report Huanle! I'm trying to figure this out, it certainly looks like a bug. I'm making a simple test case to track it down, as I can't clearly see where the +1 offset is getting added to the off-target sequence positions. Hopefully I'll find the issue today and update with a new version.

aaronmck commented 2 years ago

I can't seem to reproduce this with a simple example genome (the off-target positions are correct and match the in-genome target). Could you let me know what genome you're using? I can try to see if I can recreate your exact issue.

I also tried with a single guide against the mouse genome (ROSA26 primer CTAAAGAAGAGGCTGTGCTTTGG) and it got the correct zero-based positions. Are you running all of chromosome 3 against itself here?

aaronmck commented 2 years ago

I found the above guide in human HG38 for your entry:

chr3:52211338-52211361 ACCGTACGGTCCTATGCTGCTGG

Which is the zero-index representation of the right position that I get when BLAT'ing the sequence at NCBI:

Screen Shot 2022-07-07 at 9 34 01 PM

So the off-target should be at position 52211338; I'm not sure why your target is off by one position (52211337), how are you setting up your input fasta file to find targets?

Huanle commented 2 years ago

Hi @aaronmck ,

thanks a lot for your quick response. You are right that this guide is designed using hg38 as the reference genome.

Below are the commands I used for the guide designing and off-target prediction and I attached the input files (including all but the indexed database) used in these commands. Hope this helps you and ultimately me to sort it out.

Best, Huanle

exe="../flashfry/FlashFry-assembly-1.12.jar"
db="../hg38_cas9**NRG**_database" 
fasta="test.fa" 

java -Xmx8g -jar ${exe} \
 discover \
 --flankingSequence=10 \
 --maxMismatch 4 \
 --maximumOffTargets=4000    \
 --positionOutput \
 --database ${db} \
 --fasta ${fa} \
 --output ${out}

java -Xmx8g -jar ${exe} \
    score \
    --database ${db} \
    --includeOTs \
    --input ${out} \
    --output ${out}.score \
    --inputAnnotationBed GeneFeature:${ano_bed} \
    --transformPositions ${seq_bed} \
    -scoringMetrics doench2014ontarget,moreno2015,JostandSantos,doench2016cfd,dangerous,hsu2013,minot,folding,rank,bedannotator

PS: you need to change the file suffix to .tgz and uncompress it.

4_debug.txt

aaronmck commented 2 years ago

Thanks, this was helpful! The issue is just zero-based vs one-based BED files. When I first wrote the tool it seemed more people assumed BED files were one-based (besides UCSC), but it seems things have changed and the standard is now zero-based. I've updated the code to be zero-based, there's a new 1.14 release that should fix this. Your target and "off-target" is now at 52211338 using your bed files.

Huanle commented 2 years ago

Hi @aaronmck ,

thanks a lot for your effort. It seems to be fixed. Is there any specific reason that you changed the output format by placing Doench2016CFDScore into the off-targets column rather than in an independent column as in the past?

aaronmck commented 2 years ago

I think it's in both places? DoenchCFD_maxOT and DoenchCFD_specificityscore are per target scores, whereas the Doench2016CFDScore is a per-off-target score? Maybe something got mangled here?

Huanle commented 2 years ago

hmm. i am not sure.

Another thing that seems to have been changed is to include the on-target into the offTargets. Maybe i was like this before and i failed to notice it?

Thanks a lot for your efforts.

aaronmck commented 2 years ago

Sure, the on-target should always be an 'off-target' if you're running against the same genome. Some scoring schemes take this into account, others (like Hsu et al.) need to be told what to do; we default to not including it in the score unless asked to (--countOnTargetInScore parameter).

Huanle commented 2 years ago

Hi @aaronmck ,

I found exceptions similar to what is shown in the attached image: the number of 0-mismatch off-target is 0 instead of q.

In my case, two-thirds of guides have 1-perfect off-target (aka, the guide/target themselves) and one-third have no perfectly matched off-target.

image

aaronmck commented 2 years ago

Your target overflowed (OVERFLOW set) which mean not all off-targets were captured. We do this to avoid processing targets that won't be suitable candidates. You can set the maximum with the --maximumOffTargets parameter.

Huanle commented 2 years ago

that makes sense.

Thanks a lot @aaronmck