maximilianh / crisporWebsite

All source code of the crispor.org website

http://crispor.org

Other

68 stars 43 forks source link

certain genomic target sequences not found in genome and crash lindel calculation #35

Closed genya closed 4 years ago

genya commented 4 years ago

Found that rarely certain genomic ranges crash the command line version (error output pasted at bottom); while on the website, inputting these sequences produces "Query sequence, not found in the selected genome, Homo sapiens (hg38)" output, even though inputting the genomic coordinates pulls up the target sequence.

>ENSG00000286185 AC242842.3 exon ENST00000621744.4_5 range=chr1:149482155-149482246 strand=+ ttctctgaatttatttacagAAAATGAAAGTGATGATGAGGAAGAGGAAGAAAAAGGGCCAGTGTCTCCCAGgtaatgttgtggaattgttg >ENSG00000261832 AC138894.1 exon ENST00000637378.1_8 range=chr16:28458366-28458441 strand=- tcatgtgttggctttttcagATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGgtaagtttgtttgttttcct >ENSG00000278662 GOLGA6L10 exon ENST00000610657.1_3 range=chr15:82346541-82346661 strand=- tctctctgcatgcacctcagAGCCAGTACCAAGAACTAGCAGTGGCCCTGGATTCAAGCTCCGCAATAATCAGTCAACTCACTGAAAACATCAATTCACTGgtaagagtccagtggggtcc >ENSG00000261247 GOLGA8T exon ENST00000569052.1_5 range=chr15:30139339-30139426 strand=+ ctgtcttcctcttcctacagGAAAAGAAAGCAAACAACAAGAAACAGAAAGCCAAAAGGGTGCTAGAGgtgagtggagggtgtgcagt >ENSG00000204172 AGAP9 exon ENST00000452145.6_1 range=chr10:47522846-47522954 strand=- ttctccctctatacatatagCTTTGGAGTTTAACCTTTCTGCCAATCCAGAGGCAAGCACAATATTCCAGAGGAACTCTCAAACAGATGgtgagacaacagtgtctgta >ENSG00000125498 KIR2DL1 exon ENST00000336077.11_0 range=chr19:54769831-54769904 strand=+ ctgtctgctccggcagcaccATGTCGCTCTTGGTCGTCAGCATGGCGTGTGTTGgtgagtcctggaaagcaata >ENSG00000240038 AMY2B exon ENST00000361355.8_0,ENST00000610648.1_0 range=chr1:103571583-103571790 strand=+ actgacaacttcaaagcaaaATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCCAAGGGATTTGGAGGGGTTCAGgtgggtatgattcatagtat >ENSG00000263956 NBPF11 exon ENST00000615281.4_10 range=chr1:148114417-148114508 strand=- ttctctgaatttatttacagAAAATGAAAGTGATGATGAGGAAGAGGAAGAAAAAGGGCCAGTGTCTCCCAGgtaatgttgtggaattgttg >ENSG00000269713 NBPF9 exon ENST00000615421.4_13,ENST00000584027.8_13 range=chr1:149063613-149063825 strand=- actttttcccacttttccagGCTCAGCAGGGAGCTGCTGGATGAGAAAGGGCCTGAAGTCTTGCAGGACTCACTGGATAGATGTTATTCAACTCCTTCAGGTTGTCTTGAACTGACTGACTCATGCCAGCCCTACAGAAGTGCCTTTTACGTATTGGAGCAACAGCGTGTTGGCTTGGCTGTTGACATGGATGgtgagtacctttctatgaag >ENSG00000260691 ANKRD20A1 exon ENST00000562196.5_9 range=chr9:67887256-67887324 strand=+ atatcccctttgctttgtagGGCCTCCTGCAAAACATCCTTCCTTGAAGgtaattaattatgtatattt

INFO:root: running on sequence 'ENSG00000286185 AC242842.3 exon ENST00000621744.4_5 range=chr1:149482155-149482246 strand=+', guideLen=20, seqLen=92 INFO:root:Progress sTFqCkJ9RFySD1swZXbw - bwasw - Searching genome for one 100% identical match to input sequence [M::bwa_idx_load_from_disk] read 0 ALT contigs [bsw2_aln] read 1 sequences/pairs (92 bp) ... [main] Version: 0.7.15-r1140 [main] CMD: /lab/solexa_sabatini/genya/crisporWebsite6/bin/Linux/bwa bwasw -T 20 /lab/solexa_sabatini/genya/crisporWebsite6/genomes/hg38/hg38.fa /tmp/crisporBestMatchttMxuE.fa [main] Real time: 5.965 sec; CPU: 5.857 sec INFO:root:Progress sTFqCkJ9RFySD1swZXbw - effScores - Calculating guide efficiency scores INFO:root:Progress sTFqCkJ9RFySD1swZXbw - outcome - Calculating editing outcomes Traceback (most recent call last): File "crispor.py", line 8304, in main() File "crispor.py", line 8301, in main mainCommandLine() File "crispor.py", line 8110, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4300, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3845, in processSubmission createBatchEffScoreTable(batchId, queue) File "crispor.py", line 3461, in createBatchEffScoreTable guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue) File "crispor.py", line 3402, in calcSaveEffScores mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz, scoreNames=mutScoreNames) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 1311, in calcMutSeqs mutSeqDict = calcLindelScore(seqIds, seqs) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 748, in calcLindelScore return runLindel(seqIds, trimSeqs(seqs, -33, 27)) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 724, in runLindel y_hat, fs = Lindel.Predictor.gen_prediction(seq,weights,prerequesites) ValueError: too many values to unpack Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs func(targs, **kargs) File "crispor.py", line 7868, in delBatchDir raise Exception("cowardly refusing to remove many temp files") Exception: cowardly refusing to remove many temp files

maximilianh commented 4 years ago

Wow. That is an unexpected error. I'll simply increase the "coward" limit... :-)

thanks!

genya commented 4 years ago

When I commented out the "coward" raise exception lines, the code still crashed. I think the issue is upstream when "ValueError: too many values to unpack" occurs.

genya commented 4 years ago

I believe I found the issue: all these problematic sequences have annotations in the repeat track of the genome browser, and CRISPOR seems to align to a repeat-masked reference genome. This would explain why when queries such as chr5:1318349-1318474 are submitted to the website, the corresponding sequence is retrieved (presumably from a non-repeat masked reference genome) but then the outputs have the warning "Query sequence, not found in the selected genome, Homo sapiens (hg38)".

This also would explain why these types of sequences can crash the Lindel calculation and contain guides with NotEnoughFlankSequence errors for the Doench calculation -- the additional needed flanking sequence cannot be retrieved.

From the user end, one solution is to never run CRISPOR on repeat masked regions, but if it's computationally feasible to use a non-repeat-masked reference genome for alignment, then these regions could be treated as any others and the guides would simply have appropriately lower specificity scores.

These repeat regions do contain conserved protein coding sequences. Here are some examples: chr5:1318349-1318474 chr6:32035116-32035298 chr7:103086610-103086785 chr9:65698495-65698582 chr15:82349272-82349396 chr17:36211756-36211911

maximilianh commented 4 years ago

Hm, no, Crispor does not align to a masked genome. Otherwise it wouldn't be able to handle 50% of the genome. Also, this would be a really bad idea overall, ignoring 50% of the genome.

There are many repeats that work just fine in crispor, right?

Isn't this just a Lindel problem? Maybe I've misunderstood what the root of the problem is...?

On Mon, Sep 23, 2019 at 2:27 PM Evgeni Frenkel notifications@github.com wrote:

I believe I found the issue: all these problematic sequences have annotations in the repeat track of the genome browser, and CRISPOR seems to align to a repeat-masked reference genome. This would explain why when queries such as chr5:1318349-1318474 are submitted to the website, the corresponding sequence is retrieved (presumably from a non-repeat masked reference genome) but then the outputs have the warning "Query sequence, not found in the selected genome, Homo sapiens (hg38)".

This also would explain why these types of sequences can crash the Lindel calculation and contain guides with NotEnoughFlankSequence errors for the Doench calculation -- the additional needed flanking sequence cannot be retrieved.

From the user end, one solution is to never run CRISPOR on repeat masked regions, but if it's computationally feasible to use a non-repeat-masked reference genome for alignment, then these regions could be treated as any others and the guides would simply have appropriately lower specificity scores.

Here are some more examples: chr5:1318349-1318474 chr6:32035116-32035298 chr7:103086610-103086785 chr9:65698495-65698582 chr15:82349272-82349396 chr17:36211756-36211911 chr22:18939653-18939723

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TN6YRZG3F5M7RKNSX3QLCY2JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7KV2IA#issuecomment-534076704, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TPKKHHPZFPT7JWIYMDQLCY2JANCNFSM4IPPJ6AA .

genya commented 4 years ago

initially I noticed that some of the guides within those regions crash the lindel calculation, which is why I mentioned that initially, but since then I've noticed that these regions also cause the Doench score calculator to return NotEnoughFlankSequnce. The same is true on the CRISPOR website: inputs like chr5:1318349-1318474 produce a warning "Query sequence, not found in the selected genome, Homo sapiens (hg38)" even though the software does find the target sequence and generates the list of candidate guides for it. Of those guides, the ones that are close to the edge lack efficiency scores. So looks like the issue is flanking sequence cannot be found for the efficiency and frameshift calculations.

I collected several examples like these (six of which are listed in the prev post) and examined them in the genome browser. The only thing I could see that distinguished them was that they have annotations in the repeats track (in the case of chr5:1318349-1318474, the annotation is human chain self alignment ID 2562). I then noticed that under the "Adding a Genome" section of the git documentation page, the example provided for the frog genome

X. laevis genome: sudo crisprAddGenome fasta /tmp2/LAEVIS_7.1.repeatMasked.fa --desc 'xenBaseLaevis71|Xenopus laevis|X. laevis|Xenbase V7.1' --gff geneModels.gff3

used a file with "repeatMasked.fa" suffix so I guessed that CRISPOR is using a repeat masked genome, at least for finding flanking sequence for the efficiency calculations. Most or all of the example problematic sequences I listed also map to alternative contigs, so that may be the root of the problem instead or part of it.

I also noticed that in some of these cases, there's huge discrepancy between the MIT specificity score and the CFD specificity score. For example on the CRISPOR website, the input chr6:32035116-32035298, produces guides with MIT specificity scores <50 and CFD scores >90. The MIT scores are correct because this sequence is duplicated at chr6:32002380-32002561.

maximilianh commented 4 years ago

Thanks!

Could it be that it’s impossible to place these sequences onto one single location in the genome ? If they have perfect duplicates then crispor may not know where to place them...

On Mon 23 Sep 2019 at 18:50, Evgeni Frenkel notifications@github.com wrote:

initially I noticed that some of the guides within those regions crash the lindel calculation, which is why I mentioned that initially, but since then I've noticed that these regions also cause the Doench score calculator to return NotEnoughFlankSequnce. The same is true on the CRISPOR website: inputs like chr5:1318349-1318474 produce a warning "Query sequence, not found in the selected genome, Homo sapiens (hg38)" even though the software does find the target sequence and generates the list of candidate guides for it. Of those guides, the ones that are close to the edge lack efficiency scores. So it's pretty clear that the issue is that the flanking sequence cannot be found for the efficiency and frameshift calculations.

I collected several examples like these (six of which are listed in the prev post) and examined them in the genome browser. The only thing I could see that distinguished them was that they have annotations in the repeats track (in the case of chr5:1318349-1318474, the annotation is human chain self alignment ID 2562). I then noticed that under the "Adding a Genome" section of the git documentation page, the example provided for the frog genome

X. laevis genome: sudo crisprAddGenome fasta /tmp2/LAEVIS_7.1.repeatMasked.fa --desc 'xenBaseLaevis71|Xenopus laevis|X. laevis|Xenbase V7.1' --gff geneModels.gff3

used a file with "repeatMasked.fa" suffix so I guessed that CRISPOR is using a repeat masked genome, at least for finding flanking sequence for the efficiency calculations. These sequences also often map to alternative contigs, so that may be the root of the problem instead or part of it.

I also noticed that in some of these cases, there's huge discrepancy between the MIT specificity score and the CFD specificity score. For example on the CRISPOR website, the input chr6:32035116-32035298, produces guides with MIT specificity scores <50 and CFD scores >90. The MIT scores are correct because this sequence is duplicated at chr6:32002380-32002561.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TNORMWJA55VWDR65MLQLDXV5A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7LQP7A#issuecomment-534185980, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TIME3TQVNK6WZRO7ZDQLDXV5ANCNFSM4IPPJ6AA .

genya commented 4 years ago

In some cases the only other perfect match is to an alternative contig, e.g. chr5:1318349-1318474 matches a segment on chr5_KI270791v1_alt. I thought the alternative contigs were excluded from the alignment? If not, how to exclude them? For command-line crispor, the files hg38.sizes and hg38.segments.bed both contain alt contig entries. Would it be enough to remove them? I'm on a server that lacks one of the packages (mysqldb) needed to run crisprAddGenome. It may be that removing alt contigs fixes the problem in most cases. It makes sense that not knowing where to align the target sequence would make the task of obtaining flanking sequence impossible... For the crispor website, the problem still arises when specifying the target sequence with genomic coordinates, though in principle in those cases the flanking sequence is unambiguous.

-- Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

On Mon, Sep 23, 2019 at 3:31 PM Maximilian Haeussler < notifications@github.com> wrote:

Thanks!

Could it be that it’s impossible to place these sequences onto one single location in the genome ? If they have perfect duplicates then crispor may not know where to place them...

On Mon 23 Sep 2019 at 18:50, Evgeni Frenkel notifications@github.com wrote:

initially I noticed that some of the guides within those regions crash the lindel calculation, which is why I mentioned that initially, but since then I've noticed that these regions also cause the Doench score calculator to return NotEnoughFlankSequnce. The same is true on the CRISPOR website: inputs like chr5:1318349-1318474 produce a warning "Query sequence, not found in the selected genome, Homo sapiens (hg38)" even though the software does find the target sequence and generates the list of candidate guides for it. Of those guides, the ones that are close to the edge lack efficiency scores. So it's pretty clear that the issue is that the flanking sequence cannot be found for the efficiency and frameshift calculations.

I collected several examples like these (six of which are listed in the prev post) and examined them in the genome browser. The only thing I could see that distinguished them was that they have annotations in the repeats track (in the case of chr5:1318349-1318474, the annotation is human chain self alignment ID 2562). I then noticed that under the "Adding a Genome" section of the git documentation page, the example provided for the frog genome

X. laevis genome: sudo crisprAddGenome fasta /tmp2/LAEVIS_7.1.repeatMasked.fa --desc 'xenBaseLaevis71|Xenopus laevis|X. laevis|Xenbase V7.1' --gff geneModels.gff3

used a file with "repeatMasked.fa" suffix so I guessed that CRISPOR is using a repeat masked genome, at least for finding flanking sequence for the efficiency calculations. These sequences also often map to alternative contigs, so that may be the root of the problem instead or part of it.

I also noticed that in some of these cases, there's huge discrepancy between the MIT specificity score and the CFD specificity score. For example on the CRISPOR website, the input chr6:32035116-32035298, produces guides with MIT specificity scores <50 and CFD scores >90. The MIT scores are correct because this sequence is duplicated at chr6:32002380-32002561.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub < https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TNORMWJA55VWDR65MLQLDXV5A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7LQP7A#issuecomment-534185980 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AACL4TIME3TQVNK6WZRO7ZDQLDXV5ANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZXOOA4ETCWS7Q5FN6TQLEKRXA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7L77UI#issuecomment-534249425, or mute the thread https://github.com/notifications/unsubscribe-auth/AABRLZQCCMF27HPIKPX5HYLQLEKRXANCNFSM4IPPJ6AA .

genya commented 4 years ago

please disregard my last post because realized an easy workaround is to provide 50bp flanking sequence as part of the input. Then the remaining issue is really just the minor one that the Lindel score seems to crash when it cannot find flanking sequence because of the ambiguous alignment. In those cases the Doench score returns a NotEnoughFlankingSequence string instead of an efficiency score. The Lindel score could do the same instead of crashing but it works in most cases as it is. Thanks for your help troubleshooting this!

maximilianh commented 4 years ago

Thanks! Sorry, I didn't have time yesterday to chime in and honestly, I totally forgot that adding flanking sequences is the right solution.

Hm, I think this has come up before: shouldn't hg38 by default not include the ALT sequences? I think it makes little sense to include ALTs in there anyways, wouldn't that solve the major part of this problem?

On Wed, Sep 25, 2019 at 3:24 PM Evgeni Frenkel notifications@github.com wrote:

please disregard my last post because realized an easy workaround is to provide 50bp flanking sequence as part of the input. Then the remaining issue is really just the minor one that the Lindel score seems to crash when it cannot find flanking sequence because of the ambiguous alignment. In those cases the Doench score returns a NotEnoughFlankingSequence string instead of an efficiency score. The Lindel score could do the same instead of crashing but it works in most cases as it is. Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TNLBA2XM4YCXDH5FXLQLNQ7DA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7R4FUQ#issuecomment-535020242, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TOYI4YPAXQWYHVEG3TQLNQ7DANCNFSM4IPPJ6AA .

maximilianh commented 4 years ago

Also, when you align, usually you use a special version of the genome without the ALTs... so crispor probably should do that, too...

On Wed, Sep 25, 2019 at 4:12 PM Maximilian Haeussler maximilianh@gmail.com wrote:

Thanks! Sorry, I didn't have time yesterday to chime in and honestly, I totally forgot that adding flanking sequences is the right solution.

Hm, I think this has come up before: shouldn't hg38 by default not include the ALT sequences? I think it makes little sense to include ALTs in there anyways, wouldn't that solve the major part of this problem?

On Wed, Sep 25, 2019 at 3:24 PM Evgeni Frenkel notifications@github.com wrote:

please disregard my last post because realized an easy workaround is to provide 50bp flanking sequence as part of the input. Then the remaining issue is really just the minor one that the Lindel score seems to crash when it cannot find flanking sequence because of the ambiguous alignment. In those cases the Doench score returns a NotEnoughFlankingSequence string instead of an efficiency score. The Lindel score could do the same instead of crashing but it works in most cases as it is. Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TNLBA2XM4YCXDH5FXLQLNQ7DA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7R4FUQ#issuecomment-535020242, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TOYI4YPAXQWYHVEG3TQLNQ7DANCNFSM4IPPJ6AA .

genya commented 4 years ago

I just re-installed CRISPOR to make sure my version is up-to-date and the default hg38 definitely still includes the _alt and _random contigs, as well as chrM and chrUn.

Also guides targeting PAR regions are still getting falsely assigned perfectly-matching off-target sites. Example Input: gtggcttctcagttgctctctccttttgcagAAAGAACTGGAAGCCGCCTGGGGCGTGGAGGTGTTTGACCGCTTCACGGTCGTCCTGCACATCTTCCGCTGTAACGCCCGCACGAAGGAGGCCCGGCTTCAGGTGGCCCTGGCGGAGATGCCGCTGCACAG

Example output (for guide CCGCTGTAACGCCCGCACGA whose only exact match according to BLAT is in the PAR of X and Y)

guideId targetSeq mitSpecScore offtargetCount targetGenomeGeneLocus Doench '16-Score Moreno-Mateos-Score Out-of-Frame-Score Lindel-Score 136forw CCGCTGTAACGCCCGCACGAAGG 49 35 59 53 74 90

guideId guideSeq offtargetSeq mismatchPos mismatchCount mitOfftargetScore cfdOfftargetScore chrom start end strand locusDesc 136forw CCGCTGTAACGCCCGCACGAAGG CCGCTGTAACGCCCGCACGAAGG .................... 0 100 1 chrY 314933 314955 - PAR1 exon:GTPBP6

maximilianh commented 4 years ago

Yes, I have never changed the hg38 genome, as I was afraid of touching it at this late stage.

I just realized that NCBI provides the "analysis set" for hg38: it has one copy of the PAR regions hard-masked and all the alts removed (though not the randoms). Wouldn't that be a much better default as the hg38 genome?

On Wed, Sep 25, 2019 at 4:32 PM Evgeni Frenkel notifications@github.com wrote:

I just re-installed CRISPOR to make sure my version is up-to-date and the default hg38 definitely still includes the _alt and _random contigs, as well as chrM and chrUn.

Also guides targeting PAR regions are still getting falsely assigned perfectly-matching off-target sites. Example Input:

PARregionTest

gtggcttctcagttgctctctccttttgcagAAAGAACTGGAAGCCGCCTGGGGCGTGGAGGTGTTTGACCGCTTCACGGTCGTCCTGCACATCTTCCGCTGTAACGCCCGCACGAAGGAGGCCCGGCTTCAGGTGGCCCTGGCGGAGATGCCGCTGCACAG

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKXQOVXDCL5ICWLTCLQLNY7JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SD4SY#issuecomment-535051851, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TO7PNI5ZIUEILRKLBTQLNY7JANCNFSM4IPPJ6AA .

genya commented 4 years ago

Yes that would be much better! Is there a way to exclude regions by editing the bed files?

On Wed, Sep 25, 2019 at 10:45 AM Maximilian Haeussler < notifications@github.com> wrote:

Yes, I have never changed the hg38 genome, as I was afraid of touching it at this late stage.

I just realized that NCBI provides the "analysis set" for hg38: it has one copy of the PAR regions hard-masked and all the alts removed (though not the randoms). Wouldn't that be a much better default as the hg38 genome?

On Wed, Sep 25, 2019 at 4:32 PM Evgeni Frenkel notifications@github.com wrote:

I just re-installed CRISPOR to make sure my version is up-to-date and the default hg38 definitely still includes the _alt and _random contigs, as well as chrM and chrUn.

Also guides targeting PAR regions are still getting falsely assigned perfectly-matching off-target sites. Example Input:

PARregionTest

gtggcttctcagttgctctctccttttgcagAAAGAACTGGAAGCCGCCTGGGGCGTGGAGGTGTTTGACCGCTTCACGGTCGTCCTGCACATCTTCCGCTGTAACGCCCGCACGAAGGAGGCCCGGCTTCAGGTGGCCCTGGCGGAGATGCCGCTGCACAG

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKXQOVXDCL5ICWLTCLQLNY7JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SD4SY#issuecomment-535051851 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AACL4TO7PNI5ZIUEILRKLBTQLNY7JANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZSBMM33MGKXYUJ6QTTQLN2P7A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SFL2A#issuecomment-535057896, or mute the thread https://github.com/notifications/unsubscribe-auth/AABRLZRROLYFNXXAJWPS52TQLN2P7ANCNFSM4IPPJ6AA .

-- -- Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

maximilianh commented 4 years ago

No there is no BED-based removal. You could add them to the regions file and add some logic to remove them, but it I think it's much easier to use the "official" NCBI assembly recommended for analysis. I must admit I had no idea of the whole ALT business until recently.

On Wed, Sep 25, 2019 at 4:55 PM Evgeni Frenkel notifications@github.com wrote:

Yes that would be much better! Is there a way to exclude regions by editing the bed files?

On Wed, Sep 25, 2019 at 10:45 AM Maximilian Haeussler < notifications@github.com> wrote:

Yes, I have never changed the hg38 genome, as I was afraid of touching it at this late stage.

I just realized that NCBI provides the "analysis set" for hg38: it has one copy of the PAR regions hard-masked and all the alts removed (though not the randoms). Wouldn't that be a much better default as the hg38 genome?

On Wed, Sep 25, 2019 at 4:32 PM Evgeni Frenkel <notifications@github.com

wrote:

I just re-installed CRISPOR to make sure my version is up-to-date and the default hg38 definitely still includes the _alt and _random contigs, as well as chrM and chrUn.

Also guides targeting PAR regions are still getting falsely assigned perfectly-matching off-target sites. Example Input:

PARregionTest

gtggcttctcagttgctctctccttttgcagAAAGAACTGGAAGCCGCCTGGGGCGTGGAGGTGTTTGACCGCTTCACGGTCGTCCTGCACATCTTCCGCTGTAACGCCCGCACGAAGGAGGCCCGGCTTCAGGTGGCCCTGGCGGAGATGCCGCTGCACAG

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKXQOVXDCL5ICWLTCLQLNY7JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SD4SY#issuecomment-535051851

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AACL4TO7PNI5ZIUEILRKLBTQLNY7JANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZSBMM33MGKXYUJ6QTTQLN2P7A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SFL2A#issuecomment-535057896 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AABRLZRROLYFNXXAJWPS52TQLN2P7ANCNFSM4IPPJ6AA

.

--

Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKFTXGQS3ZVK6YBMTDQLN3XHA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SGSPI#issuecomment-535062845, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TNLJWVOR2NEXPNQEI3QLN3XHANCNFSM4IPPJ6AA .

genya commented 4 years ago

Ok that sounds good enough. Whenever you get a chance to update the default hg38 will be great — thank you so much!

On Wed, Sep 25, 2019 at 11:05 AM Maximilian Haeussler < notifications@github.com> wrote:

No there is no BED-based removal. You could add them to the regions file and add some logic to remove them, but it I think it's much easier to use the "official" NCBI assembly recommended for analysis. I must admit I had no idea of the whole ALT business until recently.

On Wed, Sep 25, 2019 at 4:55 PM Evgeni Frenkel notifications@github.com wrote:

Yes that would be much better! Is there a way to exclude regions by editing the bed files?

On Wed, Sep 25, 2019 at 10:45 AM Maximilian Haeussler < notifications@github.com> wrote:

Yes, I have never changed the hg38 genome, as I was afraid of touching it at this late stage.

I just realized that NCBI provides the "analysis set" for hg38: it has one copy of the PAR regions hard-masked and all the alts removed (though not the randoms). Wouldn't that be a much better default as the hg38 genome?

On Wed, Sep 25, 2019 at 4:32 PM Evgeni Frenkel < notifications@github.com

wrote:

I just re-installed CRISPOR to make sure my version is up-to-date and the default hg38 definitely still includes the _alt and _random contigs, as well as chrM and chrUn.

Also guides targeting PAR regions are still getting falsely assigned perfectly-matching off-target sites. Example Input:

PARregionTest

gtggcttctcagttgctctctccttttgcagAAAGAACTGGAAGCCGCCTGGGGCGTGGAGGTGTTTGACCGCTTCACGGTCGTCCTGCACATCTTCCGCTGTAACGCCCGCACGAAGGAGGCCCGGCTTCAGGTGGCCCTGGCGGAGATGCCGCTGCACAG

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKXQOVXDCL5ICWLTCLQLNY7JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SD4SY#issuecomment-535051851

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AACL4TO7PNI5ZIUEILRKLBTQLNY7JANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZSBMM33MGKXYUJ6QTTQLN2P7A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SFL2A#issuecomment-535057896

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AABRLZRROLYFNXXAJWPS52TQLN2P7ANCNFSM4IPPJ6AA

.

--

Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKFTXGQS3ZVK6YBMTDQLN3XHA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SGSPI#issuecomment-535062845 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AACL4TNLJWVOR2NEXPNQEI3QLN3XHANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZUSLM2D4IDUY5QE44DQLN42JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SHUKY#issuecomment-535067179, or mute the thread https://github.com/notifications/unsubscribe-auth/AABRLZUDM25B7GPOSWMN3JLQLN42JANCNFSM4IPPJ6AA .

-- -- Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

maximilianh commented 4 years ago

I wonder...Haven't we had this discussion before? Maybe I've had it with someone else in here...

I should re-index this other ideal hg38 and then provide it for download so you can grab it, right? That's OK with me, just confirming.

On Wed, Sep 25, 2019 at 5:07 PM Evgeni Frenkel notifications@github.com wrote:

Ok that sounds good enough. Whenever you get a chance to update the default hg38 will be great — thank you so much!

On Wed, Sep 25, 2019 at 11:05 AM Maximilian Haeussler < notifications@github.com> wrote:

No there is no BED-based removal. You could add them to the regions file and add some logic to remove them, but it I think it's much easier to use the "official" NCBI assembly recommended for analysis. I must admit I had no idea of the whole ALT business until recently.

On Wed, Sep 25, 2019 at 4:55 PM Evgeni Frenkel <notifications@github.com

wrote:

Yes that would be much better! Is there a way to exclude regions by editing the bed files?

On Wed, Sep 25, 2019 at 10:45 AM Maximilian Haeussler < notifications@github.com> wrote:

Yes, I have never changed the hg38 genome, as I was afraid of touching it at this late stage.

I just realized that NCBI provides the "analysis set" for hg38: it has one copy of the PAR regions hard-masked and all the alts removed (though not the randoms). Wouldn't that be a much better default as the hg38 genome?

On Wed, Sep 25, 2019 at 4:32 PM Evgeni Frenkel < notifications@github.com

wrote:

I just re-installed CRISPOR to make sure my version is up-to-date and the default hg38 definitely still includes the _alt and _random contigs, as well as chrM and chrUn.

Also guides targeting PAR regions are still getting falsely assigned perfectly-matching off-target sites. Example Input:

PARregionTest

gtggcttctcagttgctctctccttttgcagAAAGAACTGGAAGCCGCCTGGGGCGTGGAGGTGTTTGACCGCTTCACGGTCGTCCTGCACATCTTCCGCTGTAACGCCCGCACGAAGGAGGCCCGGCTTCAGGTGGCCCTGGCGGAGATGCCGCTGCACAG

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKXQOVXDCL5ICWLTCLQLNY7JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SD4SY#issuecomment-535051851

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AACL4TO7PNI5ZIUEILRKLBTQLNY7JANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZSBMM33MGKXYUJ6QTTQLN2P7A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SFL2A#issuecomment-535057896

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AABRLZRROLYFNXXAJWPS52TQLN2P7ANCNFSM4IPPJ6AA

.

--

Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKFTXGQS3ZVK6YBMTDQLN3XHA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SGSPI#issuecomment-535062845

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AACL4TNLJWVOR2NEXPNQEI3QLN3XHANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZUSLM2D4IDUY5QE44DQLN42JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SHUKY#issuecomment-535067179 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AABRLZUDM25B7GPOSWMN3JLQLN42JANCNFSM4IPPJ6AA

.

--

Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TJSN74ZFLFKG2UWYQLQLN5D5A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SH47Q#issuecomment-535068286, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TKVZLIEGGSS4TFEHULQLN5D5ANCNFSM4IPPJ6AA .

genya commented 4 years ago

Yes that would be great— thank you! I think I may have been one of the people to raise the issue of alt contigs before.

On Wed, Sep 25, 2019 at 11:43 AM Maximilian Haeussler < notifications@github.com> wrote:

I wonder...Haven't we had this discussion before? Maybe I've had it with someone else in here...

I should re-index this other ideal hg38 and then provide it for download so you can grab it, right? That's OK with me, just confirming.

On Wed, Sep 25, 2019 at 5:07 PM Evgeni Frenkel notifications@github.com wrote:

Ok that sounds good enough. Whenever you get a chance to update the default hg38 will be great — thank you so much!

On Wed, Sep 25, 2019 at 11:05 AM Maximilian Haeussler < notifications@github.com> wrote:

No there is no BED-based removal. You could add them to the regions file and add some logic to remove them, but it I think it's much easier to use the "official" NCBI assembly recommended for analysis. I must admit I had no idea of the whole ALT business until recently.

On Wed, Sep 25, 2019 at 4:55 PM Evgeni Frenkel < notifications@github.com

wrote:

Yes that would be much better! Is there a way to exclude regions by editing the bed files?

On Wed, Sep 25, 2019 at 10:45 AM Maximilian Haeussler < notifications@github.com> wrote:

Yes, I have never changed the hg38 genome, as I was afraid of touching it at this late stage.

I just realized that NCBI provides the "analysis set" for hg38: it has one copy of the PAR regions hard-masked and all the alts removed (though not the randoms). Wouldn't that be a much better default as the hg38 genome?

On Wed, Sep 25, 2019 at 4:32 PM Evgeni Frenkel < notifications@github.com

wrote:

I just re-installed CRISPOR to make sure my version is up-to-date and the default hg38 definitely still includes the _alt and _random contigs, as well as chrM and chrUn.

Also guides targeting PAR regions are still getting falsely assigned perfectly-matching off-target sites. Example Input:

PARregionTest

gtggcttctcagttgctctctccttttgcagAAAGAACTGGAAGCCGCCTGGGGCGTGGAGGTGTTTGACCGCTTCACGGTCGTCCTGCACATCTTCCGCTGTAACGCCCGCACGAAGGAGGCCCGGCTTCAGGTGGCCCTGGCGGAGATGCCGCTGCACAG

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKXQOVXDCL5ICWLTCLQLNY7JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SD4SY#issuecomment-535051851

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AACL4TO7PNI5ZIUEILRKLBTQLNY7JANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZSBMM33MGKXYUJ6QTTQLN2P7A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SFL2A#issuecomment-535057896

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AABRLZRROLYFNXXAJWPS52TQLN2P7ANCNFSM4IPPJ6AA

.

--

Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TKFTXGQS3ZVK6YBMTDQLN3XHA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SGSPI#issuecomment-535062845

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AACL4TNLJWVOR2NEXPNQEI3QLN3XHANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <

https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZUSLM2D4IDUY5QE44DQLN42JA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SHUKY#issuecomment-535067179

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AABRLZUDM25B7GPOSWMN3JLQLN42JANCNFSM4IPPJ6AA

.

--

Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TJSN74ZFLFKG2UWYQLQLN5D5A5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SH47Q#issuecomment-535068286 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AACL4TKVZLIEGGSS4TFEHULQLN5D5ANCNFSM4IPPJ6AA

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AABRLZU3Z7DPVCKJOO6CD6TQLOBINA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7SLZWQ#issuecomment-535084250, or mute the thread https://github.com/notifications/unsubscribe-auth/AABRLZTEUA7WKQJEHWTJGALQLOBINANCNFSM4IPPJ6AA .

-- -- Evgeni (Genya) Frenkel, PhD Whitehead Institute for Biomedical Research Lab of David M Sabatini http://sabatinilab.wi.mit.edu/membersDS.html

genya commented 4 years ago

I downloaded the analysis version of the genome and its gff file to add it manually using tools/crisprAddGenome, but the process terminated with the following seemingly-trivial syntax error (after the BWT, BWA portions completed). This is in a python2 virtual environment. Should I be running python3 here?

INFO:root:Running: ['/bin/bash', '-e', '-o', 'pipefail', '-c', 'cat /tmp/hg38noAlts/hg38noAlts.gp | awk \'BEGIN {FS=OFS="\t";} // { gsub(" " , "", $1); gsub(" ", "", $12); print }\' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38noAlts/hg38noAlts.sizes | bedSort stdin /tmp/hg38noAlts/hg38noAlts.segments.bed'] File "/lab/solexa_sabatini/genya/crisporWebsite/bin/bedOverlapMerge", line 28 print "\t".join(row) ^ SyntaxError: invalid syntax File "/lab/solexasabatini/genya/crisporWebsite/bin/bedBetween", line 71 print "Error: start position %d beyond chromSize" % start ^ SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Error: start position %d beyond chromSize" % start)? Writing error : Broken pipe stdout is truncated, sorry. Could not run command cat /tmp/hg38noAlts/hg38noAlts.gp | awk 'BEGIN {FS=OFS=" ";} // { gsub(" " , "", $1); gsub(" ", "_", $12); print }' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38noAlts/hg38noAlts.sizes | bedSort stdin /tmp/hg38noAlts/hg38noAlts.segments.bed

maximilianh commented 4 years ago

I’m also running it here.

You’re running this in python 3 by its a python 2 script. I should make this clear in the she bang line... you’re the second person who stumbles over this...

On Fri 27 Sep 2019 at 21:24, Evgeni Frenkel notifications@github.com wrote:

I downloaded the analysis version of the genome and its gff file to add it manually but the process terminated with the following seemingly-trivial syntax error. This is in a python2 virtual environment. Should I be running python3 here?

INFO:root:Running: ['/bin/bash', '-e', '-o', 'pipefail', '-c', 'cat /tmp/hg38noAlts/hg38noAlts.gp | awk 'BEGIN {FS=OFS="\t";} // { gsub(" " , "", $1); gsub(" ", "", $12); print }' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38noAlts/hg38noAlts.sizes | bedSort stdin /tmp/hg38noAlts/hg38noAlts.segments.bed'] File "/lab/solexa_sabatini/genya/crisporWebsite/bin/bedOverlapMerge", line 28 print "\t".join(row) ^ SyntaxError: invalid syntax File "/lab/solexa_sabatini/genya/crisporWebsite/bin/bedBetween", line 71 print "Error: start position %d beyond chromSize" % start ^ SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Error: start position %d beyond chromSize" % start)? Writing error : Broken pipe stdout is truncated, sorry. Could not run command cat /tmp/hg38noAlts/hg38noAlts.gp | awk 'BEGIN {FS=OFS=" ";} // { gsub(" " , "", $1); gsub(" ", "", $12); print }' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38noAlts/hg38noAlts.sizes | bedSort stdin /tmp/hg38noAlts/hg38noAlts.segments.bed

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TIBOTDXREPSAZWBLUTQLZMXZA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7Z33MI#issuecomment-536067505, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TKX5LBJLVGFURQJNADQLZMXZANCNFSM4IPPJ6AA .

genya commented 4 years ago

oh that's strange, I was running it from within a python2 virtual environment precisely to avoid this issue, calling python --version yields Python 2.7.15+
Does the addGenome script call new instances of python? Python3 is default on the system. The error does seem to arise from running python3 instead of 2, as you suggested, because print "somestring" is ok in python2, while causes an error in python3 like the one observed.

genya commented 4 years ago

to avoid python2 vs 3 issues, I tried changing the shebang to #!/usr/bin/python2 and rerunning the crisprAddGenome script. This time I get a different kind of error, the bed module for "import bed" in bin/bedBetween cannot be found. There may also still be an error during printing in the bedOverlapMerge script, not sure if it's independent of the bed module problem. The relevant error output is pasted below. If you have any suggestions on how to resolve this, would be much appreciated -- thank you!

INFO:root:Running: ['/bin/bash', '-e', '-o', 'pipefail', '-c', 'cat /tmp/hg38noAlts5/hg38noAlts5.gp | awk \'BEGIN {FS=OFS="\t";} // { gsub(" " , "", $1); gsub(" ", "", $12); print }\' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38noAlts5/hg38noAlts5.sizes | bedSort stdin /tmp/hg38noAlts5/hg38noAlts5.segments.bed'] Traceback (most recent call last): File "/lab/solexa_sabatini/genya/crisporWebsite6/bin/bedBetween", line 8, in import bed ImportError: No module named bed Traceback (most recent call last): File "/lab/solexa_sabatini/genya/crisporWebsite6/bin/bedOverlapMerge", line 63, in printLine(lastChrom, lastStart, lastEnd, names) File "/lab/solexa_sabatini/genya/crisporWebsite6/bin/bedOverlapMerge", line 28, in printLine print("\t".join(row)) IOError: [Errno 32] Broken pipe Writing error : Broken pipe stdout is truncated, sorry.

genya commented 4 years ago

Progress and remaining questions on using tools/crisprAddGenome to make a version of hg38 without alt loci:

1) changing the shebangs in bedOverlapMerge and bedBetween to specify python2 environment resolves the print errors 2) copying bed.py from tools/usrLocalBin to bin/ resolves the "import bed... Import error" arising during execution of bedBetween (incidentally, copying other files from usrLocalBin to bin caused errors) 3) the gff file for the "analysis" hg38 (GCA_000001405.15_GRCh38_no_alt_analysis_set.fna) has references to the alt loci and these cause KeyError in bedBetween. I removed lines in the gff file that referred to chromosomes not found in the analysis-version hg38, as well as ChrM from both, and that resolved this error.

4) However, yet another problem arose, apparently having to do with permission to make directories in /var/www. Trying to resolve this by changing the variable crisprGenomeDir in crisprAddGenome from "/var/www/crispor/genomes/" to one of my home directories.

INFO:root:Running: ['/bin/bash', '-e', '-o', 'pipefail', '-c', 'cat /tmp/hg38altsFiltered/hg38altsFiltered.gp | awk \'BEGIN {FS=OFS="\t";} // { gsub(" " , "", $1); gsub(" ", "", $12); print }\' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38altsFiltered/hg38altsFiltered.sizes | bedSort stdin /tmp/hg38altsFiltered/hg38altsFiltered.segments.bed'] Reading beds ... Reading stdin... 307679 features read 307748 features written 0 genes/exons were dropped because one includes the other Traceback (most recent call last): File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 1008, in getFastaAndGff(genomeRows, options.gffFname, options) File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 852, in getFastaAndGff moveToDest(outFnames, finalDbDir) File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 276, in moveToDest os.makedirs(finalDbDir) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 157, in makedirs mkdir(name, mode) OSError: [Errno 13] Permission denied: '/var/www'

maximilianh commented 4 years ago

Sorry for the delay, I ran into similar problems as you did. No idea why NCBI includes invalid chrome in their GFF file, I used the old hg38 .bed instead. Also, I've fixed the python version now in these files. Do you want to use the result: http://crispor.tefor.net/genomes/hg38anset/

On Mon, Sep 30, 2019 at 6:40 AM Evgeni Frenkel notifications@github.com wrote:

Progress and remaining questions on using tools/crisprAddGenome to make a version of hg38 without alt loci:

1.

changing the shebangs in bedOverlapMerge and bedBetween to specify python2 environment resolves the print errors 2.

copying bed.py from tools/usrLocalBin to bin/ resolves the "import bed... Import error" arising during execution of bedBetween (incidentally, copying other files from usrLocalBin to bin caused errors) 3.

the gff file for the "analysis" hg38 (GCA_000001405.15_GRCh38_no_alt_analysis_set.fna) has references to the alt loci and these cause KeyError in bedBetween. I removed lines in the gff file that referred to chromosomes not found in the analysis-version hg38, as well as ChrM from both, and that resolved this error. 4.

However, yet another problem arose, apparently having to do with permission to make directories in /var/www. Trying to resolve this by changing the variable crisprGenomeDir in crisprAddGenome from "/var/www/crispor/genomes/" to one of my home directories.

INFO:root:Running: ['/bin/bash', '-e', '-o', 'pipefail', '-c', 'cat /tmp/hg38altsFiltered/hg38altsFiltered.gp | awk 'BEGIN {FS=OFS="\t";} // { gsub(" " , "", $1); gsub(" ", "", $12); print }' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38altsFiltered/hg38altsFiltered.sizes | bedSort stdin /tmp/hg38altsFiltered/hg38altsFiltered.segments.bed'] Reading beds ... Reading stdin... 307679 features read 307748 features written 0 genes/exons were dropped because one includes the other Traceback (most recent call last): File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 1008, in getFastaAndGff(genomeRows, options.gffFname, options) File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 852, in getFastaAndGff moveToDest(outFnames, finalDbDir) File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 276, in moveToDest os.makedirs(finalDbDir) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 157, in makedirs mkdir(name, mode) OSError: [Errno 13] Permission denied: '/var/www'

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TNOTXTK3VA65ZLEK23QMF7MLA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD74MR3I#issuecomment-536398061, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TJS2HZP53YOH5C3KB3QMF7MLANCNFSM4IPPJ6AA .

maximilianh commented 4 years ago

You can also try it out now on the web inteface on crispor.org

On Mon, Sep 30, 2019 at 10:16 AM Maximilian Haeussler maximilianh@gmail.com wrote:

Sorry for the delay, I ran into similar problems as you did. No idea why NCBI includes invalid chrome in their GFF file, I used the old hg38 .bed instead. Also, I've fixed the python version now in these files. Do you want to use the result: http://crispor.tefor.net/genomes/hg38anset/

On Mon, Sep 30, 2019 at 6:40 AM Evgeni Frenkel notifications@github.com wrote:

Progress and remaining questions on using tools/crisprAddGenome to make a version of hg38 without alt loci:

1.

changing the shebangs in bedOverlapMerge and bedBetween to specify python2 environment resolves the print errors 2.

copying bed.py from tools/usrLocalBin to bin/ resolves the "import bed... Import error" arising during execution of bedBetween (incidentally, copying other files from usrLocalBin to bin caused errors) 3.

the gff file for the "analysis" hg38 (GCA_000001405.15_GRCh38_no_alt_analysis_set.fna) has references to the alt loci and these cause KeyError in bedBetween. I removed lines in the gff file that referred to chromosomes not found in the analysis-version hg38, as well as ChrM from both, and that resolved this error. 4.

However, yet another problem arose, apparently having to do with permission to make directories in /var/www. Trying to resolve this by changing the variable crisprGenomeDir in crisprAddGenome from "/var/www/crispor/genomes/" to one of my home directories.

INFO:root:Running: ['/bin/bash', '-e', '-o', 'pipefail', '-c', 'cat /tmp/hg38altsFiltered/hg38altsFiltered.gp | awk 'BEGIN {FS=OFS="\t";} // { gsub(" " , "", $1); gsub(" ", "", $12); print }' | genePredToBed stdin stdout | bedToExons stdin stdout | sort -u | bedSort stdin stdout | bedOverlapMerge /dev/stdin /dev/stdout | bedBetween stdin /dev/stdout -a -s /tmp/hg38altsFiltered/hg38altsFiltered.sizes | bedSort stdin /tmp/hg38altsFiltered/hg38altsFiltered.segments.bed'] Reading beds ... Reading stdin... 307679 features read 307748 features written 0 genes/exons were dropped because one includes the other Traceback (most recent call last): File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 1008, in getFastaAndGff(genomeRows, options.gffFname, options) File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 852, in getFastaAndGff moveToDest(outFnames, finalDbDir) File "/lab/solexa_sabatini/genya/crisporWebsite6/tools/crisprAddGenome", line 276, in moveToDest os.makedirs(finalDbDir) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 150, in makedirs makedirs(head, mode) File "/lab/solexa_sabatini/genya/crisporWebsite6/lib/python2.7/os.py", line 157, in makedirs mkdir(name, mode) OSError: [Errno 13] Permission denied: '/var/www'

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TNOTXTK3VA65ZLEK23QMF7MLA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD74MR3I#issuecomment-536398061, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TJS2HZP53YOH5C3KB3QMF7MLANCNFSM4IPPJ6AA .

genya commented 4 years ago

yes this seems to work -- thank you so much! I was running in to more directory issues (eg "cd /data/www/crispor && ./makeGenomeInfo: line 0: cd: /data/www/crispor: No such file or directory"). I'll confirm that it works for genome-wide library construction.

maximilianh commented 4 years ago

Great! If it works for the difficult regions, we can close this issue.

On Mon, Sep 30, 2019 at 1:00 PM Evgeni Frenkel notifications@github.com wrote:

yes this seems to work -- thank you so much! I was running in to more directory issues (eg "cd /data/www/crispor && ./makeGenomeInfo: line 0: cd: /data/www/crispor: No such file or directory"). I'll confirm that it works for genome-wide library construction.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TNVZWUF57OOI3NPAIDQMHL5PA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD75IMHQ#issuecomment-536512030, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TOO5AQOHW3PLDHHWQTQMHL5PANCNFSM4IPPJ6AA .

genya commented 4 years ago

Regions like the one pasted below still crash the Lindel calculation, presumably due to ambiguous alignment issues and resulting inability to retrieve flanking sequence. On the crispor website however, inputting the chromosomal range chr16:28458366-28458441 instead of the sequence allows for calculation of all the scores. Do you plan to add chromosomal range as an accepted input for command-line CRISPOR? With this functionality, then really everything would work (or if the Lindel calculation could return something like NotEnoughFlankSequence instead of crashing), but as is, there are no crashes as long as the Lindel score is not calculated and so everything basically works.

The input >ENSG00000261832 AC138894.1 exon ENST00000637378.1_8 range=chr16:28458366-28458441 strand=- tcatgtgttggctttttcagATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGgtaagtttgtttgttttcct

produces: INFO:root:Progress 6XB0kHxeOALd1WjknpdD - outcome - Calculating editing outcomes Error: No PAM sequence found. Please check your sequence and try again ['TATTTGGGGAATACTGAGCCCAGCAGAACCCAATGGTGAAAAGCAACAGAAAGAACTTCA', u'CAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCC', u'AAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCA', u'TTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAA', u'TCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAAC', 'GAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCAA', 'TGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCA', 'ATGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCC', u'CTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTT', u'TGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAA', u'CAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAA', u'CGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAG', u'GGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGC', 'AATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAAG', 'GAATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAA', u'TTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTA', u'TGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTAT', u'TGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCA', u'ATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAG', u'TGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGT', u'GTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTA', u'GATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATT', u'ATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCG', u'TTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCGG'] ['s45-', 's49+', 's50+', 's59+', 's60+', 's72-', 's73-', 's74-', 's88+', 's92+', 's119+', 's125+', 's126+', 's164-', 's165-', 's168+', 's169+', 's175+', 's178+', 's179+', 's180+', 's186+', 's189+', 's190+'] Traceback (most recent call last): File "crispor.py", line 8305, in main() File "crispor.py", line 8302, in main mainCommandLine() File "crispor.py", line 8111, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4301, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3846, in processSubmission createBatchEffScoreTable(batchId, queue) File "crispor.py", line 3462, in createBatchEffScoreTable guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue) File "crispor.py", line 3403, in calcSaveEffScores mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz, scoreNames=mutScoreNames) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 1311, in calcMutSeqs mutSeqDict = calcLindelScore(seqIds, seqs) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 748, in calcLindelScore return runLindel(seqIds, trimSeqs(seqs, -33, 27)) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 724, in runLindel y_hat, fs = Lindel.Predictor.gen_prediction(seq,weights,prerequesites) ValueError: too many values to unpack

which is fixed by changing spCas9MutScoreNames = ["oof", 'lindel'] # lindel is only added for spCas9 in crispor.py to spCas9MutScoreNames = ["oof"]

maximilianh commented 4 years ago

Ohhh... this is again a problem in bwa, I think.

BWASW finds the best match at 'tmp', '0', 'chr16', '29050652', '0', '76M', '', '0', '0', 'TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT', '', 'AS:i:72', 'XS:i:76', 'XF:i:1', 'XE:i:1', 'NM:i:1']

But that match has a single mismatch.

Your input sequence has multiple perfect matches. They are, among others 28458366 28458441 (negative strand) or 28656613 28656688.

You can see what exactly is happening when you run it with -d.

I'm not sure what I can do here... update BWA? file a bug with BWA? Use a different BWA version or use BWA-ALN or BWA-MEM instead of BWASW... Hm.

On Mon, Sep 30, 2019 at 2:24 PM Evgeni Frenkel notifications@github.com wrote:

Regions like the one pasted below still crash the Lindel calculation, presumably due to ambiguous alignment issues and resulting inability to retrieve flanking sequence. On the crispor website however, inputting the chromosomal range chr16:28458366-28458441 instead of the sequence allows for calculation of all the scores. If this functionality were part of command-line CRISPOR, then really everything would work (or if the Lindel calculation could return something like NotEnoughFlankSequence instead of crashing), but as is, there are no crashes as long as the Lindel score is not calculated and so everything basically works.

The input

ENSG00000261832 AC138894.1 exon ENST00000637378.1_8 range=chr16:28458366-28458441 strand=- tcatgtgttggctttttcagATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGgtaagtttgtttgttttcct

produces: INFO:root:Progress 6XB0kHxeOALd1WjknpdD - outcome - Calculating editing outcomes Error: No PAM sequence found. Please check your sequence and try again ['TATTTGGGGAATACTGAGCCCAGCAGAACCCAATGGTGAAAAGCAACAGAAAGAACTTCA', u'CAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCC', u'AAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCA', u'TTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAA', u'TCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAAC', 'GAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCAA', 'TGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCA', 'ATGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCC', u'CTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTT', u'TGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAA', u'CAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAA', u'CGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAG', u'GGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGC', 'AATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAAG', 'GAATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAA', u'TTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTA', u'TGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTAT', u'TGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCA', u'ATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAG', u'TGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGT', u'GTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTA', u'GATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATT', u'ATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCG', u'TTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCGG'] ['s45-', 's49+', 's50+', 's59+', 's60+', 's72-', 's73-', 's74-', 's88+', 's92+', 's119+', 's125+', 's126+', 's164-', 's165-', 's168+', 's169+', 's175+', 's178+', 's179+', 's180+', 's186+', 's189+', 's190+'] Traceback (most recent call last): File "crispor.py", line 8305, in main() File "crispor.py", line 8302, in main mainCommandLine() File "crispor.py", line 8111, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4301, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3846, in processSubmission createBatchEffScoreTable(batchId, queue) File "crispor.py", line 3462, in createBatchEffScoreTable guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue) File "crispor.py", line 3403, in calcSaveEffScores mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz, scoreNames=mutScoreNames) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 1311, in calcMutSeqs mutSeqDict = calcLindelScore(seqIds, seqs) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 748, in calcLindelScore return runLindel(seqIds, trimSeqs(seqs, -33, 27)) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 724, in runLindel y_hat, fs = Lindel.Predictor.gen_prediction(seq,weights,prerequesites) ValueError: too many values to unpack

which is fixed by changing spCas9MutScoreNames = ["oof", 'lindel'] # lindel is only added for spCas9 in crispor.py to spCas9MutScoreNames = ["oof"]

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

genya commented 4 years ago

these regions obviously do not contain any good guides (in terms of specificity) so it doesn't really matter for the efficiency scores to be calculated for them. The main problem is that they crash crispor, specifically in the Lindel calculation. In my case, I feed crispor large batches of exons and the whole process crashes if one of these problematic sequences is in the batch. One solution on my end would be to write a script to identify them and filter them out, which I may do, but within CRISPOR, the possible solutions are: 1) modify the Lindel calculator so that it returns NotEnoughFlankingSequence instead of crashing when it can't retrieve flanking sequence (the doench score calculator does this) 2) allow for chromosomal range inputs, in which case the flanking sequence would be unambiguous

maximilianh commented 4 years ago

I've added a .bed input option now so you can work around this at least. Instead of input.fa you can provide input.bed and it'll grab the sequences from these locations and since

On Mon, Sep 30, 2019 at 2:54 PM Maximilian Haeussler maximilianh@gmail.com wrote:

Ohhh... this is again a problem in bwa, I think.

BWASW finds the best match at 'tmp', '0', 'chr16', '29050652', '0', '76M', '', '0', '0', 'TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT', '', 'AS:i:72', 'XS:i:76', 'XF:i:1', 'XE:i:1', 'NM:i:1']

But that match has a single mismatch.

Your input sequence has multiple perfect matches. They are, among others 28458366 28458441 (negative strand) or 28656613 28656688.

You can see what exactly is happening when you run it with -d.

I'm not sure what I can do here... update BWA? file a bug with BWA? Use a different BWA version or use BWA-ALN or BWA-MEM instead of BWASW... Hm.

On Mon, Sep 30, 2019 at 2:24 PM Evgeni Frenkel notifications@github.com wrote:

Regions like the one pasted below still crash the Lindel calculation, presumably due to ambiguous alignment issues and resulting inability to retrieve flanking sequence. On the crispor website however, inputting the chromosomal range chr16:28458366-28458441 instead of the sequence allows for calculation of all the scores. If this functionality were part of command-line CRISPOR, then really everything would work (or if the Lindel calculation could return something like NotEnoughFlankSequence instead of crashing), but as is, there are no crashes as long as the Lindel score is not calculated and so everything basically works.

The input

ENSG00000261832 AC138894.1 exon ENST00000637378.1_8 range=chr16:28458366-28458441 strand=- tcatgtgttggctttttcagATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGgtaagtttgtttgttttcct

produces: INFO:root:Progress 6XB0kHxeOALd1WjknpdD - outcome - Calculating editing outcomes Error: No PAM sequence found. Please check your sequence and try again ['TATTTGGGGAATACTGAGCCCAGCAGAACCCAATGGTGAAAAGCAACAGAAAGAACTTCA', u'CAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCC', u'AAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCA', u'TTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAA', u'TCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAAC', 'GAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCAA', 'TGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCA', 'ATGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCC', u'CTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTT', u'TGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAA', u'CAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAA', u'CGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAG', u'GGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGC', 'AATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAAG', 'GAATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAA', u'TTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTA', u'TGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTAT', u'TGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCA', u'ATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAG', u'TGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGT', u'GTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTA', u'GATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATT', u'ATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCG', u'TTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCGG'] ['s45-', 's49+', 's50+', 's59+', 's60+', 's72-', 's73-', 's74-', 's88+', 's92+', 's119+', 's125+', 's126+', 's164-', 's165-', 's168+', 's169+', 's175+', 's178+', 's179+', 's180+', 's186+', 's189+', 's190+'] Traceback (most recent call last): File "crispor.py", line 8305, in main() File "crispor.py", line 8302, in main mainCommandLine() File "crispor.py", line 8111, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4301, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3846, in processSubmission createBatchEffScoreTable(batchId, queue) File "crispor.py", line 3462, in createBatchEffScoreTable guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue) File "crispor.py", line 3403, in calcSaveEffScores mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz, scoreNames=mutScoreNames) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 1311, in calcMutSeqs mutSeqDict = calcLindelScore(seqIds, seqs) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 748, in calcLindelScore return runLindel(seqIds, trimSeqs(seqs, -33, 27)) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 724, in runLindel y_hat, fs = Lindel.Predictor.gen_prediction(seq,weights,prerequesites) ValueError: too many values to unpack

which is fixed by changing spCas9MutScoreNames = ["oof", 'lindel'] # lindel is only added for spCas9 in crispor.py to spCas9MutScoreNames = ["oof"]

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

genya commented 4 years ago

that sounds great, thank you! I'll try it this evening.

maximilianh commented 4 years ago

I've fixed the Lindel interface and your input sequence is working now. However, it's still placed at the wrong location, unless you use the .bed input, as I'm not sure how to work around the BWA bug yet.

On Mon, Sep 30, 2019 at 3:23 PM Maximilian Haeussler maximilianh@gmail.com wrote:

I've added a .bed input option now so you can work around this at least. Instead of input.fa you can provide input.bed and it'll grab the sequences from these locations and since

On Mon, Sep 30, 2019 at 2:54 PM Maximilian Haeussler maximilianh@gmail.com wrote:

Ohhh... this is again a problem in bwa, I think.

BWASW finds the best match at 'tmp', '0', 'chr16', '29050652', '0', '76M', '', '0', '0', 'TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT', '', 'AS:i:72', 'XS:i:76', 'XF:i:1', 'XE:i:1', 'NM:i:1']

But that match has a single mismatch.

Your input sequence has multiple perfect matches. They are, among others 28458366 28458441 (negative strand) or 28656613 28656688.

You can see what exactly is happening when you run it with -d.

I'm not sure what I can do here... update BWA? file a bug with BWA? Use a different BWA version or use BWA-ALN or BWA-MEM instead of BWASW... Hm.

On Mon, Sep 30, 2019 at 2:24 PM Evgeni Frenkel notifications@github.com wrote:

Regions like the one pasted below still crash the Lindel calculation, presumably due to ambiguous alignment issues and resulting inability to retrieve flanking sequence. On the crispor website however, inputting the chromosomal range chr16:28458366-28458441 instead of the sequence allows for calculation of all the scores. If this functionality were part of command-line CRISPOR, then really everything would work (or if the Lindel calculation could return something like NotEnoughFlankSequence instead of crashing), but as is, there are no crashes as long as the Lindel score is not calculated and so everything basically works.

The input

ENSG00000261832 AC138894.1 exon ENST00000637378.1_8 range=chr16:28458366-28458441 strand=- tcatgtgttggctttttcagATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGgtaagtttgtttgttttcct

produces: INFO:root:Progress 6XB0kHxeOALd1WjknpdD - outcome - Calculating editing outcomes Error: No PAM sequence found. Please check your sequence and try again ['TATTTGGGGAATACTGAGCCCAGCAGAACCCAATGGTGAAAAGCAACAGAAAGAACTTCA', u'CAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCC', u'AAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCA', u'TTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAA', u'TCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAAC', 'GAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCAA', 'TGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCA', 'ATGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCC', u'CTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTT', u'TGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAA', u'CAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAA', u'CGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAG', u'GGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGC', 'AATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAAG', 'GAATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAA', u'TTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTA', u'TGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTAT', u'TGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCA', u'ATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAG', u'TGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGT', u'GTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTA', u'GATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATT', u'ATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCG', u'TTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCGG'] ['s45-', 's49+', 's50+', 's59+', 's60+', 's72-', 's73-', 's74-', 's88+', 's92+', 's119+', 's125+', 's126+', 's164-', 's165-', 's168+', 's169+', 's175+', 's178+', 's179+', 's180+', 's186+', 's189+', 's190+'] Traceback (most recent call last): File "crispor.py", line 8305, in main() File "crispor.py", line 8302, in main mainCommandLine() File "crispor.py", line 8111, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4301, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3846, in processSubmission createBatchEffScoreTable(batchId, queue) File "crispor.py", line 3462, in createBatchEffScoreTable guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue) File "crispor.py", line 3403, in calcSaveEffScores mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz, scoreNames=mutScoreNames) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 1311, in calcMutSeqs mutSeqDict = calcLindelScore(seqIds, seqs) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 748, in calcLindelScore return runLindel(seqIds, trimSeqs(seqs, -33, 27)) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 724, in runLindel y_hat, fs = Lindel.Predictor.gen_prediction(seq,weights,prerequesites) ValueError: too many values to unpack

which is fixed by changing spCas9MutScoreNames = ["oof", 'lindel'] # lindel is only added for spCas9 in crispor.py to spCas9MutScoreNames = ["oof"]

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

maximilianh commented 4 years ago

Can you do a git pull and try again?

On Mon, Sep 30, 2019 at 3:28 PM Maximilian Haeussler maximilianh@gmail.com wrote:

I've fixed the Lindel interface and your input sequence is working now. However, it's still placed at the wrong location, unless you use the .bed input, as I'm not sure how to work around the BWA bug yet.

On Mon, Sep 30, 2019 at 3:23 PM Maximilian Haeussler maximilianh@gmail.com wrote:

I've added a .bed input option now so you can work around this at least. Instead of input.fa you can provide input.bed and it'll grab the sequences from these locations and since

On Mon, Sep 30, 2019 at 2:54 PM Maximilian Haeussler maximilianh@gmail.com wrote:

Ohhh... this is again a problem in bwa, I think.

BWASW finds the best match at 'tmp', '0', 'chr16', '29050652', '0', '76M', '', '0', '0', 'TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT', '', 'AS:i:72', 'XS:i:76', 'XF:i:1', 'XE:i:1', 'NM:i:1']

But that match has a single mismatch.

Your input sequence has multiple perfect matches. They are, among others 28458366 28458441 (negative strand) or 28656613 28656688.

You can see what exactly is happening when you run it with -d.

I'm not sure what I can do here... update BWA? file a bug with BWA? Use a different BWA version or use BWA-ALN or BWA-MEM instead of BWASW... Hm.

On Mon, Sep 30, 2019 at 2:24 PM Evgeni Frenkel notifications@github.com wrote:

Regions like the one pasted below still crash the Lindel calculation, presumably due to ambiguous alignment issues and resulting inability to retrieve flanking sequence. On the crispor website however, inputting the chromosomal range chr16:28458366-28458441 instead of the sequence allows for calculation of all the scores. If this functionality were part of command-line CRISPOR, then really everything would work (or if the Lindel calculation could return something like NotEnoughFlankSequence instead of crashing), but as is, there are no crashes as long as the Lindel score is not calculated and so everything basically works.

The input

ENSG00000261832 AC138894.1 exon ENST00000637378.1_8 range=chr16:28458366-28458441 strand=- tcatgtgttggctttttcagATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGgtaagtttgtttgttttcct

produces: INFO:root:Progress 6XB0kHxeOALd1WjknpdD - outcome - Calculating editing outcomes Error: No PAM sequence found. Please check your sequence and try again ['TATTTGGGGAATACTGAGCCCAGCAGAACCCAATGGTGAAAAGCAACAGAAAGAACTTCA', u'CAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCC', u'AAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCA', u'TTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAA', u'TCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAAC', 'GAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCAA', 'TGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCCA', 'ATGAACAATAGATGTCCGTCCTTGTTGTGTATTTGGGGAATACTGAGCCCAGCAGAACCC', u'CTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTT', u'TGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAA', u'CAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAA', u'CGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAG', u'GGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGC', 'AATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAAG', 'GAATCATACCCACCTGAACCCCTCCAAATCCCTTCGGAGCTAAATATCGCTCACATTCAA', u'TTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTA', u'TGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTAT', u'TGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCA', u'ATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAG', u'TGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGT', u'GTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTA', u'GATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATT', u'ATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCG', u'TTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTATCAATTGCGG'] ['s45-', 's49+', 's50+', 's59+', 's60+', 's72-', 's73-', 's74-', 's88+', 's92+', 's119+', 's125+', 's126+', 's164-', 's165-', 's168+', 's169+', 's175+', 's178+', 's179+', 's180+', 's186+', 's189+', 's190+'] Traceback (most recent call last): File "crispor.py", line 8305, in main() File "crispor.py", line 8302, in main mainCommandLine() File "crispor.py", line 8111, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4301, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3846, in processSubmission createBatchEffScoreTable(batchId, queue) File "crispor.py", line 3462, in createBatchEffScoreTable guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue) File "crispor.py", line 3403, in calcSaveEffScores mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz, scoreNames=mutScoreNames) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 1311, in calcMutSeqs mutSeqDict = calcLindelScore(seqIds, seqs) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 748, in calcLindelScore return runLindel(seqIds, trimSeqs(seqs, -33, 27)) File "/lab/solexa_sabatini/genya/crisporWebsite6/crisporEffScores.py", line 724, in runLindel y_hat, fs = Lindel.Predictor.gen_prediction(seq,weights,prerequesites) ValueError: too many values to unpack

which is fixed by changing spCas9MutScoreNames = ["oof", 'lindel'] # lindel is only added for spCas9 in crispor.py to spCas9MutScoreNames = ["oof"]

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

genya commented 4 years ago

Thank you! Chromosomal ranges inputted as bedfiles work. However you mentioned that the sequence input now works, but I still get that crispor crashes as follows:

INFO:root: * running on sequence 'ENSG00000240038 AMY2B exon ENST00000361355.8_0,ENST00000610648.1_0 range=chr1:103571583-103571790 strand=+', guideLen=20, seqLen=208 INFO:root:Progress 6XB0kHxeOALd1WjknpdD - bwasw - Searching genome for one 100% identical match to input sequence [M::bwa_idx_load_from_disk] read 0 ALT contigs [bsw2_aln] read 1 sequences/pairs (208 bp) ... [main] Version: 0.7.15-r1140 [main] CMD: /lab/solexa_sabatini/genya/CRISPOR3/bin/Linux/bwa bwasw -T 20 /lab/solexa_sabatini/genya/CRISPOR3/genomes/hg38anset/hg38anset.fa /tmp/crisporBestMatch7bgNVY.fa [main] Real time: 3.166 sec; CPU: 3.165 sec WARNING:root:Input sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCCAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT WARNING:root:Genome sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT Traceback (most recent call last): File "crispor.py", line 8385, in main() File "crispor.py", line 8382, in main mainCommandLine() File "crispor.py", line 8191, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4334, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3853, in processSubmission extSeq = extendAndGetSeq(genome, chrom, start, end, strand, batchInfo["seq"]) File "crispor.py", line 1981, in extendAndGetSeq logging.warn("Diff String : %s" % highlightMismatches(oldSeq, genomeSeq, 0)) File "crispor.py", line 1535, in highlightMismatches assert(len(guide)==len(offTarget)) AssertionError

maximilianh commented 4 years ago

Ah, I just introduced this one! Thanks!

I've added a huge warning message now in command line mode, you can see how this problem happens. However, I still haven't found a reason why BWASW misplaces this sequence. BWASW has caused me a lot of grey hair over the last years. I should really get rid of it...

On Tue, Oct 1, 2019 at 3:33 AM Evgeni Frenkel notifications@github.com wrote:

Thank you! Chromosomal ranges inputted as bedfiles work. However you mentioned that the sequence input now works, but I still get that crispor crashes as follows:

INFO:root: * running on sequence 'ENSG00000240038 AMY2B exon ENST00000361355.8_0,ENST00000610648.1_0 range=chr1:103571583-103571790 strand=+', guideLen=20, seqLen=208 INFO:root:Progress 6XB0kHxeOALd1WjknpdD - bwasw - Searching genome for one 100% identical match to input sequence [M::bwa_idx_load_from_disk] read 0 ALT contigs [bsw2_aln] read 1 sequences/pairs (208 bp) ... [main] Version: 0.7.15-r1140 [main] CMD: /lab/solexa_sabatini/genya/CRISPOR3/bin/Linux/bwa bwasw -T 20 /lab/solexa_sabatini/genya/CRISPOR3/genomes/hg38anset/hg38anset.fa /tmp/crisporBestMatch7bgNVY.fa [main] Real time: 3.166 sec; CPU: 3.165 sec WARNING:root:Input sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCCAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT WARNING:root:Genome sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT Traceback (most recent call last): File "crispor.py", line 8385, in main() File "crispor.py", line 8382, in main mainCommandLine() File "crispor.py", line 8191, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4334, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3853, in processSubmission extSeq = extendAndGetSeq(genome, chrom, start, end, strand, batchInfo["seq"]) File "crispor.py", line 1981, in extendAndGetSeq logging.warn("Diff String : %s" % highlightMismatches(oldSeq, genomeSeq, 0)) File "crispor.py", line 1535, in highlightMismatches assert(len(guide)==len(offTarget)) AssertionError

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TL3FVV6PAXZYL3DAUTQMKSIDA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD77TSTA#issuecomment-536820044, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TNLZMSROEYD3UDUUPDQMKSIDANCNFSM4IPPJ6AA .

maximilianh commented 4 years ago

I meant to write: "so you can see how this problem happens", the output now looks like this:

DEBUG:root:Found 1 best matches, 1 on non-alts. matches: [('chr16', 29050651, 29050727, '+')], best match ('chr16', 29050651, 29050727, '+') WARNING:root:Input sequence has SNPs compared to genome, not returning extended seq: WARNING:root:- Input sequence: TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT WARNING:root:- Genome sequence: TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGGCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT WARNING:root:- Diff String : .........................................*..................................

I also manually checked your input sequence with this BWA command:

$ bwa bwasw -T 20 /data/www/crisporBeta/genomes/hg38anset/hg38anset.fa evgeni.fa > /tmp/temp.sam

and BWASW finds really just a single match:

tmp 0 chr16 29050652 0 76M * 0 0

TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT

AS:i:72 XS:i:76 XF:i:1 XE:i:1 NM:i:1

BUT and this is really weird, if I increase the mismatch penalty, it seems to work!

bwa bwasw -b 4 /data/www/crisporBeta/genomes/hg38anset/hg38anset.fa test.fa

finds:

test 16 chr16 28344528 0 76M * 0 0

aggaaaacaaacaaacttacCCAGTTGCAAAGAGGCTTTCTTGCAGAAGGGGGGATctgaaaaagccaacacatga

AS:i:76 XS:i:76 XF:i:3 XE:i:0 NM:i:0

What I find annoying that BWASW still doesn't find all matches, so it won't be able to detect if your target sequence matches somewhere else. Not ideal.

But it seems good to force it to avoid all gaps and mismatches, e.g. by setting -b 100 -q 100 -r 100

On Tue, Oct 1, 2019 at 11:19 AM Maximilian Haeussler maximilianh@gmail.com wrote:

Ah, I just introduced this one! Thanks!

I've added a huge warning message now in command line mode, you can see how this problem happens. However, I still haven't found a reason why BWASW misplaces this sequence. BWASW has caused me a lot of grey hair over the last years. I should really get rid of it...

On Tue, Oct 1, 2019 at 3:33 AM Evgeni Frenkel notifications@github.com wrote:

Thank you! Chromosomal ranges inputted as bedfiles work. However you mentioned that the sequence input now works, but I still get that crispor crashes as follows:

INFO:root: * running on sequence 'ENSG00000240038 AMY2B exon ENST00000361355.8_0,ENST00000610648.1_0 range=chr1:103571583-103571790 strand=+', guideLen=20, seqLen=208 INFO:root:Progress 6XB0kHxeOALd1WjknpdD - bwasw - Searching genome for one 100% identical match to input sequence [M::bwa_idx_load_from_disk] read 0 ALT contigs [bsw2_aln] read 1 sequences/pairs (208 bp) ... [main] Version: 0.7.15-r1140 [main] CMD: /lab/solexa_sabatini/genya/CRISPOR3/bin/Linux/bwa bwasw -T 20 /lab/solexa_sabatini/genya/CRISPOR3/genomes/hg38anset/hg38anset.fa /tmp/crisporBestMatch7bgNVY.fa [main] Real time: 3.166 sec; CPU: 3.165 sec WARNING:root:Input sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCCAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT WARNING:root:Genome sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT Traceback (most recent call last): File "crispor.py", line 8385, in main() File "crispor.py", line 8382, in main mainCommandLine() File "crispor.py", line 8191, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4334, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3853, in processSubmission extSeq = extendAndGetSeq(genome, chrom, start, end, strand, batchInfo["seq"]) File "crispor.py", line 1981, in extendAndGetSeq logging.warn("Diff String : %s" % highlightMismatches(oldSeq, genomeSeq, 0)) File "crispor.py", line 1535, in highlightMismatches assert(len(guide)==len(offTarget)) AssertionError

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TL3FVV6PAXZYL3DAUTQMKSIDA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD77TSTA#issuecomment-536820044, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TNLZMSROEYD3UDUUPDQMKSIDANCNFSM4IPPJ6AA .

maximilianh commented 4 years ago

confirmed that this still happens with the current bwa master branch. 0.7.17

On Tue, Oct 1, 2019 at 11:31 AM Maximilian Haeussler maximilianh@gmail.com wrote:

I meant to write: "so you can see how this problem happens", the output now looks like this:

DEBUG:root:Found 1 best matches, 1 on non-alts. matches: [('chr16', 29050651, 29050727, '+')], best match ('chr16', 29050651, 29050727, '+') WARNING:root:Input sequence has SNPs compared to genome, not returning extended seq: WARNING:root:- Input sequence: TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT WARNING:root:- Genome sequence: TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGGCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT WARNING:root:- Diff String : .........................................*..................................

I also manually checked your input sequence with this BWA command:

$ bwa bwasw -T 20 /data/www/crisporBeta/genomes/hg38anset/hg38anset.fa evgeni.fa > /tmp/temp.sam

and BWASW finds really just a single match:

tmp 0 chr16 29050652 0 76M 0 0 TCATGTGTTGGCTTTTTCAGATCCCCCCTTCTGCAAGAAAGCCTCTTTGCAACTGGGTAAGTTTGTTTGTTTTCCT AS:i:72 XS:i:76 XF:i:1 XE:i:1 NM:i:1

BUT and this is really weird, if I increase the mismatch penalty, it seems to work!

bwa bwasw -b 4 /data/www/crisporBeta/genomes/hg38anset/hg38anset.fa test.fa

finds:

test 16 chr16 28344528 0 76M 0 0 aggaaaacaaacaaacttacCCAGTTGCAAAGAGGCTTTCTTGCAGAAGGGGGGATctgaaaaagccaacacatga AS:i:76 XS:i:76 XF:i:3 XE:i:0 NM:i:0

What I find annoying that BWASW still doesn't find all matches, so it won't be able to detect if your target sequence matches somewhere else. Not ideal.

But it seems good to force it to avoid all gaps and mismatches, e.g. by setting -b 100 -q 100 -r 100

On Tue, Oct 1, 2019 at 11:19 AM Maximilian Haeussler maximilianh@gmail.com wrote:

Ah, I just introduced this one! Thanks!

I've added a huge warning message now in command line mode, you can see how this problem happens. However, I still haven't found a reason why BWASW misplaces this sequence. BWASW has caused me a lot of grey hair over the last years. I should really get rid of it...

On Tue, Oct 1, 2019 at 3:33 AM Evgeni Frenkel notifications@github.com wrote:

Thank you! Chromosomal ranges inputted as bedfiles work. However you mentioned that the sequence input now works, but I still get that crispor crashes as follows:

INFO:root: * running on sequence 'ENSG00000240038 AMY2B exon ENST00000361355.8_0,ENST00000610648.1_0 range=chr1:103571583-103571790 strand=+', guideLen=20, seqLen=208 INFO:root:Progress 6XB0kHxeOALd1WjknpdD - bwasw - Searching genome for one 100% identical match to input sequence [M::bwa_idx_load_from_disk] read 0 ALT contigs [bsw2_aln] read 1 sequences/pairs (208 bp) ... [main] Version: 0.7.15-r1140 [main] CMD: /lab/solexa_sabatini/genya/CRISPOR3/bin/Linux/bwa bwasw -T 20 /lab/solexa_sabatini/genya/CRISPOR3/genomes/hg38anset/hg38anset.fa /tmp/crisporBestMatch7bgNVY.fa [main] Real time: 3.166 sec; CPU: 3.165 sec WARNING:root:Input sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCCAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT WARNING:root:Genome sequence: ACTGACAACTTCAAAGCAAAATGAAGTTCTTTCTGTTGCTTTTCACCATTGGGTTCTGCTGGGCTCAGTATTCCCCAAATACACAACAAGGACGGACATCTATTGTTCATCTGTTTGAATGGCGATGGGTTGATATTGCTCTTGAATGTGAGCGATATTTAGCTCCGAAGGGATTTGGAGGGGTTCAGGTGGGTATGATTCATAGTAT Traceback (most recent call last): File "crispor.py", line 8385, in main() File "crispor.py", line 8382, in main mainCommandLine() File "crispor.py", line 8191, in mainCommandLine getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue()) File "crispor.py", line 4334, in getOfftargets processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue) File "crispor.py", line 3853, in processSubmission extSeq = extendAndGetSeq(genome, chrom, start, end, strand, batchInfo["seq"]) File "crispor.py", line 1981, in extendAndGetSeq logging.warn("Diff String : %s" % highlightMismatches(oldSeq, genomeSeq, 0)) File "crispor.py", line 1535, in highlightMismatches assert(len(guide)==len(offTarget)) AssertionError

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

genya commented 4 years ago

thank you so much for debugging this. It sounds like I should recode my pipeline to use bedfile chromosomal ranges as input and that avoid these issues -- if I understand correctly?

maximilianh commented 4 years ago

This would be safer, yes. You wouldn't need the (apparent) somewhat problematic step of running bwasw anymore.

Note that UCSC has just completed a full-genome run of crispor on hg38. I'm not sure if this is helpful, but you can find the bigBed file on the UCSC genome browser (I believe). Otherwise we can ask them where it is.

maximilianh commented 4 years ago

Oh using the BED file should also be a lot faster. BWASW is not very fast. In general, I have the impression I should avoid using BWASW and replace it with... I have no idea... bwa-mem maybe, I don't know. BLAT would work, but uses way too much memory.

On Tue, Oct 1, 2019 at 2:57 PM Maximilian Haeussler maximilianh@gmail.com wrote:

This would be safer, yes. You wouldn't need the (apparent) somewhat problematic step of running bwasw anymore.

Note that UCSC has just completed a full-genome run of crispor on hg38. I'm not sure if this is helpful, but you can find the bigBed file on the UCSC genome browser (I believe). Otherwise we can ask them where it is.

maximilianh commented 4 years ago

for your particular application, why didn't you use flashfry ? Wouldn't that be a lot faster?

I've never optimized crispor for bulk use. I could have, but never found the time. That being said, it's quite fast, if you put the genome onto /dev/shm and use bedtools.

On Tue, Oct 1, 2019 at 2:58 PM Maximilian Haeussler maximilianh@gmail.com wrote:

Oh using the BED file should also be a lot faster. BWASW is not very fast. In general, I have the impression I should avoid using BWASW and replace it with... I have no idea... bwa-mem maybe, I don't know. BLAT would work, but uses way too much memory.

On Tue, Oct 1, 2019 at 2:57 PM Maximilian Haeussler maximilianh@gmail.com wrote:

This would be safer, yes. You wouldn't need the (apparent) somewhat problematic step of running bwasw anymore.

Note that UCSC has just completed a full-genome run of crispor on hg38. I'm not sure if this is helpful, but you can find the bigBed file on the UCSC genome browser (I believe). Otherwise we can ask them where it is.

genya commented 4 years ago

I'm not really limited by speed, have access to a cluster and typically run several hundreds jobs in parallel. I liked that CRISPOR actually shows you the potential off-targets. I didn't see this functionality in FlashFry, last I checked.

maximilianh commented 4 years ago

That may be a reason why flashfry is quite a bit faster. If you don't have to write out all the off-targets that saves some speed? Idk. Maybe Aaron can chime in.

On Tue, Oct 1, 2019 at 3:28 PM Evgeni Frenkel notifications@github.com wrote:

I'm not really limited by speed, have access to a cluster and typically run several hundreds jobs in parallel. I liked that CRISPOR actually shows you the potential off-targets. I didn't see this functionality in FlashFry, last I checked.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

genya commented 4 years ago

for bed file inputs, I noticed a kind of inconsistency in the guideId column of the outputs. For example in the test case below, the first genomic range spans exactly 23bp of a protospacer + PAM on the forward strand. Its guideID is then 21forw, so I guess the index in guideId refers to the first base of the PAM. The second genomic range is also for 23bp of a protospacer + PAM but on the reverse strand and its guideId is 1rev, so I guess the index in guideId refers to whatever base of the PAM comes first in genomic coordinates rather than in terms of the guide sequence. It'd be more natural for the guideId to indicate the position of the same part of the guide regardless of strand.

bedfile input chr6 44307303 44307326 chr6 44307318 44307341

produces output

seqId guideId targetSeq mitSpecScore offtargetCount targetGenomeGeneLocus Doench '16-Score Moreno-Mateos-Score Out-of-Frame-Score Lindel-Score

chr6:44307303-44307326:+ 21forw GGATGTGGTCAGCCACCACGCGG 82 109 exon:AARS2/TMEM151B/RP11-444E17.6 77 57 63 75 GrafOK chr6:44307318-44307341:+ 1rev CACAGACACAGCGTACCGCGTGG 96 42 exon:AARS2/TMEM151B/RP11-444E17.6 75 50 61 88 GrafOK

maximilianh commented 4 years ago

Hm, I do agree with you, but I'm a bit hesitant to change this now, as this would mean that all old links to guides in these tables wouldn't work anymore...

On Sat, Oct 5, 2019 at 11:56 PM Evgeni Frenkel notifications@github.com wrote:

for bed file inputs, I noticed a kind of inconsistency in the guideId column of the outputs. For example in the test case below, the first genomic range spans exactly 23bp of a protospacer + PAM on the forward strand. Its guideID is then 21forw, so I guess the index in guideId refers to the first base of the PAM. The second genomic range is also for 23bp of a protospacer + PAM but on the reverse strand and its guideId is 1rev, so I guess the index in guideId refers to whatever base of the PAM comes first in genomic coordinates rather than in terms of the guide sequence. It'd be more natural for the guideId to indicate the position of the same part of the guide regardless of strand.

chr6 44307303 44307326 chr6 44307318 44307341

gives

seqId guideId targetSeq mitSpecScore offtargetCount targetGenomeGeneLocus

Doench '16-Score Moreno-Mateos-Score Out-of-Frame-Score Lindel-Score chr6:44307303-44307326:+ 21forw GGATGTGGTCAGCCACCACGCGG 82 109 exon:AARS2/TMEM151B/RP11-444E17.6 77 57 63 75 GrafOK chr6:44307318-44307341:+ 1rev CACAGACACAGCGTACCGCGTGG 96 42 exon:AARS2/TMEM151B/RP11-444E17.6 75 50 61 88 GrafOK

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TJ2R5PHHFYWR7DEMG3QNEERLA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAN4U5Q#issuecomment-538692214, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TKT2MPHJWIYX2V7MQLQNEERLANCNFSM4IPPJ6AA .

genya commented 4 years ago

I think it's ok as long as the documentation is clear about what the guideId values indicate, otherwise it's initially confusing

maximilianh commented 4 years ago

Hm...Is there any documentation about this ID? Where would you expect it? I think I only consider this some random ID, it's only for linking things together. Originally I didn't show it, but then the primers needed a name, so I exposed an ID that used to be internal...

On Mon, Oct 7, 2019 at 3:16 PM Evgeni Frenkel notifications@github.com wrote:

I think it's ok as long as the documentation is clear about what the guideId values indicate, otherwise it's initially confusing

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/maximilianh/crisporWebsite/issues/35?email_source=notifications&email_token=AACL4TJ3UONWHLQ5YL33HMLQNMZCDA5CNFSM4IPPJ6AKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAQJDUQ#issuecomment-539005394, or mute the thread https://github.com/notifications/unsubscribe-auth/AACL4TKEAGYDGCOZJ7Z6QZLQNMZCDANCNFSM4IPPJ6AA .

genya commented 4 years ago

Even though all the other columns of the output are self-explanatory, there could be a little section listing them and guideId could be mentioned and explained there.

maximilianh commented 4 years ago

Ah! I did document it : http://crispor.tefor.net/manual/#guidelist And yes, my bad, I didn't document that the position is on the reference genome.

Will do now. Thanks!

On Mon, Oct 7, 2019 at 3:57 PM Evgeni Frenkel notifications@github.com wrote:

Even though all the other columns of the output are self-explanatory, there could be a little section listing them and guideId could be mentioned and explained there.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

maximilianh commented 4 years ago

Added note to docs: http://crispor-beta.tefor.net/manual/#guidelist

On Mon, Oct 7, 2019 at 4:35 PM Maximilian Haeussler maximilianh@gmail.com wrote:

Ah! I did document it : http://crispor.tefor.net/manual/#guidelist And yes, my bad, I didn't document that the position is on the reference genome.

Will do now. Thanks!

On Mon, Oct 7, 2019 at 3:57 PM Evgeni Frenkel notifications@github.com wrote:

Even though all the other columns of the output are self-explanatory, there could be a little section listing them and guideId could be mentioned and explained there.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.