maximilianh / crisporWebsite

All source code of the crispor.org website
http://crispor.org

How to deal with the ambiguity character (W,S,M,K,R,Y) and Gap (at least one N) in exon sequences? #33

Closed tiramisutes closed 4 years ago

tiramisutes commented 5 years ago

Hello, genome assemblies commonly use an IUPAC ambiguity character at a position where more than one nucleotide could occur, and one or more Ns to represent a gap. What is the best way to deal with these cases?

Any help is much appreciated. Thanks.

Best regards

maximilianh commented 5 years ago

Sorry I don’t know what you mean. Your genome can include Ns. BWA is the aligner I use and it’ll handle it somehow. I think it won’t align against Ns and all other characters get mapped to N.

It’s really a very rare edge case. Given that CRISPR is hard to predict, I wouldn’t worry about little things with the genome; I’d rather worry about genome coverage or quality.

tiramisutes commented 5 years ago

What I mean is that some exon sequences contain ambiguity characters and gaps. I am asking because I get the following errors on stderr.

INFO:root:Progress BZISIXaV4kuHb7aeWtZF - genes - Annotating matches with genes
INFO:root:Progress BZISIXaV4kuHb7aeWtZF - done - Job completed
INFO:root: * running on sequence 'ENSRNA049451954-E1', guideLen=20, seqLen=87
INFO:root:Progress MKddanpWmlIsuBVqfVgG - bwasw - Searching genome for one 100% identical match to input sequence
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[bsw2_aln] read 1 sequences/pairs (87 bp) ...
[main] Version: 0.7.15-r1140
[main] CMD: /public/home/zpxu/software/crisporWebsite/bin/Linux/bwa bwasw -T 20 /public/home/zpxu/software/crisporWebsite/genomes/Arabidopsis_lyrata/Arabidopsis_lyrata.fa /tmp/crisporBestMatchYuE6jj.fa
[main] Real time: 0.146 sec; CPU: 0.136 sec
INFO:root:Progress MKddanpWmlIsuBVqfVgG - effScores - Calculating guide efficiency scores
INFO:root:Progress MKddanpWmlIsuBVqfVgG - outcome - Calculating editing outcomes
WARNING:root:guide GGATCCCATCAGAACTCCGGAGGTTAGCGTGCTTGGGCGAGAGTAGTACTAGGATGGNNN contains at least one N
WARNING:root:guide GATCCCATCAGAACTCCGGAGGTTAGCGTGCTTGGGCGAGAGTAGTACTAGGATGGNNNN contains at least one N
Traceback (most recent call last):
  File "crispor.py", line 8293, in <module>
    main()
  File "crispor.py", line 8291, in main
    mainCommandLine()
  File "crispor.py", line 8100, in mainCommandLine
    getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue())
  File "crispor.py", line 4295, in getOfftargets
    processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue)
  File "crispor.py", line 3835, in processSubmission
    createBatchEffScoreTable(batchId, queue)
  File "crispor.py", line 3454, in createBatchEffScoreTable
    guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue)
  File "crispor.py", line 3396, in calcSaveEffScores
    mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz)
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 1273, in calcMutSeqs
    mutSeqDict = calcLindelScore(seqIds, seqs)
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 710, in calcLindelScore
    return runLindel(seqIds, trimSeqs(seqs, -33, 27))
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 683, in runLindel
    assert(seq.count("N")<=3)
AssertionError

and

INFO:root:Progress lgK9ezrS7e0tPBRwcCfR - genes - Annotating matches with genes
INFO:root:Progress lgK9ezrS7e0tPBRwcCfR - done - Job completed
INFO:root: * running on sequence 'g27244.t1-E1', guideLen=20, seqLen=779
INFO:root:Progress uPdPmwvle4LXeM42NqAi - bwasw - Searching genome for one 100% identical match to input sequence
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[bsw2_aln] read 1 sequences/pairs (779 bp) ...
[main] Version: 0.7.15-r1140
[main] CMD: /public/home/zpxu/software/crisporWebsite/bin/Linux/bwa bwasw -T 20 /public/home/zpxu/software/crisporWebsite/genomes/Arabidopsis_halleri/Arabidopsis_halleri.fa /tmp/crisporBestMatchO85k43.fa
[main] Real time: 0.147 sec; CPU: 0.136 sec
INFO:root:Progress uPdPmwvle4LXeM42NqAi - effScores - Calculating guide efficiency scores
INFO:root:Progress uPdPmwvle4LXeM42NqAi - outcome - Calculating editing outcomes
Traceback (most recent call last):
  File "crispor.py", line 8293, in <module>
    main()
  File "crispor.py", line 8291, in main
    mainCommandLine()
  File "crispor.py", line 8100, in mainCommandLine
    getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue())
  File "crispor.py", line 4295, in getOfftargets
    processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue)
  File "crispor.py", line 3835, in processSubmission
    createBatchEffScoreTable(batchId, queue)
  File "crispor.py", line 3454, in createBatchEffScoreTable
    guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue)
  File "crispor.py", line 3396, in calcSaveEffScores
    mutScores = crisporEffScores.calcMutSeqs(pamIds, longSeqs, enz)
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 1273, in calcMutSeqs
    mutSeqDict = calcLindelScore(seqIds, seqs)
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 710, in calcLindelScore
    return runLindel(seqIds, trimSeqs(seqs, -33, 27))
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 688, in runLindel
    y_hat, fs = Lindel.Predictor.gen_prediction(seq,weights,prerequesites)
  File "/public/home/zpxu/software/crisporWebsite/bin/src/lindel/Lindel/Predictor.py", line 201, in gen_prediction
    raise Exception('Error: No NGG at position 33 (0-based). Guide: %s' % guide)
Exception: Error: No NGG at position 33 (0-based). Guide: TTGGTTTCATTTTCTCTAAT

maximilianh commented 5 years ago

OK, this is the first genome I have seen with that many Ns. What do you think should happen? Reject guides with more than a single N in them? Set their scores to 0?
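The two options raised here could be sketched roughly as follows. This is only an illustration written for Python 3: `count_n`, `reject_guides`, `score_or_zero` and the `max_n` threshold are hypothetical names, not functions or settings from crispor.py, and `score_func` stands in for whatever scorer would otherwise be called.

```python
def count_n(guide):
    """Count N characters in a guide, ignoring case."""
    return guide.upper().count("N")


def reject_guides(guides, max_n=1):
    """Option 1: drop every guide with more than max_n N characters."""
    return [g for g in guides if count_n(g) <= max_n]


def score_or_zero(guide, score_func, max_n=1):
    """Option 2: keep the guide, but report 0 instead of calling the scorer."""
    if count_n(guide) > max_n:
        return 0
    return score_func(guide)
```

Rejecting keeps unusable guides out of the output entirely, while a score of 0 keeps them visible; which of the two is preferable is left open in this thread.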

tiramisutes commented 5 years ago

I think it is normal for a genome sequence to include Ns, even for Arabidopsis thaliana. We should reject any guide sequence that contains an N.

INFO:root: * running on sequence 'AT1G76040.1.exon1', guideLen=20, seqLen=949
INFO:root:Progress EgR745rzsQAIIbGKJh0g - bwasw - Searching genome for one 100% identical match to input sequence
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[bsw2_aln] read 1 sequences/pairs (949 bp) ...
[main] Version: 0.7.15-r1140
[main] CMD: /public/home/software/crisporWebsite/bin/Linux/bwa bwasw -T 20 /public/home/software/crisporWebsite/genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.fa /tmp/crisporBestMatchTM11nw.fa
[main] Real time: 0.079 sec; CPU: 0.080 sec
INFO:root:Progress EgR745rzsQAIIbGKJh0g - effScores - Calculating guide efficiency scores
INFO:root:Progress EgR745rzsQAIIbGKJh0g - outcome - Calculating editing outcomes
WARNING:root:guide GAAAAATAATGTTGCCTTTGGTTGGTTTTGTGGGGGTGCTTNNNNNNNNNNNNNNNNNNN contains at least one N

In addition, when I run many exon sequences from a single FASTA file and the run aborts due to an error, no results are written at all, even for the sequences that ran successfully. I think results should be written as each sequence finishes, so that I can restart from the sequence that failed.

tiramisutes commented 5 years ago

Besides, I also get the errors ValueError: 'K' is not in list and, for another exon sequence, ValueError: 'R' is not in list on stderr. But I have checked the sequences and found no K or R characters in them.

INFO:root: * running on sequence 'AT2G48110.1.TAIR10.CDS.12', guideLen=20, seqLen=661
INFO:root:Progress buml1MSjCIddEtqmaN7f - bwasw - Searching genome for one 100% identical match to input sequence
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[bsw2_aln] read 1 sequences/pairs (661 bp) ...
[main] Version: 0.7.15-r1140
[main] CMD: bwa bwasw -T 20 Arabidopsis_thaliana.fa /tmp/crisporBestMatchQ4RqaY.fa
[main] Real time: 0.140 sec; CPU: 0.141 sec
INFO:root:Progress buml1MSjCIddEtqmaN7f - effScores - Calculating guide efficiency scores
Traceback (most recent call last):
  File "crispor.py", line 8293, in <module>
    main()
  File "crispor.py", line 8291, in main
    mainCommandLine()
  File "crispor.py", line 8100, in mainCommandLine
    getOfftargets(seq, org, pamPat, batchId, startDict, ConsQueue())
  File "crispor.py", line 4295, in getOfftargets
    processSubmission(faFname, org, pamDesc, otBedFname, batchBase, batchId, queue)
  File "crispor.py", line 3835, in processSubmission
    createBatchEffScoreTable(batchId, queue)
  File "crispor.py", line 3454, in createBatchEffScoreTable
    guideRows = calcSaveEffScores(batchId, seq, extSeq, pam, queue)
  File "crispor.py", line 3392, in calcSaveEffScores
    effScores = crisporEffScores.calcAllScores(longSeqs, enzyme=enz, scoreNames=scoreNames)
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 885, in calcAllScores
    scores["fusi"] = calcAziScore(trimSeqs(seqs, -24, 6))
  File "/public/home/zpxu/software/crisporWebsite/crisporEffScores.py", line 1117, in calcAziScore
    score = azimuth.model_comparison.predict(numpy.array([seq]), None, None, pam_audit=False)
  File "/public/home/zpxu/software/crisporWebsite/bin/Azimuth-2.0/azimuth/model_comparison.py", line 559, in predict
    feature_sets = feat.featurize_data(Xdf, learn_options, pandas.DataFrame(), gene_position, pam_audit=pam_audit, length_audit=length_audit)
  File "/public/home/zpxu/software/crisporWebsite/bin/Azimuth-2.0/azimuth/features/featurization.py", line 31, in featurize_data
    get_all_order_nuc_features(data['30mer'], feature_sets, learn_options, learn_options["order"], max_index_to_use=30, quiet=quiet)
  File "/public/home/zpxu/software/crisporWebsite/bin/Azimuth-2.0/azimuth/features/featurization.py", line 153, in get_all_order_nuc_features
    include_pos_independent=True, max_index_to_use=max_index_to_use, prefix=prefix)
  File "/public/home/zpxu/software/crisporWebsite/bin/Azimuth-2.0/azimuth/features/featurization.py", line 423, in apply_nucleotide_features
    feat_pd = seq_data_frame.apply(nucleotide_features, args=(order, max_index_to_use, prefix, 'pos_dependent'))
  File "/public/home/zpxu/.local/lib/python2.7/site-packages/pandas/core/series.py", line 3591, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2217, in pandas._libs.lib.map_infer
  File "/public/home/zpxu/.local/lib/python2.7/site-packages/pandas/core/series.py", line 3578, in f
    return func(x, *args, **kwds)
  File "/public/home/zpxu/software/crisporWebsite/bin/Azimuth-2.0/azimuth/features/featurization.py", line 468, in nucleotide_features
    features_pos_dependent[alphabet.index(nucl) + (position*len(alphabet))] = 1.0
ValueError: 'K' is not in list

>AT2G48110.1.TAIR10.CDS.12
GTCCCATTACTTGCTGGTGCTTTGATGCCAATATGTGAAGCGTTTGGCTC
CGGCGTTCCAAACATTACGTGGACTCTCCCGACTGGCGAATTAATCTCCT
CTCATGCTGTTTTCTCCACTGCATTTACACTTCTTCTGAGGCTATGGAGA
TTTGATCACCCACCACTAGATTACGTCTTGGGAGATGTTCCCCCGGTGGG
CCCTCAACCCAGCCCTGAGTATCTGTTGTTAGTAAGAAATTGCCGTCTGG
AATGTTTTGGAAAGTCCCCAAAGGATCGCATGGCACGTCGAAGATTTTCG
AAAGTGATAGATATCTCTGTGGATCCCATCTTCATGGATTCATTCCCCAG
ACTGAAACAGTGGTACCGGCAGCATCAGGAATGTATGGCTTCAATTCTCT
CTGAACTAAAGACAGGAAGCCCAGTGCATCACATTGTCGATTCCCTCCTT
AGCATGATGTTCAAGAAGGCAAACAAAGGTGGTAGTCAGTCACTGACCCC
ATCTTCAGGGAGCAGTAGTTTATCTACTTCTGGAGGTGATGACTCGTCTG
ATCAACTCAAGTTACCTGCATGGGATATCTTGGAAGCGGCMCCGTTTGTG
CTTGATGCTGCTCTAACTGCTTGTGCTCATGGATCACTCTCTCCCCGGGA
ACTAGCAACAG

maximilianh commented 5 years ago

Is it possible that these characters are in the flanking sequences? Crispor will retrieve +/- 1000 bp to get flanking sequences. This is the first genome where I see IUPAC characters that are not N. It would be easy to modify Crispor to get rid of these characters (e.g. replace them with A, C, T or G) in order to get Azimuth to run. Or remove the whole guide.

tiramisutes commented 5 years ago

OK, I can replace all the ambiguity characters (W, S, M, K, R, Y) with N in the whole genome. Do you have a suggestion for my question above (https://github.com/maximilianh/crisporWebsite/issues/33#issuecomment-512209104)?
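A minimal sketch of such a replacement, written for Python 3 and independent of CRISPOR: header lines are left untouched so that a sequence name containing one of these letters is not altered, and lowercase (soft-masked) bases become a lowercase n. `mask_ambiguity` is a hypothetical helper name. Besides the six codes listed above, it also covers the remaining IUPAC codes B, D, H and V, which is an addition beyond what was asked.

```python
# IUPAC nucleotide ambiguity codes other than N
AMBIGUITY = "RYSWKMBDHV"
TO_N = str.maketrans(
    AMBIGUITY + AMBIGUITY.lower(),
    "N" * len(AMBIGUITY) + "n" * len(AMBIGUITY),
)


def mask_ambiguity(lines):
    """Yield FASTA lines with ambiguity codes replaced by N; headers unchanged."""
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.translate(TO_N)
```

The same function can be applied to both the genome FASTA file and the input sequence file, which matters if the input sequences were extracted from the genome before it was cleaned.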

maximilianh commented 5 years ago

Sorry I have a lot of trouble understanding your English. Also sorry that my replies come in so slowly, I'm not fully working this week. Yes, I agree with you, this should be handled better. As I mentioned before, Crispor should just ignore guides that include any IUPAC characters in them. But thanks for reminding me of this and

maximilianh commented 4 years ago

Haven't heard back from user in a while and haven't seen any genomes with strange IUPAC characters in them. Also, this looks like a problem in the Azimuth code more than in Crispor. I could work around it, but it doesn't seem high priority right now.

tiramisutes commented 3 years ago

I have replaced all the ambiguous characters (W, S, M, K, R, Y) with N in the whole genome FASTA file, but I still get the same error on stderr (ValueError: 'K' is not in list).

maximilianh commented 3 years ago

If you replace the characters, you'll also have to remove them from your input sequence file. The software clearly found a K here in some sequence, so either that's in the genome (and you missed something when removing them) or the K is in the input sequence.
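One way to check the input sequence file is to list every character outside A, C, G and T. The sketch below (Python 3, hypothetical helper names) does that for a fragment of the AT2G48110.1.TAIR10.CDS.12 sequence posted earlier in this thread, which contains an M. It also shows that under the standard IUPAC complement table M complements to K (and Y to R), so a reverse-complemented sequence can contain a K even when the forward sequence has none. Whether that is what triggered the error here is an assumption that this thread does not confirm.

```python
# Standard IUPAC nucleotide complements
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def find_non_acgt(seq):
    """Return (0-based position, character) for every character outside A, C, G, T."""
    return [(i, c) for i, c in enumerate(seq.upper()) if c not in "ACGT"]


def reverse_complement(seq):
    """Reverse-complement a sequence using the IUPAC table above."""
    return "".join(COMPLEMENT[c] for c in reversed(seq.upper()))


# Fragment copied from the posted AT2G48110.1.TAIR10.CDS.12 sequence
fragment = "GGAAGCGGCMCCGTTTGTG"
print(find_non_acgt(fragment))
print(reverse_complement(fragment))
```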