tkzeng / Pangolin

Pangolin is a deep-learning method for predicting splice site strengths.
GNU General Public License v3.0

Error for some variants #13

Closed Bin-Guan closed 10 months ago

Bin-Guan commented 1 year ago

With Pangolin version 1.0.1, some variants fail with the error message below. For example, the first variant listed here ran fine, but the second produced the error. I used the provided GRCh38 annotation file, gencode.v38.annotation.db. Could you please help resolve the error? Thanks.

CHROM,POS,REF,ALT
chr3,66167878,G,C
chr3,66192184,T,C

Traceback (most recent call last):
  File "/opt/conda/bin/pangolin", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 241, in main
    scores = process_variant(lnum+1, str(chr), int(pos), ref, alt, gtf, models, args)
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 127, in process_variant
    loss_pos, gain_pos = compute_score(ref_seq, alt_seq, '+', d, models)
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 30, in compute_score
    ref_seq = one_hot_encode(ref_seq, strand).T
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 22, in one_hot_encode
    seq = np.asarray(list(map(int, list(seq))))
ValueError: invalid literal for int() with base 10: 'Y'

mingliu815 commented 11 months ago

Hi, I got a similar error - mine is "ValueError: invalid literal for int() with base 10: 'M'". I was wondering whether you had any clue about it. Thank you!

mingliu815 commented 11 months ago

Figured it out: one_hot_encode() currently handles only A/C/G/T/N, but the FASTA file also contains other IUPAC nucleotide codes (such as Y and M), and those are what trigger the ValueError.
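To check whether a given reference FASTA contains such codes, a quick scan along these lines may help. This is only a minimal sketch and not part of Pangolin; the function name `count_unsupported_bases` is hypothetical, and it assumes a plain-text FASTA where header lines start with `>`.

```python
from collections import Counter

# Characters the encoder is reported to handle (per the comment above).
SUPPORTED = set("ACGTN")

def count_unsupported_bases(lines):
    """Count characters other than A/C/G/T/N (case-insensitive) in FASTA sequence lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        for base in line.strip().upper():
            if base not in SUPPORTED:
                counts[base] += 1
    return counts

# Small in-memory example; for a real file, pass the open file handle instead:
#     with open("sample.fasta") as fh:
#         print(count_unsupported_bases(fh))
example = [">chr_demo\n", "ACGTYNNM\n", "acgtrACG\n"]
print(count_unsupported_bases(example))
```

An empty result would suggest the error comes from somewhere else.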

Bin-Guan commented 11 months ago

Can you share how you solved the error? Did you need to modify the code? Thanks.

mingliu815 commented 11 months ago

I haven't modified any code myself yet, I just have some ideas. One option: in the function one_hot_encode(), map every other possible nucleotide code to 0 as well, for example with replace('Y', '0'), and so on. Possible codes can be found in the first table here. Another option: replace all of these "undefined" letters in your FASTA file with "N" (write a new FASTA) so that you don't need to modify any code for now.
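The first idea could look roughly like the sketch below. It is not Pangolin's actual one_hot_encode(); it assumes, based on the traceback and the suggestion above, that bases become integer codes with 0 meaning "unknown". The codes 1 to 4 for A/C/G/T and the helper names `encode_bases` and `one_hot` are assumptions made for illustration.

```python
# Hypothetical mapping: 0 is the "unknown" code that N is assumed to use.
BASE_TO_INDEX = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode_bases(seq):
    """Map each base to an integer code; N and any other letter become 0."""
    return [BASE_TO_INDEX.get(base, 0) for base in seq.upper()]

def one_hot(indices):
    """Turn integer codes into 4-element rows; code 0 becomes an all-zero row."""
    rows = []
    for idx in indices:
        row = [0, 0, 0, 0]
        if idx > 0:
            row[idx - 1] = 1
        rows.append(row)
    return rows

codes = encode_bases("ACGTYMN")
print(codes)
print(one_hot(codes))
```

Using a dictionary lookup with a default avoids listing every ambiguity code by hand. Note that treating Y, M, and similar codes like N discards the partial information they carry, which may or may not matter for a given variant.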

Bin-Guan commented 10 months ago

sed '/^[^>]/ s/[^AGTCN]/N/gi' sample.fasta > sample.AGTCNonly.fasta did the trick. Thanks!
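For anyone without GNU sed (the case-insensitive flag on the substitution is a GNU extension), roughly the same masking can be sketched in Python. The function name `mask_fasta_lines` is hypothetical. Like the sed command, it leaves header lines untouched and keeps lowercase a/c/g/t/n as they are; unlike sed, it strips any carriage return before substituting and always writes a trailing newline.

```python
import re

# Any character that is not A/C/G/T/N in either case.
NON_ACGTN = re.compile(r"[^ACGTN]", re.IGNORECASE)

def mask_fasta_lines(lines):
    """Yield FASTA lines with every non-A/C/G/T/N sequence character replaced by N."""
    for line in lines:
        body = line.rstrip("\r\n")
        if body and not body.startswith(">"):
            body = NON_ACGTN.sub("N", body)
        yield body + "\n"

# Possible usage with the file names from the sed command above:
#     with open("sample.fasta") as src, open("sample.AGTCNonly.fasta", "w") as dst:
#         dst.writelines(mask_fasta_lines(src))
print("".join(mask_fasta_lines([">chr_demo\n", "ACGTYmn\n"])), end="")
```

If the FASTA has an index file (for example a .fai), it is probably safest to regenerate it for the new file, although a one-for-one character replacement should not change sequence lengths.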