Closed by Bin-Guan 10 months ago
Hi, I got a similar error - mine is "ValueError: invalid literal for int() with base 10: 'M'". I was wondering whether you had any clue about it. Thank you!
Figured it out: one_hot_encode() currently handles only A/C/G/T/N, but the FASTA file also contains other nucleotide codes (such as Y and M).
Can you share how you solved the error? Did you need to modify the code? Thanks.
I haven't modified any code myself yet, but I have some ideas. One option: in the function one_hot_encode(), map every other possible nucleotide code to 0 as well, e.g. replace('Y', '0'), and so on. The possible codes are listed in the first table here. Another option: replace all of these "undefined" letters in your FASTA file with "N" (writing a new FASTA file), so that you don't need to modify any code for now.
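A minimal sketch of the first idea, written as a standalone function rather than a patch to Pangolin. The name tolerant_one_hot and the lookup table are hypothetical, and the assumption that unknown codes should become an all-zero row (the same treatment suggested above for 'Y') has not been checked against the model's expectations:

```python
# Hypothetical helper, not part of Pangolin.
# Assumption: any base other than A/C/G/T (N, or IUPAC codes such as
# Y, M, R) is encoded as an all-zero row.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
ZERO_ROW = [0, 0, 0, 0]


def tolerant_one_hot(seq):
    """Return one 4-element row per base; unknown bases become zeros."""
    # Upper-case first so soft-masked (lower-case) bases are recognised.
    return [list(ONE_HOT.get(base, ZERO_ROW)) for base in seq.upper()]
```

Because the lookup falls back to the zero row, no int() conversion is involved and an unexpected letter can no longer raise ValueError.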
sed '/^[^>]/ s/[^AGTCN]/N/gi' sample.fasta > sample.AGTCNonly.fasta did the trick. Thanks!
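For systems where sed does not accept the case-insensitive i flag, roughly the same masking can be sketched in Python. The function names mask_ambiguous and mask_fasta are made up for illustration; like the sed command, header lines (starting with ">") are left alone and lower-case a/c/g/t/n are kept as they are:

```python
import re

# Matches any character that is not A, C, G, T or N (either case).
NON_ACGTN = re.compile(r"[^ACGTN]", re.IGNORECASE)


def mask_ambiguous(lines):
    """Yield FASTA lines with non-ACGTN sequence characters set to 'N'."""
    for line in lines:
        if line.startswith(">"):
            # Header line: pass through unchanged.
            yield line
        else:
            # Keep the original line ending out of the substitution.
            body = line.rstrip("\r\n")
            ending = line[len(body):]
            yield NON_ACGTN.sub("N", body) + ending


def mask_fasta(src_path, dst_path):
    """Write a masked copy of src_path to dst_path, line by line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(mask_ambiguous(src))
```

With the file names from the command above, the call would be mask_fasta("sample.fasta", "sample.AGTCNonly.fasta").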
With version 1.0.1, some variants fail with the error message below. For example, the first variant listed here worked fine, but the second produced the error. I used the provided GRCh38 annotation file, gencode.v38.annotation.db. Could you please help resolve this? Thanks.
CHROM,POS,REF,ALT
chr3,66167878,G,C
chr3,66192184,T,C
Traceback (most recent call last):
  File "/opt/conda/bin/pangolin", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 241, in main
    scores = process_variant(lnum+1, str(chr), int(pos), ref, alt, gtf, models, args)
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 127, in process_variant
    loss_pos, gain_pos = compute_score(ref_seq, alt_seq, '+', d, models)
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 30, in compute_score
    ref_seq = one_hot_encode(ref_seq, strand).T
  File "/opt/conda/lib/python3.9/site-packages/pangolin/pangolin.py", line 22, in one_hot_encode
    seq = np.asarray(list(map(int, list(seq))))
ValueError: invalid literal for int() with base 10: 'Y'
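The last frame can be reproduced in isolation. The sketch below assumes, based on the discussion earlier in this thread, that the known letters have already been swapped for digits before this step, so any leftover letter reaches int() unchanged. The helper name digits_to_ints and the input "1234Y" are invented for illustration, and NumPy is left out to keep the example standalone:

```python
def digits_to_ints(seq):
    """Same conversion as the failing line in the traceback, minus NumPy."""
    return list(map(int, list(seq)))


# Assumption: A/C/G/T/N were already replaced by digits; 'Y' was not.
try:
    digits_to_ints("1234Y")
except ValueError as err:
    message = str(err)
    print(message)  # invalid literal for int() with base 10: 'Y'
```

This suggests the second variant fails because the reference sequence around chr3:66192184 contains a Y, while the sequence around the first variant contains only letters the encoder knows.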