huang-sh / GPTBioInsightor

Gain biological insights using LLM.
BSD 3-Clause "New" or "Revised" License

UnicodeEncodeError: 'ascii' codec can't encode character '\u03b2' #1

Open Starlitnightly opened 1 day ago

Starlitnightly commented 1 day ago

Hi,

Thank you for developing this tool, but I encountered the following issue while running it.

Sincerely, Zehua

code

marker_dict={'0': ['SERPINA1',
  'VTN',
  'APOE',
  'APOC1',
  'APOA1',
  'GC',
  'CLU',
  'TTR',
  'RARRES2',
  'FGG'],
 '1': ['CXCR4',
  'CCL5',
  'SPOCK2',
  'IL7R',
  'CD2',
  'CD69',
  'S100A4',
  'ETS1',
  'RGS1',
  'LTB'],
 '10': ['SERPINA1',
  'APOC1',
  'VTN',
  'CLU',
  'SERPINA3',
  'LYZ',
  'IGF2',
  'SPINK1',
  'SPP1',
  'CCL20'],
 '11': ['SAA1/2',
  'APOE',
  'SERPINA1',
  'FGG',
  'CLU',
  'GC',
  'CRP',
  'AZGP1',
  'SOD2',
  'CCND1'],
 '12': ['AHI1',
  'EFNB1',
  'LYN',
  'RPS4Y1',
  'CCL11',
  'CCL8',
  'ACKR3',
  'CLEC4A',
  'TLR2',
  'EPHB6'],
 '13': ['ANXA4',
  'SPP1',
  'KRT8',
  'KRT7',
  'S100A6',
  'CD24',
  'MMP7',
  'KRT18',
  'FGFR2',
  'FGFR3'],
 '14': ['IGFBP7',
  'DCN',
  'CXCL12',
  'CD81',
  'BGN',
  'IGFBP3',
  'C11orf96',
  'LUM',
  'THBS1',
  'COL6A1'],
 '15': ['IGHG1', 'IGHG2', 'MZB1', 'JCHAIN'],
 '16': ['CD74',
  'IGKC',
  'MS4A1',
  'CD37',
  'IGHM',
  'LTB',
  'CXCR4',
  'IGHA1',
  'HLA-DQB1/2',
  'CIITA'],
 '17': ['IGHA1', 'IGKC', 'JCHAIN', 'MZB1'],
 '18': ['VTN',
  'IL32',
  'CCL5',
  'SPINK1',
  'NKG7',
  'CD8A',
  'CD2',
  'PTPRCAP',
  'S100P',
  'RAC2'],
 '19': ['APOC1',
  'SERPINA1',
  'APOA1',
  'FGG',
  'TTR',
  'SERPINA3',
  'SAA1/2',
  'VTN',
  'APOE',
  'GC'],
 '2': ['CD74',
  'HLA-DRB',
  'HLA-DRA',
  'PSAP',
  'LYZ',
  'C1QC',
  'HLA-DPA1',
  'SRGN',
  'C1QB',
  'C1QA'],
 '20': ['TPSAB1/B2',
  'SRGN',
  'CPA3',
  'KIT',
  'CTSG',
  'RGS2',
  'IL1RL1',
  'RGS1',
  'NFKBIA',
  'CD69'],
 '21': ['SERPINA1',
  'CLU',
  'RARRES2',
  'APOA1',
  'APOE',
  'SPINK1',
  'PTGDS',
  'VTN',
  'IFI27',
  'CTSD'],
 '3': ['COL1A1',
  'COL3A1',
  'COL1A2',
  'TIMP1',
  'IGFBP7',
  'DCN',
  'COL6A3',
  'COL6A1',
  'THBS2',
  'COL6A2'],
 '4': ['GPX7', 'EPHA7', 'GCG'],
 '5': ['SERPINA1',
  'MT2A',
  'APOC1',
  'FGG',
  'APOA1',
  'SERPINA3',
  'GC',
  'MT1X',
  'SAA1/2',
  'INSIG1'],
 '6': ['IGFBP7',
  'HSPG2',
  'COL4A1',
  'VIM',
  'PECAM1',
  'COL4A2',
  'GSN',
  'CD34',
  'MARCKSL1',
  'CD93'],
 '7': ['KRT17', 'TPSAB1/B2'],
 '8': ['IGKC', 'IGHG1', 'IGHG2', 'MZB1', 'JCHAIN'],
 '9': ['MT2A',
  'C11orf96',
  'TAGLN',
  'ACTA2',
  'MYL9',
  'MYH11',
  'NOTCH3',
  'SPARCL1']}

# Set the background information for the data
background = "Cells from Liver" 

# Make sure differential expression analysis has already been run on adata, e.g.: sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
# We use Aliyun's qwen2-72b-instruct for this demo, but you can also use OpenAI gpt-4o
res = gbi.get_celltype(adata.uns['marker_dict'], background=background, out="celltype.md", 
                       topnumber=15,provider="aliyun", model="qwen-plus")
res

Error


UnicodeEncodeError                        Traceback (most recent call last)
Cell In[133], line 12
      8 background = "Cells from Liver"
     10 # Make sure differential expression analysis has already been run on adata, e.g.: sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
     11 # We use Aliyun's qwen2-72b-instruct for this demo, but you can also use OpenAI gpt-4o
---> 12 res = gbi.get_celltype(adata, background=background, out="celltype.md",
     13                        topnumber=15,provider="aliyun", model="qwen-plus")
     14 res

File ~/miniconda3/envs/omicverse/lib/python3.10/site-packages/gptbioinsightor/celltype.py:91, in get_celltype(input, out, background, key, topnumber, n_jobs, provider, model, group, base_url, rm_genes)
     89 for gsid, res in zip(gene_dic.values(), results):
     90     res = res.strip("```").strip("'''").strip()
---> 91     print(res, file=handle)
     92     ctn = ul.get_celltype_name(res)
     93     celltype_ls.append(ctn)

UnicodeEncodeError: 'ascii' codec can't encode character '\u03b2' in position 1639: ordinal not in range(128)
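For readers who want to reproduce this outside the package, here is a minimal sketch of the same failure. It is not taken from the gptbioinsightor source: the sample text and the temporary file path are made up for illustration, and the ASCII encoding is forced explicitly to mimic an environment whose default encoding is ASCII.

```python
import os
import tempfile

# Hypothetical LLM output containing the Greek letter beta (U+03B2).
text = "TGF-\u03b2 signaling"

# Illustrative output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "celltype.md")

# An ASCII-encoded handle stands in for a Python process whose
# default text encoding is ASCII.
error = None
try:
    with open(path, "w", encoding="ascii") as handle:
        print(text, file=handle)
except UnicodeEncodeError as exc:
    error = exc

# Show the exception that was caught.
print(type(error).__name__, error)
```

The exception comes from the file handle's encoder, not from `print` itself, which is why the traceback points at the `print(res, file=handle)` line.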

Starlitnightly commented 1 day ago

UnicodeEncodeError                        Traceback (most recent call last)
Cell In[137], line 12
      8 background = "Cells from Liver"
     10 # Make sure differential expression analysis has already been run on adata, e.g.: sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
     11 # We use Aliyun's qwen2-72b-instruct for this demo, but you can also use OpenAI gpt-4o
---> 12 res = gbi.get_celltype(adata.uns['marker_dict'], background=background, out="celltype.md",
     13                        topnumber=15,provider="aliyun", model="qwen-plus")
     14 res

File ~/miniconda3/envs/omicverse/lib/python3.10/site-packages/gptbioinsightor/celltype.py:91, in get_celltype(input, out, background, key, topnumber, n_jobs, provider, model, group, base_url, rm_genes)
     89 for gsid, res in zip(gene_dic.values(), results):
     90     res = res.strip("```").strip("'''").strip()
---> 91     print(res, file=handle)
     92     ctn = ul.get_celltype_name(res)
     93     celltype_ls.append(ctn)

UnicodeEncodeError: 'ascii' codec can't encode character '\xef' in position 2211: ordinal not in range(128)

huang-sh commented 1 day ago

Hi Zehua,

Thank you for your feedback. I think this is an encoding problem.
I get a similar error when I set my Python environment to use ASCII encoding. Could you try setting your environment to UTF-8 encoding? For example:

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

Thanks again. I will fix this issue in the next release.

Best, Shenghui
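As a complement to the locale suggestion above, the sketch below shows one way to check which encoding Python falls back to and to write a file that does not depend on it. It is an illustration only, not the fix shipped in the package: the sample text and file path are hypothetical, and passing `encoding="utf-8"` to `open` is a general Python technique rather than something confirmed from the gptbioinsightor code.

```python
import locale
import os
import tempfile

# Encoding that open() uses when no `encoding` argument is given.
# A value such as 'ANSI_X3.4-1968' or 'US-ASCII' means an ASCII default.
print("preferred encoding:", locale.getpreferredencoding(False))

# Hypothetical text with a non-ASCII character, and an illustrative path.
text = "TGF-\u03b2 signaling"
path = os.path.join(tempfile.mkdtemp(), "celltype.md")

# Passing the encoding explicitly makes the write independent of the locale.
with open(path, "w", encoding="utf-8") as handle:
    print(text, file=handle)

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()

print(round_trip == text + "\n")
```

If changing the locale is not practical, setting the environment variable `PYTHONUTF8=1` before starting Python (3.7 or later) enables UTF-8 mode, which also makes UTF-8 the default for text files.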

huang-sh commented 1 day ago

Hi Zehua,

Could you install the latest version and test whether it solves your problem?

pip install gptbioinsightor==0.5.1

Best, Shenghui