microsoft / lida

Automatic Generation of Visualizations and Infographics using Large Language Models
https://microsoft.github.io/lida/
MIT License
2.57k stars, 261 forks

Lida Summarizer, data type conversion error #117

Open Dejian0328 opened 1 month ago

Dejian0328 commented 1 month ago

Is anyone else facing this issue? I am trying to summarize a DataFrame and end up with a data type error. Can you please advise on this?

df = pd.DataFrame.from_records(data, columns=columns)
data_summary = lida.summarize(df, summary_method="llm", textgen_config=textgen_config)

df:

   ContributionID  MemberID  EmployerID  ContributionMonth  EmployeeShare  \
0               1        27          15                May         883.43
1               2        44           2           December         626.79
2               3         1          17            January         732.94
3               4        28          15          September         149.57
4               5        49          15          September         616.06
5               6        45           8           February         154.46
6               7        41          16             August         941.70
7               8         2           3               July         707.85
8               9         2           8                May         186.81
9              10        22           7               June         558.11

   EmployerShare  TotalContribution  ContributionDate
0         536.68            1420.11        2021-05-13
1         368.82             995.61        2024-12-23
2         716.15            1449.09        2021-01-03
3         258.10             407.67        2022-09-27
4         519.45            1135.51        2022-09-09
5         840.50             994.96        2022-02-25
6         990.86            1932.56        2020-08-17
7         960.77            1668.62        2021-07-08
8         349.01             535.82        2021-05-16
9         585.05            1143.16        2022-06-30

error log:

\lida\components\manager.py:131, in Manager.summarize(self, data, file_name, n_samples, summary_method, textgen_config)
    [128] data = read_dataframe(data)
    [130] self.data = data
--> [131] return self.summarizer.summarize(
    [132]     data=self.data, text_gen=self.text_gen, file_name=file_name, n_samples=n_samples,
    [133]     summary_method=summary_method, textgen_config=textgen_config)

\lida\components\summarizer.py:130, in Summarizer.summarize(self, data, text_gen, file_name, n_samples, textgen_config, summary_method, encoding)
    [128] # modified to include encoding
    [129] data = read_dataframe(data, encoding=encoding)
--> [130] data_properties = self.get_column_properties(data, n_samples)
    [132] # default single stage summary construction
...

File tslib.pyx:596, in pandas._libs.tslib.array_to_datetime()

File tslib.pyx:588, in pandas._libs.tslib.array_to_datetime()

TypeError: <class 'decimal.Decimal'> is not convertible to datetime, at position 0

skyprince999 commented 1 month ago

Can you share a copy of the data? Is it a TSV file?

Typically, while summarizing, the function uses pandas.to_datetime to try to convert column values to datetime objects. If the values are not in a format it can convert, it raises an error.
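For anyone who wants to check this outside of lida, here is a minimal sketch that calls pandas.to_datetime directly on date strings and on decimal.Decimal values. The helper name datetime_error and the sample values are only for illustration and are not part of lida; this also does not reproduce how lida itself invokes the conversion.

```python
from decimal import Decimal

import pandas as pd


def datetime_error(values):
    """Return the name of the exception pd.to_datetime raises, or None on success.

    Hypothetical helper for illustration only; it is not part of lida.
    """
    try:
        # Build an object-dtype Series, similar to a column of raw DB values.
        pd.to_datetime(pd.Series(values, dtype=object), errors="raise")
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# ISO date strings are convertible, so no exception is expected here.
date_result = datetime_error(["2021-05-13", "2024-12-23"])

# Decimal objects are neither strings nor native numbers, so conversion fails.
decimal_result = datetime_error([Decimal("883.43"), Decimal("626.79")])

print("date strings:", date_result)
print("Decimal values:", decimal_result)
```

If the Decimal case reports an exception while the date strings do not, that matches the traceback posted above, where the failure happens at position 0 of a Decimal column.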

Dejian0328 commented 1 month ago

I extract the data from an Azure SQL DB using a pyodbc cursor. The conversion raises an error when the data is in a decimal data type. Once I manually convert those columns to float in the Azure DB, the summarize function works fine.

The error is raised when I do not exclude the EmployeeShare, EmployerShare and TotalContribution columns.
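A possible client-side alternative to changing the column types in the database is to cast the Decimal columns to float in pandas before summarizing. The sketch below assumes the affected columns arrive as object-dtype columns holding decimal.Decimal values; the helper name decimals_to_float is made up for this example, and I have not tested it against lida itself.

```python
from decimal import Decimal

import pandas as pd


def decimals_to_float(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with all-Decimal object columns cast to float.

    Hypothetical helper for illustration only; it is not part of lida.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue  # only object columns can hold Decimal instances
        non_null = out[col].dropna()
        if non_null.empty:
            continue
        # Convert only when every non-null value is a Decimal.
        if non_null.map(lambda v: isinstance(v, Decimal)).all():
            as_float = out[col].map(
                lambda v: float(v) if isinstance(v, Decimal) else v
            )
            out[col] = pd.to_numeric(as_float)
    return out


# Small sample shaped like the first rows of the data posted above.
df = pd.DataFrame(
    {
        "ContributionMonth": ["May", "December"],
        "EmployeeShare": [Decimal("883.43"), Decimal("626.79")],
        "TotalContribution": [Decimal("1420.11"), Decimal("995.61")],
    }
)
fixed = decimals_to_float(df)
print(fixed.dtypes)
```

The converted frame could then be passed to lida.summarize in place of df, as in the original post. Note that float conversion gives up exact decimal precision, which may matter for monetary columns if the frame is reused elsewhere.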