h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

Error "Invalid data access for a virtual column" when creating datatable.Frame() from pandas.DataFrame #3470

Open ooooona opened 1 year ago

ooooona commented 1 year ago

Hi, I found a bug when trying to convert a pandas.DataFrame to a datatable.Frame.

Succeeded pandas.DataFrame:

| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | self-employed | single | secondary | no | -921 | no | no | telephone | 2 | jan | 9 | 1 | 56 | unknown | success |
| 1 | 26 | entrepreneur | married | secondary | no | -1206 | no | no | cellular | 5 | apr | 16 | 9 | 56 | unknown | other |
| 2 | 25 | admin. | single | primary | no | -932 | no | no | telephone | 1 | jun | 14 | 5 | 1 | unknown | other |
| 3 | 24 | retired | divorced | secondary | no | -701 | no | no | cellular | 6 | may | 11 | 1 | 1 | unknown | failure |
| 4 | 28 | entrepreneur | single | primary | no | -932 | yes | yes | telephone | 5 | jan | 15 | 1 | 69 | unknown | other |
| 5 | 29 | self-employed | single | secondary | no | -701 | yes | yes | cellular | 10 | may | 7 | 2 | 2 | unknown | success |
| 6 | 21 | housemaid | divorced | primary | no | -679 | yes | yes | telephone | 3 | aug | 16 | 1 | 85 | unknown | other |
| 7 | 27 | services | married | secondary | no | -665 | yes | yes | cellular | 10 | may | 9 | 4 | 81 | unknown | success |
| 8 | 29 | admin. | married | primary | no | -710 | no | no | telephone | 2 | nov | 14 | 10 | 73 | unknown | success |
| 9 | 26 | technician | divorced | primary | no | -921 | yes | yes | telephone | 4 | may | 11 | 2 | 81 | unknown | success |
| 10 | 28 | admin. | divorced | primary | no | -701 | no | no | telephone | 6 | dec | 10 | 10 | -1 | unknown | failure |
| 11 | 20 | housemaid | divorced | tertiary | no | -679 | yes | yes | telephone | 10 | apr | 9 | 4 | 81 | unknown | success |
| 12 | 25 | entrepreneur | married | primary | no | -710 | no | no | cellular | 4 | dec | 12 | 8 | 73 | unknown | other |
| 13 | 20 | housemaid | married | tertiary | no | -679 | yes | yes | telephone | 1 | dec | 12 | 10 | 85 | unknown | other |
| 14 | 29 | blue-collar | married | primary | no | -932 | no | no | telephone | 5 | feb | 7 | 4 | 64 | unknown | failure |

Failed pandas.DataFrame:

| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23.0 | self-employed | single | secondary | no | -921.0 | no | no | telephone | 2.0 | jan | 9.0 | 1.0 | 56.0 | unknown | success |
| 1 | 26.0 | entrepreneur | married | secondary | no | -1206.0 | no | no | cellular | 5.0 | apr | 16.0 | 9.0 | 56.0 | unknown | other |
| 2 | 25.0 | admin. | single | primary | no | -932.0 | no | no | telephone | 1.0 | jun | 14.0 | 5.0 | 1.0 | unknown | other |
| 3 | 24.0 | retired | divorced | secondary | no | -701.0 | no | no | cellular | 6.0 | may | 11.0 | 1.0 | 1.0 | unknown | failure |
| 4 | 28.0 | entrepreneur | single | primary | no | -932.0 | yes | yes | telephone | 5.0 | jan | 15.0 | 1.0 | 69.0 | unknown | other |
| 5 | 29.0 | self-employed | single | secondary | no | -701.0 | yes | yes | cellular | 10.0 | may | 7.0 | 2.0 | 2.0 | unknown | success |
| 6 | 21.0 | housemaid | divorced | primary | no | -679.0 | yes | yes | telephone | 3.0 | aug | 16.0 | 1.0 | 85.0 | unknown | other |
| 7 | 27.0 | services | married | secondary | no | -665.0 | yes | yes | cellular | 10.0 | may | 9.0 | 4.0 | 81.0 | unknown | success |
| 8 | 29.0 | admin. | married | primary | no | -710.0 | no | no | telephone | 2.0 | nov | 14.0 | 10.0 | 73.0 | unknown | success |
| 9 | 26.0 | technician | divorced | primary | no | -921.0 | yes | yes | telephone | 4.0 | may | 11.0 | 2.0 | 81.0 | unknown | success |
| 10 | 28.0 | admin. | divorced | primary | no | -701.0 | no | no | telephone | 6.0 | dec | 10.0 | 10.0 | -1.0 | unknown | failure |
| 11 | 20.0 | housemaid | divorced | tertiary | no | -679.0 | yes | yes | telephone | 10.0 | apr | 9.0 | 4.0 | 81.0 | unknown | success |
| 12 | 25.0 | entrepreneur | married | primary | no | -710.0 | no | no | cellular | 4.0 | dec | 12.0 | 8.0 | 73.0 | unknown | other |
| 13 | 20.0 | housemaid | married | tertiary | no | -679.0 | yes | yes | telephone | 1.0 | dec | 12.0 | 10.0 | 85.0 | unknown | other |
| 14 | 29.0 | blue-collar | married | primary | no | -932.0 | no | no | telephone | 5.0 | feb | 7.0 | 4.0 | 64.0 | unknown | failure |

However, for the second DataFrame, the conversion works if I reduce the batch_size from 15 to 1.

Could you please help resolve this? Thanks so much!

  1. Paste the following into a CSV file:

     """csv
     age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
     23.0,self-employed,single,secondary,no,-921.0,no,no,telephone,2.0,jan,9.0,1.0,56.0,unknown,success
     26.0,entrepreneur,married,secondary,no,-1206.0,no,no,cellular,5.0,apr,16.0,9.0,56.0,unknown,other
     """

  2. Use dataframe = pandas.read_csv(${csv_path}) to load the CSV file as a pandas.DataFrame.

  3. Then execute table = datatable.Frame(dataframe); the process core dumps at this point.
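The steps above can be sketched as a single script, which may be easier to run when trying to reproduce the crash. This is only a minimal sketch: the helper names `write_csv` and `convert` and the file name `sample.csv` are made up for illustration, and whether these two rows alone are enough to trigger the error is an assumption, since the failing DataFrame shown earlier has 15 rows.

```python
import csv
import os
import tempfile

# The header and two data rows from step 1, copied from the report.
CSV_TEXT = """\
age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
23.0,self-employed,single,secondary,no,-921.0,no,no,telephone,2.0,jan,9.0,1.0,56.0,unknown,success
26.0,entrepreneur,married,secondary,no,-1206.0,no,no,cellular,5.0,apr,16.0,9.0,56.0,unknown,other
"""


def write_csv(path):
    """Step 1: write the sample rows to `path` (hypothetical helper)."""
    with open(path, "w", newline="") as fh:
        fh.write(CSV_TEXT)


def convert(path):
    """Steps 2 and 3: load with pandas, then convert to a datatable Frame.

    pandas and datatable are imported lazily so the CSV can be prepared
    on a machine that does not have them installed.
    """
    import pandas
    import datatable

    dataframe = pandas.read_csv(path)
    # Values written as 23.0, -921.0, ... are parsed as float64 columns.
    print(dataframe.dtypes)
    # This is the call reported to crash.
    return datatable.Frame(dataframe)


csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_csv(csv_path)
# convert(csv_path)  # uncomment to attempt the conversion
```

The `convert` call is left commented out because, per the report, it may terminate the interpreter with a core dump rather than raise a Python exception.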

I expect this to produce a datatable.Frame rather than a core dump.