hanwsf commented 4 years ago

import vaex
DIR = './train/'
df=vaex.open(DIR+'train_0.hdf5')
df1=vaex.open(DIR+'train_1.hdf5')
df3=vaex.concat([df,df1])
df3.head()

or:

import vaex
from glob import glob
DIR ='./train/'
files = glob(DIR+'train*.hdf5')
df = vaex.open(DIR+'train*.hdf5')

Both cases raise the following error in every version from 2.1.0 to the latest (2.0.2 has a trim issue instead):

ValueError                                Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj, include, exclude)
    968
    969             if method is not None:
--> 970                 return method(include=include, exclude=exclude)
    971             return None
    972         else:

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in _repr_mimebundle_(self, include, exclude, **kwargs)
   3615         # TODO: optimize, since we use the same data in both versions
   3616         # TODO: include latex version
-> 3617         return {'text/html':self._head_and_tail_table(format='html'), 'text/plain': self._head_and_tail_table(format='plain')}
   3618
   3619     def _repr_html_(self):

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in _head_and_tail_table(self, n, format)
   3408         N = _len(self)
   3409         if N <= n * 2:
-> 3410             return self._as_table(0, N, format=format)
   3411         else:
   3412             return self._as_table(0, n, N - n, N, format=format)

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in _as_table(self, i1, i2, j1, j2, format)
   3540         # parts += [""]
   3541         # return values_list
-> 3542         parts = table_part(i1, i2, parts)
   3543         if j1 is not None and j2 is not None:
   3544             values_list[0][1].append('...')

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in table_part(k1, k2, parts)
   3518             # slicing will invoke .extract which will make the evaluation
   3519             # much quicker
-> 3520             df = self[k1:k2]
   3521             for i, name in enumerate(column_names):
   3522                 try:

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in __getitem__(self, item)
   4552                 assert stop != -1
   4553                 stop = stop+1  # +1 to make it inclusive
-> 4554             df = self.trim()
   4555             df.set_active_range(start, stop)
   4556             return df.trim()

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in trim(self, inplace)
   3782         :rtype: DataFrame
   3783         '''
-> 3784         df = self if inplace else self.copy()
   3785         if self._index_start == 0 and self._index_end == self._length_original:
   3786             return df

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in copy(self, column_names, virtual)
   4852                 column = self.columns[name]
   4853                 if not isinstance(column, ColumnSparse):
-> 4854                     df.add_column(name, column, dtype=self._dtypes_override.get(name))
   4855             elif name in self.virtual_columns:
   4856                 if virtual:  # TODO: check if the ast is cached

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in add_column(self, name, data, dtype)
   5819         # self._length_original = len(data)
   5820         # self._index_end = self._length_unfiltered
-> 5821         super(DataFrameArrays, self).add_column(name, data, dtype=dtype)
   5822         self._length_unfiltered = int(round(self._length_original * self._active_fraction))
   5823         # self.set_active_fraction(self._active_fraction)

~/.local/lib/python3.6/site-packages/vaex/dataframe.py in add_column(self, name, f_or_array, dtype)
   2929                 if len(self) == len(ar):
   2930                     raise ValueError("Array is of length %s, while the length of the DataFrame is %s due to the filtering, the (unfiltered) length is %s." % (len(ar), len(self), self.length_unfiltered()))
-> 2931                 raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
   2932             # assert self.length_unfiltered() == len(data), "columns should be of equal length, length should be %d, while it is %d" % ( self.length_unfiltered(), len(data))
   2933         valid_name = vaex.utils.find_valid_name(name)

ValueError: array is of length 11, while the length of the DataFrame is 10

Please fix. Thanks!

JovanVeljanoski commented 4 years ago

Hi @hanwsf

Can you please provide some more information on the files you are trying to concat (number of columns, dtypes, whether they have missing values, etc.)? Did you create those hdf5 files with vaex or in some other way?

I can create nearly the same issue as yours if I mistakenly create an hdf5 file like this:

import vaex
df = vaex.from_arrays(x=[1, 2], y=['a', 'b'], z=[100, 200, 300])
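To illustrate the kind of mismatch meant here, the following is a minimal sketch in plain Python, without vaex. The helper `check_column_lengths` is hypothetical and only mimics the wording of the error reported above; it is not how vaex implements its check.

```python
def check_column_lengths(columns):
    """Return the common column length, or raise ValueError on a mismatch.

    Hypothetical helper for illustration only; not part of vaex.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if not lengths:
        return 0
    # The first column defines the expected length of the frame.
    expected = next(iter(lengths.values()))
    for name, n in lengths.items():
        if n != expected:
            raise ValueError(
                "array %r is of length %s, while the length of the DataFrame is %s"
                % (name, n, expected)
            )
    return expected


# Same shape of data as in the snippet above: z has one element too many.
columns = {"x": [1, 2], "y": ["a", "b"], "z": [100, 200, 300]}
try:
    check_column_lengths(columns)
except ValueError as exc:
    message = str(exc)
    print(message)
```

Running a check like this before building a dataframe makes it easier to tell a genuinely malformed file apart from a bug in the library.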

hanwsf commented 4 years ago

Hi, the CSV looks like the following:

building_id,meter,meter_reading,timestamp,site_id,primary_use,square_feet,year_built,floor_count,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
46,0,15.60455607,2016-01-01 00:00:00,0,Retail,9045,2016.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
74,0,12.60368103,2016-01-01 00:00:00,0,Parking,387638,1997.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
93,0,15.36447786,2016-01-01 00:00:00,0,Office,33370,1982.0,,25.0,6.0,20.0,,1019.7,0.0,0.0
105,0,6.8302851600000025,2016-01-01 00:00:00,1,Education,50623,,5.0,3.8,,2.4,,1020.9,240.0,3.1
106,0,0.10979526,2016-01-01 00:00:00,1,Education,5374,,4.0,3.8,,2.4,,1020.9,240.0,3.1
107,0,51.34643040000001,2016-01-01 00:00:00,1,Education,97532,2005.0,10.0,3.8,,2.4,,1020.9,240.0,3.1
108,0,26.74985943,2016-01-01 00:00:00,1,Education,81580,1913.0,5.0,3.8,,2.4,,1020.9,240.0,3.1
109,0,23.720583000000005,2016-01-01 00:00:00,1,Education,56995,1953.0,6.0,3.8,,2.4,,1020.9,240.0,3.1
110,0,25.27351473,2016-01-01 00:00:00,1,Education,27814,2006.0,8.0,3.8,,2.4,,1020.9,240.0,3.1
111,0,49.0625952,2016-01-01 00:00:00,1,Education,118338,1909.0,7.0,3.8,,2.4,,1020.9,240.0,3.1
112,0,3.0115438800000005,2016-01-01 00:00:00,1,Education,32206,,6.0,3.8,,2.4,,1020.9,240.0,3.1
112,3,96.978,2016-01-01 00:00:00,1,Education,32206,,6.0,3.8,,2.4,,1020.9,240.0,3.1
113,0,46.79136330000001,2016-01-01 00:00:00,1,Education,100481,1958.0,9.0,3.8,,2.4,,1020.9,240.0,3.1
113,3,19.597,2016-01-01 00:00:00,1,Education,100481,1958.0,9.0,3.8,,2.4,,1020.9,240.0,3.1
114,0,95.184225,2016-01-01 00:00:00,1,Education,139683,1958.0,13.0,3.8,,2.4,,1020.9,240.0,3.1
114,3,100.0,2016-01-01 00:00:00,1,Education,139683,1958.0,13.0,3.8,,2.4,,1020.9,240.0,3.1
115,0,59.07225330000001,2016-01-01 00:00:00,1,Education,129716,1968.0,6.0,3.8,,2.4,,1020.9,240.0,3.1
116,0,20.31183,2016-01-01 00:00:00,1,Education,37265,,5.0,3.8,,2.4,,1020.9,240.0,3.1
117,0,4.7793179100000005,2016-01-01 00:00:00,1,Education,15489,2004.0,4.0,3.8,,2.4,,1020.9,240.0,3.1
117,3,19.6809,2016-01-01 00:00:00,1,Education,15489,2004.0,4.0,3.8,,2.4,,1020.9,240.0,3.1
118,0,34.35132,2016-01-01 00:00:00,1,Education,138316,1960.0,8.0,3.8,,2.4,,1020.9,240.0,3.1
119,0,64.24869240000001,2016-01-01 00:00:00,1,Education,91149,2007.0,7.0,3.8,,2.4,,1020.9,240.0,3.1
119,3,200.0,2016-01-01 00:00:00,1,Education,91149,2007.0,7.0,3.8,,2.4,,1020.9,240.0,3.1
120,0,15.314475000000002,2016-01-01 00:00:00,1,Education,68211,1976.0,7.0,3.8,,2.4,,1020.9,240.0,3.1
121,0,67.1199,2016-01-01 00:00:00,1,Education,150318,1906.0,9.0,3.8,,2.4,,1020.9,240.0,3.1
121,3,299.7290000000001,2016-01-01 00:00:00,1,Education,150318,1906.0,9.0,3.8,,2.4,,1020.9,240.0,3.1
122,0,39.36333000000001,2016-01-01 00:00:00,1,Education,83043,1991.0,6.0,3.8,,2.4,,1020.9,240.0,3.1
123,0,13.599839999999999,2016-01-01 00:00:00,1,Education,61204,1989.0,6.0,3.8,,2.4,,1020.9,240.0,3.1
124,0,2.7258300000000006,2016-01-01 00:00:00,1,Education,38319,1900.0,6.0,3.8,,2.4,,1020.9,240.0,3.1
125,0,21.95319000000001,2016-01-01 00:00:00,1,Education,16802,1995.0,6.0,3.8,,2.4,,1020.9,240.0,3.1
126,0,13.71708,2016-01-01 00:00:00,1,Education,21539,2004.0,5.0,3.8,,2.4,,1020.9,240.0,3.1
127,0,2.54369766,2016-01-01 00:00:00,1,Lodging/residential,27071,,6.0,3.8,,2.4,,1020.9,240.0,3.1
128,0,9.14472,2016-01-01 00:00:00,1,Lodging/residential,102774,1956.0,7.0,3.8,,2.4,,1020.9,240.0,3.1
129,0,9.70161,2016-01-01 00:00:00,1,Lodging/residential,102957,1968.0,7.0,3.8,,2.4,,1020.9,240.0,3.1
130,0,4.1913300000000016,2016-01-01 00:00:00,1,Lodging/residential,62893,1960.0,16.0,3.8,,2.4,,1020.9,240.0,3.1
131,0,24.20967897,2016-01-01 00:00:00,1,Lodging/residential,66661,1930.0,7.0,3.8,,2.4,,1020.9,240.0,3.1
132,0,5.1585600000000005,2016-01-01 00:00:00,1,Lodging/residential,83108,1995.0,8.0,3.8,,2.4,,1020.9,240.0,3.1
133,0,5.187869999999998,2016-01-01 00:00:00,1,Lodging/residential,64723,1960.0,8.0,3.8,,2.4,,1020.9,240.0,3.1
134,0,7.972320000000002,2016-01-01 00:00:00,1,Lodging/residential,49589,1998.0,8.0,3.8,,2.4,,1020.9,240.0,3.1
135,0,7.884390000000002,2016-01-01 00:00:00,1,Lodging/residential,66532,1967.0,10.0,3.8,,2.4,,1020.9,240.0,3.1
136,0,4.823048430000001,2016-01-01 00:00:00,1,Lodging/residential,56467,1960.0,9.0,3.8,,2.4,,1020.9,240.0,3.1
137,0,4.176675,2016-01-01 00:00:00,1,Entertainment/public assembly,64024,1967.0,6.0,3.8,,2.4,,1020.9,240.0,3.1
138,0,10.6799778,2016-01-01 00:00:00,1,Public services,118231,,6.0,3.8,,2.4,,1020.9,240.0,3.1

Both hdf5 files were created by vaex, and head() works correctly on each of them, but the issue appears after concat.

hanwsf commented 4 years ago

The head of the hdf5:

|  | building_id | meter | meter_reading | timestamp | site_id | primary_use | square_feet | year_built | floor_count | air_temperature | cloud_coverage | dew_temperature | precip_depth_1_hr | sea_level_pressure | wind_direction | wind_speed |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 1022 | 3 | 186.365 | 2016-04-17 17:00:00 | 10 | Education | 84346 | nan | 2 | 11.1 | 0 | -3.9 | 0 | 1023.4 | 50 | 8.8 |
| 1 | 1023 | 0 | 44.6978 | 2016-04-17 17:00:00 | 10 | Education | 87976 | nan | 2 | 11.1 | 0 | -3.9 | 0 | 1023.4 | 50 | 8.8 |
| 2 | 1024 | 0 | 3.73404 | 2016-04-17 17:00:00 | 10 | Education | 53855 | nan | 2 | 11.1 | 0 | -3.9 | 0 | 1023.4 | 50 | 8.8 |
| 3 | 1025 | 0 | 18.5386 | 2016-04-17 17:00:00 | 10 | Lodging/residential | 52034 | nan | 2 | 11.1 | 0 | -3.9 | 0 | 1023.4 | 50 | 8.8 |
| 4 | 1026 | 0 | 53.4908 | 2016-04-17 17:00:00 | 10 | Lodging/residential | 88480 | nan | 1 | 11.1 | 0 | -3.9 | 0 | 1023.4 | 50 | 8.8 |
| 5 | 1027 | 0 | 10.4783 | 2016-04-17 17:00:00 | 10 | Lodging/residential | 35465 | nan | 2 | 11.1 | 0 | -3.9 | 0 | 1023.4 | 50 | 8.8 |
| 6 | 1028 | 0 | 33.7065 | 2016-04-17 17:00:00 | 11 | Education | 81390 | nan | nan | 18.9 | nan | -1.5 | nan | 1029.1 | 260 | 2.6 |
| 7 | 1029 | 0 | 89.098 | 2016-04-17 17:00:00 | 11 | Education | 152559 | nan | nan | 18.9 | nan | -1.5 | nan | 1029.1 | 260 | 2.6 |
| 8 | 1029 | 3 | 31.9694 | 2016-04-17 17:00:00 | 11 | Education | 152559 | nan | nan | 18.9 | nan | -1.5 | nan | 1029.1 | 260 | 2.6 |
| 9 | 1030 | 0 | 24.0342 | 2016-04-17 17:00:00 | 11 | Education | 68030 | nan | nan | 18.9 | nan | -1.5 | nan | 1029.1 | 260 | 2.6 |

hanwsf commented 4 years ago

master_df = vaex.open_many(hdf5_list)
master_df.export_hdf5('./data/train.hdf5', progress=True)

This concatenates all the hdf5 files into one big hdf5 file. Thanks.

JovanVeljanoski commented 4 years ago

Hi @hanwsf

Thanks for your data samples, they helped me re-create the issue. You can track the progress in #531.

Also, note that the issue is only about printing a portion of a dataframe. If you simply do df.head_and_tail, or if you use Jupyter and just display df, it should work.

Yes, indeed: if you export a concatenated dataframe into a single hdf5 file and work with that, you should have no problems. In fact, that is the recommended workflow for now.
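A minimal sketch of that workflow, assuming the part files live in './train/' and match 'train*.hdf5' as in the original report. The function names `find_parts` and `merge_parts` are hypothetical; the vaex calls used (`vaex.open_many`, `export_hdf5`, `vaex.open`) are the ones already mentioned in this thread.

```python
import os
from glob import glob


def find_parts(directory, pattern="train*.hdf5"):
    """Return the matching part files in a stable (lexicographic) order."""
    return sorted(glob(os.path.join(directory, pattern)))


def merge_parts(files, out_path):
    """Concatenate the part files, export them to one hdf5 file, and reopen it."""
    import vaex  # imported here so that listing the files works without vaex

    master_df = vaex.open_many(files)
    master_df.export_hdf5(out_path, progress=True)
    # Work with the single exported file from here on.
    return vaex.open(out_path)


if __name__ == "__main__":
    files = find_parts("./train/")
    if files:
        df = merge_parts(files, "./data/train.hdf5")
        print(df.head())
```

Sorting the file list keeps the row order of the merged file reproducible between runs, since glob does not guarantee any particular order.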

Thanks for making us aware of this issue, though.

Cheers, J.

maartenbreddels commented 4 years ago

Should be fixed by #531

vaexio / vaex

vaex.concat issue #529
