pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

11.0 problems with hd5store table remove method #3396

Closed jonblunt closed 11 years ago

jonblunt commented 11 years ago

I have a hd5store that accumulates data by batch. However the batches overlap and I want to just add the new rows..
The data is in a table with a major index of event timestamp. I can assume the new batch has complete data for the first day so all I need to do is to remove all rows with a timestamp >= min_new_date.

I am using python 2.7, pandas 0.11.0.dev-b41dc91 on Windows 7 64bit and lget the following error however I tried to pass in the date In this example I hard code the date string to simplify the example.

dstore = pyt.HDFStore(hdfile)

dstore Out[39]: <class 'pandas.io.pytables.HDFStore'> File path: Y:\HAVI_MONITORING\Log_Scan_Files\ROPLogHistComp2.h5 /files series (shape->[1])
/logdata frame_table (typ->appendable_multi,nrows->1148964,ncols->13,indexers->[index],dc->[FileId,Start,Wob,Start_date,Type_1]) /meta frame (shape->[1,1])
/stores frame_table (typ->appendable,nrows->1123,ncols->20,indexers->[index])

dstore.remove('logdata','major_axis>=20130327')

Exception Traceback (most recent call last)

in () ----> 1 dstore.remove('logdata','major_axis>=20130327') C:\Python27\lib\site-packages\pandas\io\pytables.pyc in remove(self, key, where, start, stop) 546 if not s.is_table: 547 raise Exception('can only remove with where on objects written as tables') --> 548 return s.delete(where = where, start=start, stop=stop) 549 550 def append(self, key, value, columns=None, **kwargs): C:\Python27\lib\site-packages\pandas\io\pytables.pyc in delete(self, where, *_kwargs) 2748 # create the selection 2749 table = self.table -> 2750 self.selection = Selection(self, where, *_kwargs) 2751 values = self.selection.select_coords() 2752 C:\Python27\lib\site-packages\pandas\io\pytables.pyc in **init**(self, table, where, start, stop, **kwargs) 3275 self.coordinates = where.values 3276 else: -> 3277 self.terms = self.generate(where) 3278 3279 # create the numexpr & the filter C:\Python27\lib\site-packages\pandas\io\pytables.pyc in generate(self, where) 3304 3305 queryables = self.table.queryables() -> 3306 return [Term(c, queryables=queryables) for c in where] 3307 3308 def select(self): C:\Python27\lib\site-packages\pandas\io\pytables.pyc in **init**(self, field, op, value, queryables) 3132 3133 if len(self.q): -> 3134 self.eval() 3135 3136 def **str**(self): C:\Python27\lib\site-packages\pandas\io\pytables.pyc in eval(self) 3158 3159 if not self.is_valid: -> 3160 raise Exception("query term is not valid [%s]" % str(self)) 3161 3162 # convert values if we are in the table Exception: query term is not valid [field->major_axis,op->>=,value->['20130327']] C:\Python27\lib\site-packages\pandas\io\pytables.pyc in **init**(self, table, where, start, stop, **kwargs) 3275 self.coordinates = where.values 3276 else: -> 3277 self.terms = self.generate(where) 3278 3279 # create the numexpr & the filter C:\Python27\lib\site-packages\pandas\io\pytables.pyc in generate(self, where) 3304 3305 queryables = self.table.queryables() -> 3306 return [Term(c, queryables=queryables) for c in where] 3307 3308 def select(self): C:\Python27\lib\site-packages\pandas\io\pytables.pyc in **init**(self, field, op, value, queryables) 3132 3133 if len(self.q): -> 3134 self.eval() 3135 3136 def **str**(self): C:\Python27\lib\site-packages\pandas\io\pytables.pyc in eval(self) 3158 3159 if not self.is_valid: -> 3160 raise Exception("query term is not valid [%s]" % str(self)) 3161 3162 # convert values if we are in the table Exception: query term is not valid [field->major_axis,op->>=,value->['20130327']]
jreback commented 11 years ago

use index instead of major_axis (that example is a panel)

eg 'index>=20120101'

u can also do

Term('index','>',date)

where date is a date like object ( or string)

just so u know deleting is not fast

and every once in a while run ptrepack on the file

jreback commented 11 years ago

a faster way might be to accumulate your rows for say today in a new table (just append them) then say at the end of the day read today's table in (which will only read the last of duplicate rows, and hence is slower when dups r there), and then append in one shot to your original table

but should be faster than delete/ append off of main table

jonblunt commented 11 years ago

Got the remove to work. Thanks

Batches are several days, usually eight days day received every seven days. The data set now covers several years of activity. The collecting process is not completely standardized and, unfortunately, it is possible some of the last rows in the batch represent partial work. Those will be trued up in the next transmission. Hence my approach.

I will now think about how to stop partial records getting loaded in the first place.

jb

.

jreback commented 11 years ago

great

jreback commented 11 years ago

ok 2 close?

jonblunt commented 11 years ago

Yes .

Thanks

On Thu, Apr 18, 2013 at 7:53 PM, jreback notifications@github.com wrote:

ok 2 close?

— Reply to this email directly or view it on GitHubhttps://github.com/pydata/pandas/issues/3396#issuecomment-16622049 .