mplourde / rpostgresql

Automatically exported from code.google.com/p/rpostgresql
0 stars 1 forks source link

dbSendQuery() will not function properly for very large datasets #54

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. One of the key purpose of dbSendQuery() is to be able to read a large result 
set in small batches so that you can fit the processing into a reasonable 
amount of memory.  This is NOT what happens with RPostgreSQL, unfortunately.  
Watching the memory use of the process it is clear that more or less the same 
memory use occurs regardless of the amount of data that fetch() retrieves at 
each gulp.  

2. Reading through the source code, dbSendQuery causes a PQexec() to be called. 
 This PostgreSQL library routine appears to read the entire result set into 
memory.  That is why the fetch size has no effect on memory use.

3. I believe that dbSendQuery should call PQsendQuery() instead.  This 
PostgreSQL library routine is specifically intended to read large result sets 
in small batches so that they can fit into memory.

4. The only workaround for this bug in the implementation of dbSendQuery is to 
generate a large number of SQL SELECT statements, each of which uses a WHERE 
clause to limit itself to reading in only a portion of the input table.  Use 
dbGetQuery() to read in each chunk, then discard the values when you read in 
the next chunk.  In my application, I need an additional preliminary step of 
reading in columns from the table to try to guess ranges for the WHERE clause 
that will return reasonable numbers of records for each batch.  Yuck!!!

What version of the product are you using? On what operating system?

Windows 7, R 2.9.2, RPostgreSQL 0.4, Eclipse IDE with StatET

Original issue reported on code.google.com by bmusi...@aptecgroup.com on 24 May 2013 at 3:53

GoogleCodeExporter commented 9 years ago
PostgreSQL has a single-row mode in which you can fetch one row of a result set 
at a time.  To invoke this, start the query asynchronously with PQsendQuery(), 
then call PQsetSingleRowMode().  Now calls to PQgetResult() will return only 
one row at a time.

Original comment by bmusi...@aptecgroup.com on 27 May 2013 at 10:59

GoogleCodeExporter commented 9 years ago
1
dbSendQuery() and fetch() combination exist just for the compatibility to the 
codes using it. This is not designed to work in the way you thought. It should 
work if you have sufficient memory. Otherwise, that is unlucky.

2, 3
dbSendQuery() submits and synchronously executes the SQL statement and have to 
return result set object. PQsendQuery() just send the SQL and let them work 
asynchronously without waiting for the result(s). There is no function to fetch 
a certain amount of rows after PQsendQuery() either. That is the result set 
batch size is not controllable, anyway. PQsetSingleRowMode appeared at version 
9.2. This is too new to use as a default behavior in a library.  

4 You may consider using OFFSET and LIMIT instead of WHERE to construct queries
to return some limited number of records you like.

Original comment by tomoa...@kenroku.kanazawa-u.ac.jp on 11 Jun 2013 at 2:46