rsim / oracle-enhanced

Oracle enhanced adapter for ActiveRecord
MIT License

Iterator to execute query without storing complete result in memory #785

Closed rammpeter closed 7 years ago

rammpeter commented 8 years ago

Hi Raimonds,

it would be great if ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter could be extended with a method similar to exec_query that returns an iterator instead of a result array.

This way one can process large results record by record without holding the complete result in memory. It works reliably for large results that would normally end in "out of memory" errors if processed by exec_query.

I use the following implementation of the class SqlSelectIterator in production; maybe it can fit into one of your next versions.

Best regards Peter

P.S.: It would also be nice if the query timeout via "statement.setQueryTimeout" could be set as a high-level option.

# hold an open SQL cursor and iterate over the SQL result without storing the whole result in an array
# Peter Ramm, 02.03.2016

# extend the class with a getter to allow access to the internal variable @raw_statement
ActiveRecord::ConnectionAdapters::OracleEnhancedJDBCConnection::Cursor.class_eval do
  def get_raw_statement
    @raw_statement
  end
end

# Class extension by module declaration (module ActiveRecord, module ConnectionAdapters, module OracleEnhancedDatabaseStatements)
# does not work as an Engine with the Winstone application server, therefore the class
# ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter is patched directly and extended with the method iterate_query

ActiveRecord::ConnectionAdapters::OracleEnhancedAdapter.class_eval do

  # Method comparable to ActiveRecord::ConnectionAdapters::OracleEnhancedDatabaseStatements#exec_query,
  # but without storing the whole result in memory
  def iterate_query(sql, name = 'SQL', binds = [], modifier = nil, query_timeout = nil, &block)
    type_casted_binds = binds.map { |col, val|
      [col, type_cast(val, col)]
    }
    log(sql, name, type_casted_binds) do
      cursor = nil
      cached = false
      if without_prepared_statement?(binds)
        cursor = @connection.prepare(sql)
      else
        unless @statements.key? sql
          @statements[sql] = @connection.prepare(sql)
        end

        cursor = @statements[sql]

        binds.each_with_index do |bind, i|
          col, val = bind
          cursor.bind_param(i + 1, type_cast(val, col), col)
        end

        cached = true
      end

      cursor.get_raw_statement.setQueryTimeout(query_timeout) if query_timeout

      cursor.exec

      if name == 'EXPLAIN' and sql =~ /^EXPLAIN/
        res = true
      else
        columns = cursor.get_col_names.map do |col_name|
          @connection.oracle_downcase(col_name).freeze
        end
        fetch_options = {:get_lob_value => (name != 'Writable Large Object')}
        while row = cursor.fetch(fetch_options)
          result_hash = {}
          columns.each_index do |index|
            row[index] = row[index].strip if row[index].class == String   # remove possible 0x00 at end of string, which leads to errors in Internet Explorer
            result_hash[columns[index]] = row[index]
          end
          result_hash.extend SelectHashHelper
          modifier.call(result_hash)  unless modifier.nil?
          yield result_hash
        end
      end

      cursor.close unless cached
      nil
    end
  end #iterate_query

end #class_eval

class SqlSelectIterator

  def initialize(stmt, binds, modifier, query_timeout)
    @stmt           = stmt
    @binds          = binds
    @modifier       = modifier              # proc for modification of each record
    @query_timeout  = query_timeout
  end

  def each(&block)
    # Execute SQL and call block for every record of result
    ConnectionHolder.connection.iterate_query(@stmt, 'sql_select_iterator', @binds, @modifier, @query_timeout, &block)
  end

end
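To make the iteration pattern above concrete, here is a minimal, hedged usage sketch. The real ConnectionHolder and iterate_query come from the monkey patch above and need a live Oracle connection; in this sketch a stub connection is substituted (my invention, not part of the patch) so the row-by-row flow can be shown standalone, and SqlSelectIterator is repeated verbatim for self-containment.

```ruby
# Stub standing in for the patched adapter connection: yields each row hash
# to the block instead of returning one big result array.
class StubConnection
  def iterate_query(sql, name, binds, modifier, query_timeout, &block)
    [{ 'id' => 1 }, { 'id' => 2 }].each do |row|
      modifier.call(row) unless modifier.nil?
      block.call(row)
    end
  end
end

class ConnectionHolder
  def self.connection
    StubConnection.new
  end
end

# Same class as in the issue above, repeated so this sketch runs on its own.
class SqlSelectIterator
  def initialize(stmt, binds, modifier, query_timeout)
    @stmt          = stmt
    @binds         = binds
    @modifier      = modifier
    @query_timeout = query_timeout
  end

  def each(&block)
    ConnectionHolder.connection.iterate_query(@stmt, 'sql_select_iterator', @binds, @modifier, @query_timeout, &block)
  end
end

ids = []
SqlSelectIterator.new('SELECT id FROM big_table', [], nil, 600).each do |row|
  ids << row['id']   # each row is processed and released; no array of all rows is built
end
```

Because each yields rows one at a time, memory stays flat regardless of result size, which is the whole point of the patch.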
yahonda commented 8 years ago

Thanks for providing the code. Would you consider opening this change as a pull request, including some tests?

ekr1 commented 8 years ago

+1 for this change. I just had a case where my code wanted to query about 20 million rows of about 5 columns; the process took about 32GB (!) of RAM before being killed by ulimit. ;) @rammpeter, I'm using your monkey patch now, but can you please upload your change as pull request, as well?

duffyjp commented 8 years ago

I stumbled on this thread looking for a solution to a similar problem. I have a view I'm trying to import with 111 million rows. Obviously I can't fit that in memory on even the beefiest of machines. I managed to use 96GB ram and 47GB swap before the server just gave up. My server guy came in and said, _"That's all there is, and there ain't no more."_

Originally I wanted to use in_batches to process 1000 or so at a time. I couldn't however as my view doesn't have a primary ID.

Is there a way to retrofit in_batches to use ROWNUM instead? It seems like that would be the more natural fit on Oracle in the first place.
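One way to emulate in_batches without a primary key is the classic Oracle nested-subquery ROWNUM pagination pattern. The helper below is a sketch of that idea (rownum_batch_sql is a hypothetical name, not an oracle-enhanced API), and it assumes the base query has a stable ORDER BY, since without one Oracle does not guarantee consistent row order across executions.

```ruby
# Hypothetical helper: wraps an arbitrary query in the nested-subquery ROWNUM
# pattern so a view without a primary key can still be read in fixed-size
# batches. ROWNUM is assigned before ORDER BY in a single query level, which
# is why the base query must be nested first.
def rownum_batch_sql(base_sql, offset, batch_size)
  <<~SQL.strip
    SELECT * FROM (
      SELECT raw_result.*, ROWNUM row_num FROM (#{base_sql}) raw_result
      WHERE ROWNUM <= #{offset + batch_size}
    ) WHERE row_num > #{offset}
  SQL
end

# A driver loop would then fetch batch after batch until one comes back empty:
#   offset = 0
#   loop do
#     rows = connection.select_all(rownum_batch_sql('SELECT * FROM big_view ORDER BY some_col', offset, 1000))
#     break if rows.empty?
#     rows.each { |row| process(row) }
#     offset += 1000
#   end
```

Note each batch re-runs the query from the top, so deep offsets get progressively slower; for a single full pass, the cursor-based iterator above avoids that cost entirely.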

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mlh758 commented 6 years ago

This is pretty old, but you can use Model.connection.raw_connection.exec(sql, *binds), which returns a cursor.

From there it's just fetch_hash until the cursor is empty.
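As a sketch of that loop: on MRI with the ruby-oci8 driver, raw_connection is an OCI8 instance and exec returns an OCI8::Cursor whose fetch_hash returns nil once the result is exhausted. A stub cursor (my invention) stands in here so the loop shape is runnable without a database.

```ruby
# Stub mimicking the relevant OCI8::Cursor behaviour: fetch_hash returns one
# row hash per call and nil when no rows remain.
class StubCursor
  def initialize(rows)
    @rows = rows.dup
  end

  def fetch_hash
    @rows.shift   # nil once exhausted, like OCI8::Cursor#fetch_hash
  end

  def close
  end
end

cursor = StubCursor.new([{ 'ID' => 1, 'NAME' => 'a' }, { 'ID' => 2, 'NAME' => 'b' }])
# In real code: cursor = Model.connection.raw_connection.exec(sql, *binds)
names = []
while (row = cursor.fetch_hash)
  names << row['NAME']   # each row is handled individually, never accumulated by the driver
end
cursor.close
```

This sidesteps ActiveRecord entirely, so you lose type casting and the SelectHashHelper conveniences from the patch above, but memory use stays constant.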