mattn / go-oci8

Oracle driver for Go using database/sql
https://mattn.kaoriya.net/
MIT License

Improve fetch performance by fetching more rows at a time #38

Closed ricsmania closed 5 years ago

ricsmania commented 10 years ago

I have been comparing go-oci8 to other drivers written in Go and other languages. I have executed a simple test where I query and fetch all the records (rows.Next()) from a single table with 32 columns, 2 million rows and an average row size of 58 bytes. The application and database are on the same Windows machine. Here are the results, from fastest to slowest:

- PL/SQL: 4.8 seconds
- .Net (System.Data.OracleClient): 6 seconds
- go-oci8: 18 seconds
- goracle: 33 seconds
- Delphi (DOA): 58 seconds

go-oci8 has good performance compared to goracle and Delphi, however it is around 3x slower than .Net.

I started investigating and the most obvious difference is that in go-oci8 when OCIStmtFetch is called the nrows parameter is always 1:

func (rc *OCI8Rows) Next(dest []driver.Value) error {
    rv := C.OCIStmtFetch(
        (*C.OCIStmt)(rc.s.s),
        (*C.OCIError)(rc.s.c.err),
        1, // nrows
        C.OCI_FETCH_NEXT,
        C.OCI_DEFAULT)

.Net, however, passes a variable number of rows:

rc = TracedNativeMethods.OCIStmtFetch(
                                        _statementHandle,           // stmtp
                                        ErrorHandle,                // errhp
                                        _rowsToPrefetch,            // crows
                                        OCI.FETCH.OCI_FETCH_NEXT,   // orientation
                                        OCI.MODE.OCI_DEFAULT        // mode
                                        );

From what I've seen in the code, .Net fetches around 64 kB of data each time, which in my case would mean around 1100 rows.

I'm trying to make go-oci8 fetch more rows, however when I do that errors start appearing everywhere, especially on the calls to close the cursors.

Before I dig any deeper I was wondering if you have already tried anything similar, and maybe have some idea on what would be the parts that need to be modified in order to do that.

mattn commented 10 years ago

Oh! Great catch!

Could you please send me a pull request?

lunny commented 9 years ago

+1

DunesUnLimited commented 8 years ago

The default array prefetch size set by go-oci8 is 10 rows. You can raise that size by setting the environment variable PREFETCH_ROWS. I do that from an init() func:

    if env := os.Getenv("PREFETCH_ROWS"); env == "" {
        os.Setenv("PREFETCH_ROWS", "128")
    }

A word of warning: people typically use values that are too extreme. Array fetching reduces overhead because per-fetch bookkeeping only needs to be done once for the whole batch. When you fetch 100 or 1000 rows at a time, the overhead has already been reduced to 1 percent or less of what it was, and typically there is not much more to be gained by going bigger.

Lercher commented 6 years ago

Probably the default value of 10 is too low. For my particular set of queries (thousands of small rows consisting of column names plus comments) over a 16 MBit DSL VPN, the default value uses less than 100 kbit/s of bandwidth; starting at 256 the link is fully used: the run time drops from over 2 min to under 8 s.

The problem is that I found this parameter only by accident. In fact, I guessed that the slow performance was probably due to excessive handshaking, so I had a look at the source code.

I propose this change in oci8.go:

    // set safe defaults
    dsn.prefetch_rows = 10 ---> 256, but not more than 1024
mattn commented 6 years ago

The current implementation can already change this value with a query parameter: scott/tiger@host?prefetch_rows=256. Do you mean we had better change the default value?

Lercher commented 6 years ago

Yes, that's what I suggested: increase the default value.

I guess it's in the spirit of Go: provide good defaults. However, I can't judge the impact of such a change, but I hope you can, mattn.

mattn commented 6 years ago

@djadala How do you think?

djadala commented 6 years ago

Hi, I'm not sure my opinion is right, but I think the current default is OK. When using databases, most of the time is consumed by executing queries, not by retrieving results. And this default can be changed by a query parameter. Regards

Lercher commented 6 years ago

May I kindly ask what is meant by distinguishing "executing a query" from "retrieving results"? In my understanding, these are exactly the same thing.

djadala commented 6 years ago

Hi, the query is executed on the server, and then the results are sent to the client (over the network). For a typical SQL query with a small number of result rows, executing on the server consumes more time than sending those results to the client. Regards.

DunesUnLimited commented 6 years ago

Unless you are fetching over a high-latency network, for instance across the Atlantic. That is slow. For client-server communication, latency is typically the problem, not bandwidth.


Lercher commented 6 years ago

Here are some benchmarks, varying only the prefetch_rows parameter from 10 up to 2560. Low latency LAN:

BenchmarkPrefetch10-4                  1        3616897100 ns/op
BenchmarkPrefetch20-4                  1        2719893500 ns/op
BenchmarkPrefetch40-4                  1        2667631800 ns/op
BenchmarkPrefetch80-4                  1        2394983900 ns/op
BenchmarkPrefetch160-4                 1        2392999600 ns/op
BenchmarkPrefetch320-4                 1        2238717200 ns/op
BenchmarkPrefetch640-4                 1        2192580100 ns/op
BenchmarkPrefetch1280-4                1        2148700900 ns/op
BenchmarkPrefetch2560-4                1        2165855600 ns/op

Via 30ms-ping VPN to the same server:

BenchmarkPrefetch10-4                  1        132144382800 ns/op
BenchmarkPrefetch20-4                  1        67977726500 ns/op
BenchmarkPrefetch40-4                  1        37118190000 ns/op
BenchmarkPrefetch80-4                  1        20707414800 ns/op
BenchmarkPrefetch160-4                 1        12108322400 ns/op
BenchmarkPrefetch320-4                 1        8150374500 ns/op
BenchmarkPrefetch640-4                 1        5853073900 ns/op
BenchmarkPrefetch1280-4                1        4737049800 ns/op
BenchmarkPrefetch2560-4                1        4209097100 ns/op

The underlying function issues 4 queries returning a total of about 50k small records, run against the same server from two different locations and on different client hardware. So don't compare the absolute values; only the ordering within a group is meaningful.

My recommendation to my fellow colleagues is: unless RAM is a scarce resource, and unless for rows.Next() {...} loops are terminated early, the prefetch_rows value should be increased, even in low-latency networks. Of course, it's always preferable to put the client code near the database server, but sometimes the software system is forced to honor external constraints that make such a design impossible.

Off topic: further investigation showed that the official .Net native client suffers from similar "chatty" performance problems over high-latency networks as well. There the default is 128 kB of prefetch memory (OracleCommand.FetchSize). Increasing this to 4 MB in the test case noticeably quadruples the fraction of bandwidth used.

cjbj commented 6 years ago

In other drivers we've chosen 100 as the default prefetch or array fetch size. In ODPI-C (used by cx_Oracle, node-oracledb etc) we have a fixed prefetch of 2 and then a user-settable array fetch size, with default 100. This is mostly arbitrary but definitely better than 10 for many queries. Generally you don't want to oversize the value used in applications - if you have queries fetching single rows, then reduce the value.

mattn commented 5 years ago

@Lercher do you still have an opinion?

Lercher commented 5 years ago

After reviewing the comments, my opinion didn't change: the parameter needs to be documented better, so that its impact is more visible, and it needs a better default value.

So, just do as you please.

MichaelS11 commented 5 years ago

Added benchmarks for prefetch: https://github.com/mattn/go-oci8/pull/289

In my opinion, the prefetch rows should default to a higher number, say 1000 or so. The more important setting, in my opinion, is prefetch_memory. That way, whether you have lots of little rows or a small number of large rows, you are pulling around the same amount of data across the wire.

Keep in mind there is no way to pick numbers that work in all cases. I also think the common use case is a database close to the client (on the same local network), so the defaults should be selected with that in mind.

Below are benchmark results for prefetch rows set to 1000 and prefetch memory going from 32768 down to 2048, over three runs. The table being selected from is a single-row table with one integer column.

BenchmarkPrefetchR1000M32768-8          2000000000               0.05 ns/op
BenchmarkPrefetchR1000M16384-8          2000000000               0.05 ns/op
BenchmarkPrefetchR1000M8192-8           1000000000               0.22 ns/op
BenchmarkPrefetchR1000M4096-8           2000000000               0.13 ns/op
BenchmarkPrefetchR1000M2048-8           2000000000               0.24 ns/op

BenchmarkPrefetchR1000M32768-8          2000000000               0.05 ns/op
BenchmarkPrefetchR1000M16384-8          2000000000               0.05 ns/op
BenchmarkPrefetchR1000M8192-8           1000000000               0.22 ns/op
BenchmarkPrefetchR1000M4096-8           2000000000               0.13 ns/op
BenchmarkPrefetchR1000M2048-8           2000000000               0.38 ns/op

BenchmarkPrefetchR1000M32768-8          2000000000               0.04 ns/op
BenchmarkPrefetchR1000M16384-8          2000000000               0.07 ns/op
BenchmarkPrefetchR1000M8192-8           2000000000               0.07 ns/op
BenchmarkPrefetchR1000M4096-8           2000000000               0.19 ns/op
BenchmarkPrefetchR1000M2048-8           1000000000               0.47 ns/op

I think having prefetch rows set to 1000 and prefetch memory set to 8192 might work well as defaults. The prefetch memory could go a little higher, but I would be mindful of setting it too high.

Lercher commented 5 years ago

0.05 ns/op = 20 Gops/s; that's weird on PC hardware, because it means you are measuring a single CPU operation running at 4 GHz with a fully utilized 5-stage pipeline. Are you sure you are measuring a database access? A single network round-trip on a GBit LAN is on the order of microseconds, routed it's hundreds of microseconds, and on the internet tens to hundreds of milliseconds. So I guess there was no LAN and no database involved in the benchmark.

However, I agree with the LAN argument that its optimal values should drive the defaults. It's just my special case that I only have a WAN to reach the database, so it is probably sufficient to have a prominent notice in the docs that the discussed parameter has to be selected carefully on a high-latency network.

MichaelS11 commented 5 years ago

There was a LAN and a database involved; you're welcome to check out the code. You are correct, though: there is an issue with the benchmarks and they are not valid.

I think there is some kind of local cache. I looked at the OCI statement cache and object cache, but both need the OCI environment initialized in object mode, which is not being done. Are there any other OCI caches?

cjbj commented 5 years ago

@MichaelS11 I'd be uncomfortable to see the default as 1000. I know a lot of Oracle-land does single-row fetches. I'd go with 100 and improve documentation.

MichaelS11 commented 5 years ago

Going to look at updating documentation for prefetch rows and memory very soon.

As for a default of 1000 prefetch rows: if the prefetch memory is set to 0 (unlimited), then 1000 is way too much. I would even say that a single row could be too much without also having a prefetch memory limit. Looking at 100 prefetch rows with unlimited prefetch memory, that could easily be a megabyte or more with just a handful of larger VARCHAR2 columns.

I am suggesting a default of 1000 prefetch rows with a prefetch memory limit. Anywhere between 1K and 16K of prefetch memory seems good to me. I was thinking 8K at first, but maybe starting with 1K and seeing how that goes would be a good idea?

MichaelS11 commented 5 years ago

Looking at the benchmark tests again, I wonder if they are correct after all. I updated them (https://github.com/mattn/go-oci8/pull/291) and added more data to pull. Ran them with: -bench=BenchmarkPrefetch -benchmem -benchtime 10s -count 3

BenchmarkPrefetchR1000M32768-8          10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M32768-8          10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M32768-8          10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M16384-8          10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M16384-8          10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M16384-8          10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M8192-8           10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M8192-8           10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M8192-8           10000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M4096-8           20000000000              0.02 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M4096-8           20000000000              0.01 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M4096-8           20000000000              0.02 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M2048-8           2000000000               0.37 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M2048-8           3000000000               0.16 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M2048-8           1000000000               0.47 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M1024-8           3000000000               0.51 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M1024-8           1000000000               1.43 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M1024-8           3000000000               0.47 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M512-8            2000000000               0.84 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M512-8            3000000000               0.58 ns/op            0 B/op          0 allocs/op
BenchmarkPrefetchR1000M512-8            2000000000               0.82 ns/op            0 B/op          0 allocs/op

While running them I watched the network usage and noticed it ran around 7 Mbps. As the test prefetch got smaller, it slowed down to around 750 Kbps. The prefetch may simply be fetching more rows as rows are consumed; if that is the case, network latency would only really slow things down for the first batch of rows.

Doing some calculations, the amount of data per full select is 625 Kb, and it takes around 108 ms per full select. That works out to around 6 Mbps, if I am not mistaken?

Lercher commented 5 years ago

I'm awfully sorry. I guess your benchmark code is missing the central for loop over b.N (see https://golang.org/pkg/testing/ for an example); that's why the figures shown are plainly wrong, or I overlooked something important. Anyway, 7 Mbit/s would be an awful transfer rate for a query whose base table fits completely in RDBMS RAM. When queried over a standard 1 GBit client LAN it should reach somewhere around 70 MByte/s, maybe only 35, but not around 1% of the LAN's maximum transfer rate.

MichaelS11 commented 5 years ago

The benchmarks are not in a b.N loop. I don't think that is required, is it? It just quits the loop sooner so that the time matches the requested benchtime more closely. Since the test is shorter than the requested 3 seconds, it calls the benchmark function (like BenchmarkPrefetchR1000M32768, for example) again and again until it fills up the requested amount of time.

The higher the prefetch memory limit is set, the more network bandwidth the query takes and the faster the data is returned. I think the limit here is not network bandwidth but network latency. Pulling a full table select of 625 Kb in around 108 ms seems pretty good to me.

I want people to look at the benchmark code and make sure it is correct, and I would also like people to run it themselves and provide corrections if it is not. I also encourage you to write a benchmark test yourself if you think there is a better one.

MichaelS11 commented 5 years ago

It looks like the b.N loop is required to get good numbers for x/op after all. I updated the code and also added a case with prefetch rows 10 and prefetch memory 0 (unlimited). https://github.com/mattn/go-oci8/pull/293 Here are the new results:

BenchmarkPrefetchR1000M32768-8           1000000              4664 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M32768-8           1000000              4898 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M32768-8           1000000              4831 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M16384-8           1000000              5871 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M16384-8           1000000              5530 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M16384-8           1000000              5753 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M8192-8             500000             10443 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M8192-8            1000000             10471 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M8192-8             500000             10218 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M4096-8             300000             15382 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M4096-8             200000             17582 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M4096-8             300000             15520 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M2048-8             200000             33727 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M2048-8             100000             34934 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M2048-8             200000             34245 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M1024-8             100000             46573 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M1024-8             100000             59084 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M1024-8              50000             72257 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M512-8               50000             96822 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M512-8               30000            160117 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR1000M512-8               30000            147686 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR10M0-8                   20000            306209 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR10M0-8                   10000            335251 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR10M0-8                   20000            195982 ns/op              72 B/op          4 allocs/op

Numbers look much better.

Note that the network usage stayed the same.

MichaelS11 commented 5 years ago

So should I create a PR for default 1000 prefetch rows and 4096 prefetch memory limit? Might be a good middle of the road?

MichaelS11 commented 5 years ago

Updated documentation for prefetch rows and memory: https://github.com/mattn/go-oci8/pull/294

mattn commented 5 years ago

Is everyone happy with #294? If so, I'll merge it.

cjbj commented 5 years ago

I'd still prefer a smaller default row size, which may be nicer to the DB in the common cases :)

MichaelS11 commented 5 years ago

@cjbj Would it be possible to get details on why you think this, and what you think the common case is?

cjbj commented 5 years ago

The common case is a 'how long is a piece of string' question. At one end, we do know that a lot of queries fetch just one row. At the other end, it's hard to tell. We've been using 100 for other drivers as a happy medium. Hopefully users who are interested in tuning will read the doc about how to change the default.

MichaelS11 commented 5 years ago

@cjbj The prefetch rows and prefetch memory settings work together. So a prefetch rows of 100 with unlimited (0) prefetch memory can be really bad for the database and network, because it could preload a very large amount of data.

So I would really like to understand, in detail, why you think prefetch rows 1000 with prefetch memory 4096 is not a good idea. What would you set prefetch rows and prefetch memory to, and why?

cjbj commented 5 years ago

No choice will work for everyone. Since the options are tunable, it is up to the user to choose the best settings. I'd keep them closer to the lower end to avoid the large amounts of memory you mention.

Some discussion: (i) the Oracle Real World Performance team prefers using memory limits, since that lets network buffer sizes be consistent; (ii) PHP OCI8 was once bitten by an OCI bug when both memory and row limits were set and the query was over a dblink (bug 6039623, fixed in 12.1), so I've been biased against setting both concurrently; (iii) memory limits generally don't mean a lot to users; (iv) having two controls adds user-interface complexity; (v) prefetching doesn't work when LOBs are involved - https://github.com/rsim/oracle-enhanced/issues/1755 (vi) https://github.com/oracle/odpi/issues/73

Overall, for ODPI-C (used by Python cx_Oracle, Node.js node-oracledb, Go goracle, etc.) we went with a 'row' limit control (which controls array fetch sizes, not prefetch size).

MichaelS11 commented 5 years ago

I think setting prefetch rows to unlimited (0) and prefetch memory to something like 4096 would be alright as well.

BenchmarkPrefetchR0M4096-8        200000             15707 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M4096-8        200000             20033 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M4096-8        200000             15569 ns/op              72 B/op          4 allocs/op

Should we go with default prefetch rows unlimited (0) and prefetch memory 4096?

MichaelS11 commented 5 years ago

Changed PR to prefetch rows 0 and prefetch memory 4096.

BenchmarkPrefetchR0M32768-8      1000000              4033 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M32768-8      1000000              3993 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M32768-8      1000000              4111 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M16384-8      1000000              6049 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M16384-8      1000000              6058 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M16384-8      1000000              6006 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M8192-8        500000             10116 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M8192-8        500000              9283 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M8192-8        500000              9580 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M4096-8        200000             17424 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M4096-8        200000             17014 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M4096-8        200000             17737 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M2048-8        200000             24226 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M2048-8        100000             38086 ns/op              72 B/op          4 allocs/op
BenchmarkPrefetchR0M2048-8        100000             35386 ns/op              72 B/op          4 allocs/op

MichaelS11 commented 5 years ago

So is this good to go?

mattn commented 5 years ago

I merged https://github.com/mattn/go-oci8/pull/302

If you have an issue, please file new issue.

johan-boule commented 3 years ago

Don't forget to add the same options (with a good memory default) for output cursors; otherwise it's one round-trip per row again :/

cjbj commented 3 years ago

Seeing this thread active again: note that since the last discussion above, ODPI-C (and the drivers on top of it) now also lets the prefetch row value be altered. This can be used to reduce round-trips when the number of rows to be fetched is known, which helps app scalability. See https://cx-oracle.readthedocs.io/en/latest/user_guide/tuning.html#choosing-values-for-arraysize-and-prefetchrows for some discussion.