oracle / node-oracledb

Oracle Database driver for Node.js maintained by Oracle Corp.
http://oracle.github.io/node-oracledb/

SIGSEGV somehow connected to ResultSet Close? #368

Closed wilkolazki closed 8 years ago

wilkolazki commented 8 years ago

Hi,

I have created a simple REST service which calls Oracle a few times for each request. The largest query fetches about 1000 rows using a resultSet. The app works for some time, and after a few minutes, on the n-th request, I get a Segmentation Fault.
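For context, a minimal sketch (not from the original report; the SQL, bind values, prefetch size, and helper names are hypothetical) of how such a query is typically executed with resultSet: true before the rows are drained in a loop:

// Hypothetical query and binds; only the resultSet option matters here.
connection.execute(
    'SELECT * FROM some_table WHERE some_col = :val',
    { val: 1 },
    { resultSet: true, prefetchRows: 100 },
    function (err, result) {
        if (err) { return handleError(err); }   // handleError is a placeholder
        // result.resultSet is then drained with repeated getRows() calls and
        // closed with resultSet.close() once getRows() returns no more rows.
        drainResultSet(result.resultSet);       // placeholder for the fetch loop shown below
    }
);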

This is what I get from Segfault-handler module:

PID 8141 received SIGSEGV for address: 0x0
/home/swilkolazki/www/VeraRestServer/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x1cab)[0x7f6b34713cab]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf0a0)[0x7f6b38e040a0]
/home/swilkolazki/www/VeraRestServer/node_modules/oracledb/build/Debug/oracledb.node(_ZN9ResultSet11Async_CloseEP9uv_work_s+0x47)[0x7f6b36848f41]
node[0xfad1c1]
node[0xfbbdf9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7f6b38dfbb50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6b38b4595d]

Here is my config:

Client Shared Library 64-bit - 12.1.0.2.0
oracledb: "1.6.0"
$ uname -a
Linux domx1st 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u5 x86_64 GNU/Linux
$ node --version
v4.3.1

Node is installed with nvm. The system-wide node version is:

$ /usr/bin/node --version
v0.10.29

What should I do to debug the issue?

wilkolazki commented 8 years ago

Hi,

I have found a weird workaround for this issue. I still don't know why there was a segfault.

I am fetching all rows with the function below:


function fetchAllRows(connection, resultSet, numRows, appendTo, callback) {
    resultSet.getRows(
        numRows,
        function (err, rows) {
            if (err) {
                console.trace(err);
                return callback(err);
            } else if (rows.length === 0) {
                // No more rows: close the result set and finish.
                //resultSet.close( callback );                        // -- this causes the segfault
                resultSet.close( function(err) { callback(err); } );  // -- this does not
            } else if (rows.length > 0) {
                // Append this batch and recurse for the next one.
                for (var i = 0, l = rows.length; i < l; ++i) {
                    appendTo.push(rows[i]);
                }
                fetchAllRows(connection, resultSet, numRows, appendTo, callback);
            }
        }
    );
}

As you can see, when there are no more rows, I close the resultSet. When I passed my callback directly to the close method, I would eventually get a segfault. When I wrapped the callback in an anonymous function, the app ran perfectly under siege for a much longer period and nothing happened.

Weird, but works.

I do not have any stripped down code example to demonstrate this issue at the moment. Let me know if you would like me to prepare one.

cjbj commented 8 years ago

@wilkolazki Looks interesting. Can you work on a testcase? Don't forget to close the resultset if getRows returns an error.
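For reference, a minimal sketch (not part of the original comments) of what closing the result set in the error branch of the fetchAllRows helper above could look like:

if (err) {
    console.trace(err);
    // Close the result set even on error so the cursor is released,
    // then report the original error to the caller.
    return resultSet.close(function (closeErr) {
        if (closeErr) { console.trace(closeErr); }
        callback(err);
    });
}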

niallosul commented 8 years ago

I'm also seeing a SIGSEGV when using resultsets. I've found I can limit the frequency of the faults by setting numRows to a low number (~10) and limiting my pooled connections to around 5, but I still see the problem occasionally. With poolMax set to 50 or more, and numRows up above 40, I can get it to happen fairly consistently - around every 500 selects or so.

The same poolMax and numRows settings will occasionally also produce an NJS-032 error, but I can't tell if this is related, as it's not consistent.

I'm working on a windows machine, and connecting to a remote Oracle DB:

Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production

The fault I'm getting is below - if you can't recreate it with the test case provided by @wilkolazki, let me know and I'll try to strip my code down to something you can use.

PID 9668 received SIGSEGV for address: 0xfd35965d
SymInit: Symbol-SearchPath: '.;C:\Program Files\nodejs\experiments;C:\Program Files\nodejs;C:\windows;C:\windows\system32;SRV*C:\websymbols*http://msdl.microsoft.com/download/symbols;', symOptions: 530, UserName: 'niallosul'
OS-Version: 6.1.7601 (Service Pack 1) 0x100-0x1
c:\program files\nodejs\node_modules\segfault-handler\src\stackwalker.cpp (941): StackWalker::ShowCallstack
c:\program files\nodejs\node_modules\segfault-handler\src\segfault-handler.cpp (116): segfault_handler
0000000077359ACF (ntdll): (filename not available): vsprintf_s
00000000773583F2 (ntdll): (filename not available): RtlUnwindEx
0000000077358D58 (ntdll): (filename not available): RtlRaiseException
000007FEFD35965D (KERNELBASE): (filename not available): RaiseException
f:\dd\vctools\crt\crtw32\eh\throw.cpp (154): _CxxThrowException
c:\program files\nodejs\node_modules\oracledb\src\dpi\src\dpiutils.cpp (80): ociCallCommon
c:\program files\nodejs\node_modules\oracledb\src\dpi\src\dpipoolimpl.cpp (174): PoolImpl::terminate
c:\program files\nodejs\node_modules\oracledb\src\njs\src\njspool.cpp (572): Pool::Async_Terminate
000000013FD06687 (node): (filename not available): uv_fs_poll_getpath
000000013FCF3170 (node): (filename not available): uv_once
00000001402398E7 (node): (filename not available): v8::Unlocker::~Unlocker
0000000140239A8E (node): (filename not available): v8::Unlocker::~Unlocker
00000000772359ED (kernel32): (filename not available): BaseThreadInitThunk
000000007736B371 (ntdll): (filename not available): RtlUserThreadStart
niallosul commented 8 years ago

Just realised I posted the seg fault related to the NJS-032 error, not the ResultSet close problem - sorry for any confusion. Here's what I see when my app shuts down 'silently'; it's from ResultSet::Async_Close, the same as in the original post.

PID 1396 received SIGSEGV for address: 0xefc4853c
SymInit: Symbol-SearchPath: '.;C:\Program Files\nodejs\experiments;C:\Program Files\nodejs;C:\windows;C:\windows\system32;SRV*C:\websymbols*http://msdl.microsoft.com/download/symbols;', symOptions: 530, UserName: 'niallosul'
OS-Version: 6.1.7601 (Service Pack 1) 0x100-0x1
c:\program files\nodejs\node_modules\segfault-handler\src\stackwalker.cpp (941): StackWalker::ShowCallstack
c:\program files\nodejs\node_modules\segfault-handler\src\segfault-handler.cpp (116): segfault_handler
0000000077359ACF (ntdll): (filename not available): vsprintf_s
00000000773583F2 (ntdll): (filename not available): RtlUnwindEx
000000007738D348 (ntdll): (filename not available): KiUserExceptionDispatcher
c:\program files\nodejs\node_modules\oracledb\src\njs\src\njsresultset.cpp (630): ResultSet::Async_Close
000000013FA36687 (node): (filename not available): uv_fs_poll_getpath
000000013FA23170 (node): (filename not available): uv_once
000000013FF698E7 (node): (filename not available): v8::Unlocker::~Unlocker
000000013FF69A8E (node): (filename not available): v8::Unlocker::~Unlocker
00000000772359ED (kernel32): (filename not available): BaseThreadInitThunk
000000007736B371 (ntdll): (filename not available): RtlUserThreadStart
cjbj commented 8 years ago

@niallosul getting your testcase would be really helpful.

niallosul commented 8 years ago

@cjbj - testCase.txt is a script that recreates the issue I'm seeing. My apologies for the code - it's butchered down from a couple of different modules, so it's probably borderline illegible. Let me know if you need me to clean it up, add comments, etc. It's not a very realistic example, but it does exhibit the same behavior we're seeing in our test application.

The registrations table I'm pulling from in this example typically has thousands of rows per day, and I limit it down to around 750 or so using the facility id. I'm not great with DDL, so I wasn't sure how to mimic that in a test table.

I think the root of the issue could be just latency or connectivity problems, but I'm not sure how I'm missing the error. Without segfault-handler, I don't see any errors that I could trap and maybe work in a retry or something. It could also be that I'm missing a callback in the async queue I'm using, so hopefully I'm not wasting your time.

Here's the output from two consecutive runs:

First, with a small numRows:

C:\Program Files\nodejs\experiments>node testCase
numRows: 8
poolMax: 10
Made 123 requests
Got 136 rows
.
.
.
Got 1240 rows
Got 1301 rows
Extracted 114454 rows
request queue is currently empty
pool terminated

And then when I turn up the numRows and use a different date range to attempt to force the problem (it happened first try):

C:\Program Files\nodejs\experiments>node testCase
numRows: 80
poolMax: 10
Made 122 requests
Got 183 rows
Got 147 rows
Got 298 rows
Got 274 rows
Got 151 rows
Got 963 rows
Got 1209 rows
Got 1358 rows
Got 1304 rows
Got 1332 rows
Got 1308 rows
Got 1430 rows
Got 1111 rows
PID 6368 received SIGSEGV for address: 0xef54853f
SymInit: Symbol-SearchPath: '.;C:\Program Files\nodejs\experiments;C:\Program Files\nodejs;C:\windows;C:\windows\system32;SRV*C:\websymbols*http://msdl.microsoft.com/download/symbols;', symOptions: 530, UserName: 'niallosul'
OS-Version: 6.1.7601 (Service Pack 1) 0x100-0x1
c:\program files\nodejs\node_modules\segfault-handler\src\stackwalker.cpp (941): StackWalker::ShowCallstack
c:\program files\nodejs\node_modules\segfault-handler\src\segfault-handler.cpp (116): segfault_handler
00000000771A9ACF (ntdll): (filename not available): vsprintf_s
00000000771A83F2 (ntdll): (filename not available): RtlUnwindEx
00000000771DD348 (ntdll): (filename not available): KiUserExceptionDispatcher
c:\program files\nodejs\node_modules\oracledb\src\njs\src\njsresultset.cpp (630): ResultSet::Async_Close
000000013F7F6687 (node): (filename not available): uv_fs_poll_getpath
000000013F7E3170 (node): (filename not available): uv_once
000000013FD298E7 (node): (filename not available): v8::Unlocker::~Unlocker
000000013FD29A8E (node): (filename not available): v8::Unlocker::~Unlocker
00000000770859ED (kernel32): (filename not available): BaseThreadInitThunk
00000000771BB371 (ntdll): (filename not available): RtlUserThreadStart

The reason I think it's possibly latency-related is that the next time I run the exact same date range, it usually works fine (maybe because of caching on the db side):

C:\Program Files\nodejs\experiments>node testCase
numRows: 80
poolMax: 10
Made 122 requests
Got 147 rows
Got 298 rows
Got 183 rows
Got 963 rows
Got 1209 rows
Got 1304 rows
.
.
.
Got 239 rows
Extracted 104012 rows
request queue is currently empty
pool terminated
dmcghan commented 8 years ago

@niallosul The latest release of the driver includes queuing by default. Could you test with that version? Maybe try stripping out your queueing code to see if the new built-in queue helps?

niallosul commented 8 years ago

@dmcghan - will do, thanks for the info

dmcghan commented 8 years ago

@niallosul Sorry, I thought I followed up on my last comment but I guess I lost what I wrote...

Never mind on the queue, I see how you're using it now.

Can you run a test where you don't collect all the rows in an array in JavaScript? I wonder if this is a memory-related issue. With a queue concurrency of 10 (poolMax) you're likely getting 10 very large arrays in your Node process.

Typically with a resultset you'd treat it like a stream and avoid the kind of buffering that's happening in the test. See if you can reproduce it if you just increment a counter of the rows retrieved.
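As a minimal sketch of that suggestion (illustrative only, reusing the getRows()/close() pattern from the fetchAllRows helper earlier in the thread):

// Count rows instead of buffering them in a JavaScript array.
function countAllRows(resultSet, numRows, count, callback) {
    resultSet.getRows(numRows, function (err, rows) {
        if (err) {
            return resultSet.close(function () { callback(err); });
        }
        if (rows.length === 0) {
            // Done: close the result set and report the total.
            return resultSet.close(function (err) { callback(err, count); });
        }
        countAllRows(resultSet, numRows, count + rows.length, callback);
    });
}

// Usage (illustrative): countAllRows(result.resultSet, 80, 0, function (err, total) { ... });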

niallosul commented 8 years ago

@dmcghan - I just made the switch to v1.7.1 and got rid of the async queue I was using, and that seems to have done it. I'm seeing a couple of NJS-040 errors with bigger date ranges, but that's an easy fix.

I would like to switch to a stream, but having the full set in memory (briefly) is necessary for our real app, as we have a couple of further selects to make based on what comes back from the first one (it sounds weird, but the set of subselects varies, so we can't just do it with a single select).

I think we get an array per select rather than per pool (since it's declared at the top of getData() ), but I'll definitely look at that more closely if I start seeing the SIGSEGV again.

dmcghan commented 8 years ago

Well that was lucky! :)

You might want to consider using async's eachSeries method for some more control over processing the array without getting into the complexity of the queue.
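For illustration, a minimal sketch of that async.eachSeries idea (the rows array and handleRow function are hypothetical placeholders):

var async = require('async');    // the async npm package

// Process the buffered rows one at a time, without a queue.
async.eachSeries(rows, function (row, next) {
    handleRow(row, next);        // hypothetical per-row work; calls next(err) when done
}, function (err) {
    if (err) { console.error(err); }
});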

Unless the subselects are based on the entire result set, which is hard to imagine (but entirely possible), I would think you could get it into a stream.

Another option would be to rely more on PL/SQL. I know there's a tendency to want to use JavaScript for some of these things, but PL/SQL does have its advantages. In your case, the fact that all the data wouldn't have to move across the wire could be very beneficial.

I'd love it if you could explain, even with a superficial example, why all the data would need to be moved to the middle tier so that some subselects can be run. If you could do that, I could probably show you an example of how that logic could be implemented more efficiently with PL/SQL.

Have a nice weekend!

cjbj commented 8 years ago

@niallosul I've grabbed your testcase. We'll see if we can uncover any underlying issue. Regarding the NJS-040, check the pool stats and see if your pool is big enough https://github.com/oracle/node-oracledb/blob/master/doc/api.md#connpoolmonitor and https://blogs.oracle.com/opal/entry/node_oracledb_1_7_0
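For reference, a minimal sketch (credentials and settings are placeholders) of creating the pool with the 1.7 queue options and the stats flag described in the connpoolmonitor doc linked above:

var oracledb = require('oracledb');

oracledb.createPool({
    user          : 'scott',        // placeholder credentials
    password      : 'tiger',
    connectString : 'localhost/XE',
    poolMax       : 10,
    queueTimeout  : 60000,          // ms a queued connection request may wait before NJS-040
    _enableStats  : true            // enables pool statistics per the linked doc
}, function (err, pool) {
    if (err) { throw err; }
    // pool._logStats() can then be called (e.g. on a signal) to dump queue statistics.
});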

@wilkolazki do you have a testcase?

niallosul commented 8 years ago

@dmcghan - I'm sure you've got better things to do than a design review of my app, but just in case you're bored, here's a quick explanation with a simple example.

Basically, we need to generate CSV files for a number of downstream systems that can't take data in any other way.

For simplicity's sake, we'll say that our database has 10 tables, and we're talking to 5 different ancillary systems that need various subsets of that data. One table (registrations, for example) is the primary and the other 9 tables are linked on a primary key. Ancillary A might need data from all 10 tables, but ancillaries B, C, D and E might only need data from 3 or 4 tables each. So we need to produce 5 data files, and some of the data in the files will overlap.

We originally wrote each extract as a standalone program, with a big SQL statement joining the necessary tables. This was okay because it was simple, and it worked fine until we got into bigger datasets, where maybe 4 of the SQLs are hitting the same 7 tables and taking an hour each to pull data.

I was familiar with node, and thought this might be an opportunity to divide and conquer. I wrote a node app that has 10 small SQLs, one for each table. The app goes and gets all the qualifying rows from the registration table, and then asynchronously grabs all data from the other tables. Once all the selects are back, I filter the dataset into each file. This app now only pulls the amount of data that the biggest file needed, and produces all five files in basically the same amount of time that the biggest file used to take.

So that's why I need to keep the initial data load in an array and then use it to hit the other tables. I'm sure there's a better way to do it using PL/SQL, but one problem with that is that we have several databases that we're pulling data from, so wouldn't we need to build and maintain the PL/SQL in every database rather than just having one centralized app?

One thing to note - I first built this out using an external JDBC-based API to get the data into my node app, and we never had any memory issues on the node end. I'd think that the array sizes would be the same.

Anyway, I'm definitely open to suggestions on better ways to do this... I'm not sure that streaming is possible, since I need to have all the data before I can start filtering it into each file (a row in the file usually contains data from all tables), but there's probably something that smarter people than me know about

Thanks for all your help so far

dmcghan commented 8 years ago

@niallosul Sorry for the delayed response. Though I'm a little confused by some of the things you said, it sounds like you've optimized things according to your environment. Just in case you don't mind continuing the discussion...

You said:

We originally wrote each extract as a standalone program, with a big SQL statement joining the necessary tables. This was okay because it was simple, and it worked fine until we got into bigger datasets, where maybe 4 of the SQLs are hitting the same 7 tables and taking an hour each to pull data.

What was the original solution? The JDBC-based API you described later?

What do you mean by "...a big SQL joining the necessary tables..." and later "...maybe 4 of the SQLs are hitting the same 7 tables..."?

Have you considered an architecture where you have 1 specifically tuned query for each downstream system? Then you could run them in parallel (depending on the load) in separate node processes without buffering in memory.

If any of those queries were somewhat complex, maybe involving nested queries, I'd recommend looking into pipelined table functions: http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/pipe_paral_tbl.htm#CHDJEGHC

The idea would be to do the heavy lifting in PL/SQL and pipe out only the results you want written to the CSV. This could potentially save lots of network round trips and prevent buffering in both the database and Node.js.

dmcghan commented 8 years ago

@niallosul Scratch the idea on pipelined table functions... I just ran a test and they do not seem to be working as I'd anticipated (I'll ping Chris about that).

niallosul commented 8 years ago

@dmcghan - no worries at all on the delayed response, and thanks for taking the time to look at this.

The original solution was to use our proprietary middleware to generate the files - not worth getting into here, but it's an old SQL-like language that lets you do queries and report/file writing in one place.

The JDBC-based API is what I used when I originally started looking at using node for this. Since we don't own that API, I wanted to remove the dependency and use node-oracledb for the actual queries instead.

Your suggestion to have one specifically tuned query per system is actually what we're trying to get away from - that was the 'big SQL joining the necessary tables' to produce each file. What's been happening with that approach is that we're writing one query per file, but we have hundreds of files to generate, so all the different queries are getting difficult to manage.

Here's a more generalized question that would be great to have an answer to:

Is it better to have hundreds of individual queries pulling data from 10-20 tables, or one query per table with some middleware logic to filter and map the output into hundreds of files?

The latter is what we're working towards, since the individual queries are reusable for different 'main queries', and we can use node to asynchronously hit all the joined tables and pull a superset of data before we start creating files. However, node may not be the best option for the CPU-intensive processing of those large data sets.

Thanks again for any input

cjbj commented 8 years ago

@niallosul I think the answer is: 'it depends'.

One thing that does stand out is that the various deployment scenarios sound like potential pain points. Maybe some new tooling would help?

cjbj commented 8 years ago

It seems that there is a class of seg fault caused by the garbage collector closing resources which should still be in use. We are looking at some changes in the JavaScript work for 1.9 that should help out.

cjbj commented 8 years ago

@wilkolazki @niallosul can you check with node-oracledb 1.9.2 (without any workarounds) and let us know if you still see crashes?

wilkolazki commented 8 years ago

@cjbj, I have just compiled the 1.9.2 driver and ran my app as it was at the moment of reporting this issue. The app has been running for some time now without any segfaults.

The new version fixed the problem I had.

Thank you!

I am sorry for not responding for so long. Unfortunately, I was moved from this app to other projects and could not report back.

niallosul commented 8 years ago

@cjbj - looks good for me so far too... Thanks!

cjbj commented 8 years ago

@wilkolazki @niallosul thanks for checking.

@wilkolazki I hope you get back to Node soon!