Closed wilkolazki closed 8 years ago
Hi,
I have found weird workaround of this issue. I still don't know why there was a segfault.
I am fetching all rows with the function below:
function fetchAllRows(connection, resultSet, numRows, appendTo, callback) {
    resultSet.getRows(
        numRows,
        function (err, rows) {
            if (err) {
                console.trace(err);
                return callback(err);
            } else if (rows.length === 0) {
                //resultSet.close( callback ); // -- this causes segfault
                resultSet.close( function(err) { callback(err); } ); // -- this does not
            } else if (rows.length > 0) {
                for (var i = 0, l = rows.length; i < l; ++i) {
                    appendTo.push(rows[i]);
                }
                fetchAllRows(connection, resultSet, numRows, appendTo, callback);
            }
        }
    );
}
As you can see, when there are no more rows, I close the resultSet. When I passed my callback directly to the close method, I eventually got a segfault. When I wrapped the callback in an anonymous function, the app ran perfectly under siege for a much longer period and nothing happened.
Weird, but works.
I do not have any stripped down code example to demonstrate this issue at the moment. Let me know if you would like me to prepare one.
@wilkolazki Looks interesting. Can you work on a testcase? Don't forget to close the resultset if getRows returns an error.
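To illustrate that suggestion, here is a sketch (not the poster's actual code) of the same fetchAllRows shape where the error branch also closes the resultSet before reporting the original error, so the cursor is not leaked when getRows fails:

```javascript
// Sketch only: same recursive shape as the function in the original post,
// but the getRows error branch now closes the resultSet too. The original
// error is reported to the caller either way.
function fetchAllRows(connection, resultSet, numRows, appendTo, callback) {
    resultSet.getRows(numRows, function (err, rows) {
        if (err) {
            // Close the cursor first, then surface the original error.
            resultSet.close(function (closeErr) {
                callback(err);
            });
        } else if (rows.length === 0) {
            // No more rows: close normally and report any close error.
            resultSet.close(function (closeErr) {
                callback(closeErr);
            });
        } else {
            for (var i = 0; i < rows.length; i++) {
                appendTo.push(rows[i]);
            }
            fetchAllRows(connection, resultSet, numRows, appendTo, callback);
        }
    });
}
```

The resultSet object here would come from node-oracledb; only the getRows/close callback API shown in the original post is assumed.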
I'm also seeing a SIGSEGV when using resultsets. I've found I can limit the frequency of the faults by setting numRows to a low number (~10) and limiting my pooled connections to around 5, but I still see the problem occasionally. With poolMax set to 50 or more, and numRows up above 40, I can get it to happen fairly consistently - around every 500 selects or so.
The same poolMax and numRows settings will occasionally also produce an NJS-032 error, but I can't tell if this is related, as it's not consistent.
I'm working on a windows machine, and connecting to a remote Oracle DB:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
The fault I'm getting is below. If you can't recreate it with the test case provided by @wilkolazki, let me know and I'll try to strip my code down to something you can use.
PID 9668 received SIGSEGV for address: 0xfd35965d
SymInit: Symbol-SearchPath: '.;C:\Program Files\nodejs\experiments;C:\Program Files\nodejs;C:\windows;C:\windows\system32;SRV*C:\websymbols*http://msdl.microsoft.com/download/symbols;', symOptions: 530, UserName: 'niallosul'
OS-Version: 6.1.7601 (Service Pack 1) 0x100-0x1
c:\program files\nodejs\node_modules\segfault-handler\src\stackwalker.cpp (941): StackWalker::ShowCallstack
c:\program files\nodejs\node_modules\segfault-handler\src\segfault-handler.cpp (116): segfault_handler
0000000077359ACF (ntdll): (filename not available): vsprintf_s
00000000773583F2 (ntdll): (filename not available): RtlUnwindEx
0000000077358D58 (ntdll): (filename not available): RtlRaiseException
000007FEFD35965D (KERNELBASE): (filename not available): RaiseException
f:\dd\vctools\crt\crtw32\eh\throw.cpp (154): _CxxThrowException
c:\program files\nodejs\node_modules\oracledb\src\dpi\src\dpiutils.cpp (80): ociCallCommon
c:\program files\nodejs\node_modules\oracledb\src\dpi\src\dpipoolimpl.cpp (174): PoolImpl::terminate
c:\program files\nodejs\node_modules\oracledb\src\njs\src\njspool.cpp (572): Pool::Async_Terminate
000000013FD06687 (node): (filename not available): uv_fs_poll_getpath
000000013FCF3170 (node): (filename not available): uv_once
00000001402398E7 (node): (filename not available): v8::Unlocker::~Unlocker
0000000140239A8E (node): (filename not available): v8::Unlocker::~Unlocker
00000000772359ED (kernel32): (filename not available): BaseThreadInitThunk
000000007736B371 (ntdll): (filename not available): RtlUserThreadStart
Just realised I posted the seg fault related to the NJS-032 Error, not the Result Set Close problem, sorry for any confusion... here's what I see when my app shuts down 'silently' - it's from the ResultSet Async_Close, same as in the original post.
PID 1396 received SIGSEGV for address: 0xefc4853c
SymInit: Symbol-SearchPath: '.;C:\Program Files\nodejs\experiments;C:\Program Files\nodejs;C:\windows;C:\windows\system32;SRV*C:\websymbols*http://msdl.microsoft.com/download/symbols;', symOptions: 530, UserName: 'niallosul'
OS-Version: 6.1.7601 (Service Pack 1) 0x100-0x1
c:\program files\nodejs\node_modules\segfault-handler\src\stackwalker.cpp (941): StackWalker::ShowCallstack
c:\program files\nodejs\node_modules\segfault-handler\src\segfault-handler.cpp (116): segfault_handler
0000000077359ACF (ntdll): (filename not available): vsprintf_s
00000000773583F2 (ntdll): (filename not available): RtlUnwindEx
000000007738D348 (ntdll): (filename not available): KiUserExceptionDispatcher
c:\program files\nodejs\node_modules\oracledb\src\njs\src\njsresultset.cpp (630): ResultSet::Async_Close
000000013FA36687 (node): (filename not available): uv_fs_poll_getpath
000000013FA23170 (node): (filename not available): uv_once
000000013FF698E7 (node): (filename not available): v8::Unlocker::~Unlocker
000000013FF69A8E (node): (filename not available): v8::Unlocker::~Unlocker
00000000772359ED (kernel32): (filename not available): BaseThreadInitThunk
000000007736B371 (ntdll): (filename not available): RtlUserThreadStart
@niallosul getting your testcase would be really helpful.
@cjbj - testCase.txt is a script that recreates the issue I'm seeing. My apologies for the code, but it's butchered down from a couple of different modules, so it's probably borderline illegible - let me know if you need me to clean it up, add comments, etc. It's not a very real life example, but does exhibit the same behavior we're seeing in our test application.
The registrations table I'm pulling from in this example typically has thousands of rows per day, and I limit it down to around 750 or so using the facility id. I'm not great with DDL, so I wasn't sure how to mimic that in a test table.
I think the root of the issue could be just latency or connectivity problems, but I'm not sure how I'm missing the error. Without segfault-handler, I don't see any errors that I could trap and maybe work in a retry or something. It could also be that I'm missing a callback in the async queue I'm using, so hopefully I'm not wasting your time.
Here's the output from two consecutive runs:
First, with a small numRows:
C:\Program Files\nodejs\experiments>node testCase
numRows: 8 poolMax: 10
Made 123 requests
Got 136 rows
. . .
Got 1240 rows
Got 1301 rows
Extracted 114454 rows
request queue is currently empty
pool terminated
And then when I turn up the numRows and use a different date range to attempt to force the problem (it happened first try):
C:\Program Files\nodejs\experiments>node testCase
numRows: 80 poolMax: 10
Made 122 requests
Got 183 rows
Got 147 rows
Got 298 rows
Got 274 rows
Got 151 rows
Got 963 rows
Got 1209 rows
Got 1358 rows
Got 1304 rows
Got 1332 rows
Got 1308 rows
Got 1430 rows
Got 1111 rows
PID 6368 received SIGSEGV for address: 0xef54853f
SymInit: Symbol-SearchPath: '.;C:\Program Files\nodejs\experiments;C:\Program Files\nodejs;C:\windows;C:\windows\system32;SRV*C:\websymbols*http://msdl.microsoft.com/download/symbols;', symOptions: 530, UserName: 'niallosul'
OS-Version: 6.1.7601 (Service Pack 1) 0x100-0x1
c:\program files\nodejs\node_modules\segfault-handler\src\stackwalker.cpp (941): StackWalker::ShowCallstack
c:\program files\nodejs\node_modules\segfault-handler\src\segfault-handler.cpp (116): segfault_handler
00000000771A9ACF (ntdll): (filename not available): vsprintf_s
00000000771A83F2 (ntdll): (filename not available): RtlUnwindEx
00000000771DD348 (ntdll): (filename not available): KiUserExceptionDispatcher
c:\program files\nodejs\node_modules\oracledb\src\njs\src\njsresultset.cpp (630): ResultSet::Async_Close
000000013F7F6687 (node): (filename not available): uv_fs_poll_getpath
000000013F7E3170 (node): (filename not available): uv_once
000000013FD298E7 (node): (filename not available): v8::Unlocker::~Unlocker
000000013FD29A8E (node): (filename not available): v8::Unlocker::~Unlocker
00000000770859ED (kernel32): (filename not available): BaseThreadInitThunk
00000000771BB371 (ntdll): (filename not available): RtlUserThreadStart
The reason I think it's possibly latency-related is that the next time I run the exact same date range, it usually works fine (maybe because of caching on the db side):
C:\Program Files\nodejs\experiments>node testCase
numRows: 80 poolMax: 10
Made 122 requests
Got 147 rows
Got 298 rows
Got 183 rows
Got 963 rows
Got 1209 rows
Got 1304 rows
. . .
Got 239 rows
Extracted 104012 rows
request queue is currently empty
pool terminated
@niallosul The latest release of the driver includes queuing by default. Could you test with that version? Maybe try stripping out your queueing code to see if the new built-in queue helps?
@dmcghan - will do, thanks for the info
@niallosul Sorry, I thought I followed up on my last comment but I guess I lost what I wrote...
Never mind on the queue, I see how you're using it now.
Can you run a test where you don't collect all the rows in an array in JavaScript? I wonder if this is a memory related issue. With a queue concurrency of 10 (poolMax) you're likely getting 10 very large arrays in your Node process.
Typically with a resultset you'd treat it like a stream and avoid the kind of buffering that's happening in the test. See if you can reproduce if you just increment a counter of the rows retrieved.
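A sketch of that counter-only test (again assuming the getRows/close callback shape from the original post; not the poster's actual code):

```javascript
// Sketch: drain a resultSet without buffering rows in JavaScript.
// Only a running count is kept, so memory stays flat no matter how
// many rows the query returns -- useful for isolating memory effects.
function countAllRows(resultSet, numRows, callback) {
    var count = 0;
    function fetch() {
        resultSet.getRows(numRows, function (err, rows) {
            if (err) {
                resultSet.close(function () { callback(err); });
            } else if (rows.length === 0) {
                resultSet.close(function (closeErr) { callback(closeErr, count); });
            } else {
                count += rows.length;  // count instead of appendTo.push(...)
                fetch();
            }
        });
    }
    fetch();
}
```

If the crash disappears with this variant, that would point toward memory pressure from the buffered arrays rather than the resultSet close path itself.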
@dmcghan - I just made the switch to v1.7.1 and got rid of the async queue I was using, and that seems to have done it. I'm seeing a couple of NJS-040 errors with bigger date ranges, but that's an easy fix.
I would like to switch to a stream, but having the full set in memory (briefly) is necessary for our real app, as we have a couple of further selects to make based on what comes back from the first one (it sounds weird, but the set of subselects varies, so we can't just do it with a single select).
I think we get an array per select rather than per pool (since it's declared at the top of getData() ), but I'll definitely look at that more closely if I start seeing the SIGSEGV again.
Well that was lucky! :)
You might want to consider using async's eachSeries method for some more control over processing the array without getting into the complexity of the queue.
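For reference, eachSeries just walks an array one item at a time, waiting for each async step to finish before starting the next. A minimal hand-rolled equivalent (a sketch, not the async library's actual implementation) looks like:

```javascript
// Minimal stand-in for async's eachSeries: processes items strictly one
// at a time, stopping early if any iteratee reports an error.
function eachSeries(items, iteratee, done) {
    var i = 0;
    function next(err) {
        if (err || i >= items.length) {
            return done(err || null);
        }
        iteratee(items[i++], next);
    }
    next(null);
}
```

Because only one item is in flight at a time, this avoids the burst of concurrent work a queue with high concurrency can produce.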
Unless the subselects are based on the entire result set, which is hard to imagine (but entirely possible), I would think you could get it into a stream.
Another option would be to rely more on PL/SQL. I know there's a tendency to want to use JavaScript for some of these things, but PL/SQL does have its advantages. In your case, the fact that all the data wouldn't have to move across the wire could be very beneficial.
I'd love it if you could explain, even with a superficial example, why all the data would need to be moved to the middle tier so that some subselects can be run. If you could do that, I could probably show you an example of how that logic could be implemented more efficiently with PL/SQL.
Have a nice weekend!
@niallosul I've grabbed your testcase. We'll see if we can uncover any underlying issue. Regarding the NJS-040, check the pool stats and see if your pool is big enough https://github.com/oracle/node-oracledb/blob/master/doc/api.md#connpoolmonitor and https://blogs.oracle.com/opal/entry/node_oracledb_1_7_0
@wilkolazki do you have a testcase?
@dmcghan - I'm sure you've got better things to do than a design review of my app, but just in case you're bored here's a quick explanation, with a simple example.
Basically, we need to generate csv files for a number of downstream systems who can't take data in any other way.
For simplicity's sake, we'll say that our database has 10 tables, and we're talking to 5 different ancillary systems who need various subsets of that data. One table (registrations, for example) is the primary and the other 9 tables are linked on a primary key. Ancillary A might need data from all 10 tables, but ancillaries B, C, D and E might only need data from 3 or 4 tables each. So we need to produce 5 data files, and some of the data in the files will overlap.
We originally wrote each extract as a standalone program, with a big SQL joining the necessary tables. This was okay because it was simple, and it worked fine until we got into bigger datasets, where maybe 4 of the SQLs were hitting the same 7 tables and taking an hour each to pull data.
I was familiar with node, and thought this might be an opportunity to divide and conquer. I wrote a node app that has 10 small SQLs, one for each table. The app goes and gets all the qualifying rows from the registration table, and then asynchronously grabs all data from the other tables. Once all the selects are back, I filter the dataset into each file. This app now only pulls the amount of data that the biggest file needed, and produces all five files in basically the same amount of time that the biggest file used to take.
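The filtering step described above could be sketched roughly like this (all names here are hypothetical, invented for illustration; the real app's schema and field lists will differ):

```javascript
// Sketch: split one superset of joined rows into per-ancillary outputs,
// where each ancillary declares which fields it needs. Each output array
// then gets written to that ancillary's CSV file.
function splitForAncillaries(rows, ancillaryFields) {
    var out = {};
    Object.keys(ancillaryFields).forEach(function (name) {
        var fields = ancillaryFields[name];
        out[name] = rows.map(function (row) {
            var slim = {};
            fields.forEach(function (f) { slim[f] = row[f]; });
            return slim;
        });
    });
    return out;
}
```

The superset of rows is pulled once, and each downstream file is just a projection of it, which is the "divide and conquer" win described above.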
So that's why I need to keep the initial data load in an array and then use it to hit the other tables. I'm sure there's a better way to do it using PL/SQL, but one problem with that is that we have several databases that we're pulling data from, so wouldn't we need to build and maintain the PL/SQL in every database rather than just having one centralized app?
One thing to note - I first built this out using an external JDBC-based api to get the data into my node app, and we never had any memory issues on the node end. I'd think that the array sizes would be the same.
Anyway, I'm definitely open to suggestions on better ways to do this. I'm not sure that streaming is possible, since I need all the data before I can start filtering it into each file (a row in a file usually contains data from all tables), but there's probably something that smarter people than me know about.
Thanks for all your help so far
@niallosul Sorry for the delayed response. Though I'm a little confused by some of the things you said, it sounds like you've optimized things according to your environment. Just in case you don't mind continuing the discussion...
You said:
We originally wrote each extract as a standalone program, with a big SQL joining the necessary tables. This was okay because it was simple, and it worked fine until we got into bigger datasets, where maybe 4 of the SQLs were hitting the same 7 tables and taking an hour each to pull data.
What was the original solution? The JDBC-based API you described later?
What do you mean by "...a big SQL joining the necessary tables..." and later "...maybe 4 of the SQLs are hitting the same 7 tables..."?
Have you considered an architecture where you have 1 specifically tuned query for each downstream system? Then you could run them in parallel (depending on the load) in separate node processes without buffering in memory.
If any of those queries were somewhat complex, maybe involving nested queries, I'd recommend looking into pipelined table functions: http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/pipe_paral_tbl.htm#CHDJEGHC
The idea would be to do the heavy lifting in PL/SQL and pipe the results you want to send to the CSV out. This could potentially save lots of network round trips and prevent buffering in the database and Node.js.
@niallosul Scratch the idea on pipelined table functions... I just ran a test and they do not seem to be working as I'd anticipated (I'll ping Chris about that).
@dmcghan - no worries at all on the delayed response, and thanks for taking the time to look at this.
The original solution was to use our proprietary middleware to generate the files - not worth getting into here, but it's an old SQL-like language that lets you do queries and report/file writing in one place.
The JDBC-based API is what I used when I originally started looking at using node for this. Since we don't own that API, I wanted to remove the dependency and use node-oracledb for the actual queries instead.
Your suggestion to have one specifically tuned query per system is actually what we're trying to get away from - that was the 'big SQL joining the necessary tables' to produce each file. What's been happening with that approach is that we're writing one query per file, but we have hundreds of files to generate, so all the different queries are getting difficult to manage.
Here's a more generalized question that would be great to have an answer to:
Is it better to have hundreds of individual queries pulling data from 10-20 tables, or one query per table with some middleware logic to filter and map the output into hundreds of files?
The latter is what we're working towards, since the individual queries are reusable for different 'main queries', and we can use node to asynchronously hit all the joined tables and pull a superset of data before we start creating files. However, node may not be the best option for the CPU-intensive processing of those large data sets.
Thanks again for any input
@niallosul I think the answer is: 'it depends'.
One thing that does stand out is that the various deployment scenarios sound like potential pain points. Maybe some new tooling would help?
It seems that there is a class of seg fault caused by the garbage collector closing resources which should still be in use. We are looking at some changes in the JavaScript work for 1.9 that should help out.
@wilkolazki @niallosul can you check with node-oracledb 1.9.2 (without any workarounds) and let us know if you still see crashes?
@cjbj, I have just compiled the 1.9.2 driver and I run my app as it was at the moment of reporting this issue. The app is running for some time now without any segfaults.
The new version fixed the problem I had.
Thank you!
I am sorry for not responding for so long. Unfortunately, I was moved from this app to other projects and could not report back.
@cjbj - looks good for me so far too... Thanks!
@wilkolazki @niallosul thanks for checking.
@wilkolazki I hope you get back to Node soon!
Hi,
I have created a simple REST service which, for each request, calls Oracle a few times. The largest query fetches about 1000 rows using a resultSet. The app works for some time, and after a few minutes, on the n-th request, I get a Segmentation Fault.
This is what I get from the segfault-handler module:
Here is my config:
Client Shared Library 64-bit - 12.1.0.2.0
oracledb: "1.6.0"
$ uname -a
Linux domx1st 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u5 x86_64 GNU/Linux
$ node --version
v4.3.1
Node is installed with nvm. The system-wide node version is:
$ /usr/bin/node --version
v0.10.29
What should I do to debug the issue?