plv8 / plv8

V8 Engine Javascript Procedural Language add-on for PostgreSQL
https://plv8.github.io/

Cost of calling plv8 functions #224

Closed mauritslamers closed 7 years ago

mauritslamers commented 7 years ago

(Note: This issue was originally about the cost of calling plv8.execute(), but the problem I have been noticing seems to be different).

I have found that using plv8.execute() is rather costly.

In my case I have a stored function called readrecords which calls a function called _retrieveLatestRecords, which generates the correct query, executes it, and returns the data. readrecords then does a transformation on the resulting json data and returns it to the client.

Even with all possible indexes, this procedure takes around 1200ms to complete. The actual resulting query only takes about 430ms to run if executed by hand. I added some logging through new Date().toJSON() in order to find where the time is spent.
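For illustration, the instrumentation looks roughly like this (a minimal sketch; only the log labels correspond to my real code, the bodies here are placeholders):

CREATE OR REPLACE FUNCTION readrecords(tbl TEXT, query JSONB)
RETURNS SETOF JSONB AS $$
  plv8.elog(INFO, 'start of readRecords:', new Date().toJSON());
  var retrieve = plv8.find_function('_retrievelatestrecords');
  plv8.elog(INFO, 'finished finding function', new Date().toJSON());
  var rows = retrieve(tbl, query);
  // (the transformation to "flat" json happens here)
  plv8.elog(INFO, 'finished converting to flatjson', new Date().toJSON());
  return rows;
$$ LANGUAGE plv8;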

INFO:  start of readRecords: 2017-06-19T08:00:47.413Z
INFO:  finished finding function2017-06-19T08:00:47.413Z
INFO:  start of _retrieveLatestRecords: 2017-06-19T08:00:47.413Z
INFO:  after searching _generateComparison: 2017-06-19T08:00:47.413Z
INFO:  after generating query: 2017-06-19T08:00:47.413Z
INFO:  after execution of query: 2017-06-19T08:00:48.086Z
INFO:  return of _retrieveLatestRecords2017-06-19T08:00:48.086Z
INFO:  finished converting to flatjson2017-06-19T08:00:48.095Z

Time: 1253.829ms

It is difficult to figure out in PostgreSQL how much data is actually being transferred between functions, but taking the resulting JSON and checking its length gives just under 5MB. The most striking bit of the timing information is that running the resulting query by hand takes around 430ms, while running the same query through plv8.execute takes 673ms according to the timestamps taken immediately before and after.

What I still don't get is where the remaining 500 to 600ms are lost. At first I thought that transferring the result from one stored function to the other would be costly as well, but the time difference between "after execution of query" (the last action before returning the result in _retrieveLatestRecords) and "return of _retrieveLatestRecords" (the first action performed after the result arrives in readrecords) is essentially zero. So my issue is two-fold: it seems obvious that plv8.execute adds quite a cost to every query, but at the same time a lot of time is lost somewhere else in the process, and I wonder what I can do to improve the performance, as it doesn't seem to be my embedded code that is causing the delay.

As stated in the issue title, I am running PostgreSQL 9.6 with plv8-1.4.8 as distributed through the PostgreSQL Ubuntu repository at http://apt.postgresql.org under Ubuntu 16.04.

mauritslamers commented 7 years ago

Some additional information: I have been trying to find out more with the help of the people in the #postgresql IRC channel on Freenode.

  1. The amount of time spent between the function being invoked and the query being started is very small, on the order of 2ms.
  2. Adding a LIMIT 1 to the call to the stored function brings the total query time down to 846ms, against an execution time of 591ms for the generated query.
  3. Using SET client_min_messages TO debug3 doesn't reveal any extra information.

JerrySievert commented 7 years ago

there's a lot of data conversion occurring under the hood in plv8_type.cc, converting from postgres native types to v8 native types to javascript (that latter part is really expensive, and one of the reasons why pure javascript functions are often faster than native functions in node.js).

this makes it much more difficult to see where the time is going. in the case of plv8.execute(), it's a call back to c++/c, then down through a subtransaction and memory contexts in postgres itself, then the returned values are coerced back to javascript.

given that you're using 1.4.8 (which is only supported for critical bug fixes), there is likely nothing that can really be done, but if you send me an example schema/query, I can fill it with dummy data, try it on something more modern (v8 3.14 is over 4 years old; the current version of plv8 runs against v8 5.8.301), and see if I can reproduce the issue.

mauritslamers commented 7 years ago

Thanks for the quick reply. I took the opportunity to update plv8 to v2.0.0 using the manual static build. I had been planning to update for ages, so I took the leap. I have updated the topic of this issue because I noticed that the main problem I have been experiencing doesn't seem to be related to plv8.execute() as much as I thought.

My original suspicion was that when plv8 functions are called, independent of whether they are called from the SQL side or the plv8 side (through plv8.find_function), the result has to cross the JS/C++ boundary, which is known to be costly in general, and in the case of plv8 even a bit more so (as you have explained above).
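For context, the two call paths look like this (a minimal sketch; plv8.find_function is plv8's documented way to call one stored plv8 function from another):

-- from the SQL side: each invocation crosses the SQL/JS boundary
SELECT readrecords('sometable', '{}'::jsonb);

// from the plv8 side: the call stays within the same v8 context
var retrieve = plv8.find_function('_retrievelatestrecords');
var result = retrieve('sometable', query);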

However, this original suspicion doesn't seem to hold up. I added some timestamp logging in my stored procedures by using plv8.elog(INFO, 'status text', new Date().toJSON()). Assuming these timestamps are accurate, I found that

  1. calling plv8.execute adds extra time to a query because of data conversion. In my case, the raw query ran in 450ms, and running it through plv8.execute() increased this to just under 600ms.
  2. using the new plv8 2.0.0 (and a new day), the stats were a bit different: the raw query ran in 530ms, running it through plv8.execute() took around 700ms, but the total time of the query was around 1040ms.
  3. Using some tests I figured out that the time from starting the query from the SQL side to the start of the function on the JS side is negligible.
  4. After the main query returns, an amount of time is spent that seems typically twice the cost of using plv8.execute(). This is strange, because the amount of data returned by the raw query is not much different from what is returned by the outer function, about 5MB.

Table layout:

                                    Table "public.sometable"
   Column   |           Type           |                        Modifiers                        
------------+--------------------------+---------------------------------------------------------
 id         | integer                  | not null default nextval('sometable_id_seq'::regclass)
 type       | character varying(50)    | not null
 data_id    | integer                  | 
 is_deleted | boolean                  | default false
 created_on | timestamp with time zone | 
 created_by | jsonb                    | 
Indexes:
    "sometable_pk" PRIMARY KEY, btree (id)
    "sometable_dataid_idx" btree (data_id)
    "sometable_isdeleted_idx" btree (is_deleted)
    "sometable_recordtype_idx" btree (type)
Foreign-key constraints:
    "matching_data" FOREIGN KEY (data_id) REFERENCES sometable_history(id) DEFERRABLE
Referenced by:
    TABLE "sometable_history" CONSTRAINT "matching_rec" FOREIGN KEY (record_id) REFERENCES sometable(id) DEFERRABLE

                                    Table "public.sometable_history"
   Column    |           Type           |                            Modifiers                            
-------------+--------------------------+-----------------------------------------------------------------
 id          | integer                  | not null default nextval('sometable_history_id_seq'::regclass)
 type        | character varying(50)    | 
 sha1        | character(40)            | not null
 sha1_parent | character(40)            | 
 record_id   | integer                  | 
 data        | jsonb                    | 
 modified_on | timestamp with time zone | 
 modified_by | jsonb                    | 
Indexes:
    "sometable_history_pk" PRIMARY KEY, btree (id)
    "sometable_history_data_idx" gin (data)
    "sometable_history_recordtype_idx" btree (type)
    "sometable_history_sha1_idx" btree (sha1)
    "sometable_history_sha1parent_idx" btree (sha1_parent)
Foreign-key constraints:
    "matching_rec" FOREIGN KEY (record_id) REFERENCES sometable(id) DEFERRABLE
Referenced by:
    TABLE "sometable" CONSTRAINT "matching_data" FOREIGN KEY (data_id) REFERENCES sometable_history(id) DEFERRABLE

The query is run as:

select * from readrecords('sometable', '{"exam_date": { "and": { "gte":"2016-08-31T22:00:00.000Z", "lte":"2017-08-31T22:00:00.000Z"}}}'::jsonb);

The query as generated by the stored procedures is:

SELECT sometable_history.data,
       sometable_history.type AS history_type,
       sometable_history.sha1,
       sometable_history.sha1_parent,
       sometable_history.modified_by,
       sometable_history.modified_on,
       sometable.id,
       sometable.type AS record_type,
       sometable.data_id,
       sometable.created_by,
       sometable.created_on,
       sometable.is_deleted
  FROM sometable
 INNER JOIN sometable_history ON sometable.data_id = sometable_history.id
 WHERE sometable_history.data->>'exam_date' >= '2016-08-31T22:00:00.000Z'
   AND sometable_history.data->>'exam_date' <= '2017-08-31T22:00:00.000Z'
   AND sometable.is_deleted = false;

The timestamp logging is:

INFO:  start of readRecords: 2017-06-20T15:00:08.234Z
INFO:  finished finding function2017-06-20T15:00:08.234Z
INFO:  start of _retrieveLatestRecords: 2017-06-20T15:00:08.234Z
INFO:  after searching _generateComparison: 2017-06-20T15:00:08.234Z
INFO:  after generating query: 2017-06-20T15:00:08.234Z
INFO:  after execution of query: 2017-06-20T15:00:08.946Z
INFO:  return of _retrieveLatestRecords2017-06-20T15:00:08.946Z
INFO:  finished converting to flatjson2017-06-20T15:00:08.948Z
Time: 1038.987 ms

The stored procedures transform the results returned from this query into a set of jsonb documents in which the contents of the data object are added to the root of the object and the data property is removed.
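A hypothetical sketch of that flattening step (the actual implementation is not shown in this issue):

// merge the contents of the jsonb `data` object into the root and drop `data`
function toFlatJson(row) {
  var flat = {};
  Object.keys(row).forEach(function (key) {
    if (key !== 'data') flat[key] = row[key];
  });
  Object.keys(row.data || {}).forEach(function (key) {
    flat[key] = row.data[key];
  });
  return flat;
}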

JerrySievert commented 7 years ago

why not just generate the query in the code that calls it, rather than in the stored procedure?

one thing to note is that the shape of the calls changes significantly depending on how you're calling the query:

SELECT myfunc(foo) FROM bar

will execute the query, and invoke the stored procedure.

whereas:

var results = plv8.execute('SELECT foo FROM bar');
do_something_with(results);

will use SPI, fetch all of the results, create a new array in v8, massage all of the types, and return the whole result set at once.

that's a pretty big difference in execution time.
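if you do need to run a query from inside plv8 but want to avoid materializing the entire result set in v8 at once, plv8's prepared statement/cursor API lets you convert one row at a time (a minimal sketch against the hypothetical bar table above):

var plan = plv8.prepare('SELECT foo FROM bar');
var cursor = plan.cursor();
var row;
while ((row = cursor.fetch())) {
  do_something_with(row); // each row is converted individually
}
cursor.close();
plan.free();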

in another life, we had a query that we could run in psql that would return in a few seconds, but when run in production, it would take 3-4 minutes. the differences were:

if you can get away with either rewriting the query, or doing the work without plv8.execute, you'd be much better off.

for instance, creating a view that aggregates the rows as JSONB and doing something like this:

CREATE OR REPLACE FUNCTION jsonb_search (data JSONB, query JSONB) RETURNS JSONB
AS $$

/*
{
  'time': {
    'between': [ 1495871252970, 1495872252970]
  },
  'metric': 'system.load',
  'tags': [ 'loadavg' ],
  'hostname': 'example.com',
  'data': {
    '1min': {
      'gt': 1
    }
  }
}
*/

  function compare (left, right, operator) {
    if (operator === 'between') {
      return (left >= right[0] && left < right[1]);
    }

    if (operator === 'eq') {
      return (left == right);
    }

    if (operator === 'gt') {
      return (left > right);
    }

    if (operator === 'gte') {
      return (left >= right);
    }

    if (operator === 'lt') {
      return (left < right);
    }

    if (operator === 'lte') {
      return (left <= right);
    }
  }

  var rows = [ ];

  for (var i = 0; i < data.rows.length; i++) {
    var forward = false;
    if (query.time) {
      // between, eq, gt, gte, lt, lte
      if (query.time.between) {
        forward = compare(data.rows[i]._internal.time, query.time.between, 'between');
      }
    }

    if (forward && query.data) {
      var keys = Object.keys(query.data);

      for (var j = 0; j < keys.length && forward; j++) {
        var key = keys[j];

        var operators = Object.keys(query.data[key]);

        for (var k = 0; k < operators.length && forward; k++) {
          var operator = operators[k];

          forward = compare(data.rows[i][key], query.data[key][operator], operator);
        }
      }
    }

    if (forward) {
      rows.push(data.rows[i]);
    }
  }

  return rows.length ? { rows: rows } : null;
$$ LANGUAGE plv8;

might be a bit faster.
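usage would look something like this (hypothetical view name, assuming it exposes a single jsonb column doc shaped like the { "rows": [ ... ] } structure the function expects):

-- hypothetical: aggregated_view exposes one jsonb column `doc`
SELECT jsonb_search(doc, '{"time": {"between": [1495871252970, 1495872252970]}}'::jsonb)
  FROM aggregated_view;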

mauritslamers commented 7 years ago

I think you might have missed my change in focus in this issue.

My main reason for using stored procedures for things such as reading and finding data is that the format in which the data is stored is a bit different from how I want to represent it. I therefore decided to implement an API inside the db which does the conversion in both directions, abstracting away the storage format from any middleware using it.

My main issue is not the performance of plv8.execute() as such, but the time spent after the invoked stored procedure returns. plv8.execute() adds around 150ms to my query, which is acceptable. However, after my function readrecords returns, around 300ms is spent somewhere and I don't really understand where. If this value were closer to 150ms it would make sense, as it would be similar to the overhead of plv8.execute(). But as this time loss is significantly larger, I am wondering what causes it.

JerrySievert commented 7 years ago

how many rows/columns are you returning, and of what types?

then I can calculate a good approximation of how many conversions and of what types are occurring.

mauritslamers commented 7 years ago

The readrecords function returns a single column of jsonb (the stored function is defined as returning SETOF jsonb), with 334 rows. The length of the result as a JSON string is 5070161 characters.

JerrySievert commented 7 years ago

Waiting on a file system conversion to finish before I can give you specifics (you caught me in the middle of High Sierra updates), but:

JSON conversion is fairly straightforward, but are you using jsonb or json? There are some subtle differences I can help explain, including when to use each with plv8.

Will explain how the 334 rows are generated when I get my laptop back.

mauritslamers commented 7 years ago

As described in the previous comment, the outer function invoked from SQL returns SETOF jsonb.

mauritslamers commented 7 years ago

I'd love some insight into which format to use when, as I'd rather not give up my in-db API, and the more efficient that code can be, the better!

JerrySievert commented 7 years ago

let's look at the pros and cons of each format for what you're trying to do. NOTE: this is strictly the case for plv8, and does not necessarily apply to postgres as a whole, or to any other language plugin:

JSON

Pro

- stored and returned as text: coming out of plv8 it is a simple cast to text plus a validation pass
- the cheapest option when the result is only being handed back to the client

Con

- has to be parsed in full before the data can be used inside plv8 or elsewhere in the database

JSONB

Pro

- stored as a structured binary format, so it can be indexed, searched, and manipulated efficiently inside the database

Con

- returning it from plv8 means building the complete jsonb data structure, which is much more expensive than a simple validation

in your case, it's best for your base tables to be JSONB, and that's what you have: you're searching against them and returning some results.

the question I have is: are you doing anything else with the results of the readrecords call inside of postgres, or are you simply returning the results back to the client?

if it's the latter, JSON will be your winner. let's look at the code path for returning JSON:

https://github.com/plv8/plv8/blob/master/plv8_type.cc#L326

that's a simple cast to text: the database does the minimum it can (validation) and returns what is essentially text back to the client.

using JSONB is a different story:

https://github.com/plv8/plv8/blob/master/plv8_type.cc#L313

this is doing a transformation inside of PLV8 to make sure that it returns correctly formatted JSONB for output back into the database itself - it creates the full data structure, which is much more expensive than a simple validation.

once again, if you're going to do ANYTHING in the database with the data, JSONB is going to be your winner, but if you're simply going to return it to the client, use the path of least resistance and use JSON instead. you can change that by simply returning SETOF JSON instead of SETOF JSONB.
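for example (a sketch only: the real readrecords body isn't shown in this thread, the point is the declared return type):

-- RETURNS SETOF JSON: each value is cast to text and validated on return,
-- instead of being rebuilt as a full jsonb structure
CREATE OR REPLACE FUNCTION readrecords_json(tbl TEXT, query JSONB)
RETURNS SETOF JSON AS $$
  // query handling elided; a real version would build its WHERE clause from `query`
  var rows = plv8.execute('SELECT data FROM ' + plv8.quote_ident(tbl));
  return rows.map(function (r) { return r.data; });
$$ LANGUAGE plv8;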

an additional thought is that instead of returning a set, it might be a little cleaner to collect the results in memory and return a single result like:

{
  "results": [
      (row1),
      (row2)
   ]
}

etc. in other words, instead of returning a SETOF, return a single document containing an actual array. that will save you 333 of your 334 conversions, which should improve things quite a bit.
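a sketch of that shape (same hypothetical body as above; the difference is that a single JSON document crosses the plv8/postgres boundary instead of 334 separate rows):

CREATE OR REPLACE FUNCTION readrecords_single(tbl TEXT, query JSONB)
RETURNS JSON AS $$
  // query handling elided, as above
  var rows = plv8.execute('SELECT data FROM ' + plv8.quote_ident(tbl));
  return { results: rows.map(function (r) { return r.data; }) };
$$ LANGUAGE plv8;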

mauritslamers commented 7 years ago

This is a great example of best practice (which I realized was one of my unconscious objectives). I will try a few different approaches and report back here as a way of documenting those best practices.

mauritslamers commented 7 years ago

Some preliminary results: choosing json instead of jsonb for the main readrecords function (which is part of the public API and not used internally) reduces the return overhead from 300-500ms to around 50ms. It doesn't seem to matter much whether SETOF json or plain json is used, so for the moment I am sticking with SETOF json, as that doesn't require changing my middleware. I also managed to get the core query time down to around 130ms by introducing some indexes, leading to the following timing:

INFO:  start of readRecords: 2017-06-25T09:11:55.632Z
INFO:  finished finding function2017-06-25T09:11:55.632Z
INFO:  start of _retrieveLatestRecords: 2017-06-25T09:11:55.632Z
INFO:  after searching _generateComparison: 2017-06-25T09:11:55.632Z
INFO:  after generating query: 2017-06-25T09:11:55.633Z
INFO:  after execution of query: 2017-06-25T09:11:55.939Z
INFO:  return of _retrieveLatestRecords2017-06-25T09:11:55.939Z
INFO:  finished converting to flatjson2017-06-25T09:11:55.942Z
Time: 372.000 ms

JerrySievert commented 7 years ago

this actually made me switch my own procs to return JSON over JSONB.

going to go ahead and close this ticket - I'll reference it if other questions come up though.

mitar commented 5 years ago

Yes, I noticed a similar thing with the node-postgres driver: for transport, JSON is better than JSONB: https://mitar.tnode.com/post/in-nodejs-always-query-in-json-from-postgresql/

JerrySievert commented 5 years ago

2 things:

  1. you can use SPI and get the data back raw
  2. the JSONB vs JSON speed differences were well known and have been addressed; JSONB now does a direct conversion in c++ into v8 objects and runs faster than a JSON conversion

as for your benchmarks, in order to fully understand why, you need to understand how JSON is stored in postgres and what that entails. see https://legitimatesounding.com/blog/json_v_jsonb.html for a more in-depth response. it explains the difference and how/why those changes can be effected in postgres. modern plv8 works directly with those data structures now.

mitar commented 5 years ago

as for your benchmarks

I was benchmarking the transport time, not query time. I think the main difference is how node-postgres has to parse the jsonb payload from the wire versus the json payload (for the latter it can use the internal v8 JSON parser, which is much faster than a jsonb parser).

Thank you for the blog post. I have read it through!

modern plv8 works directly with those data structures now.

I have to learn more about how this is done, which you described here.

JerrySievert commented 5 years ago

plv8 directly accesses the jsonb data structures; there's nothing to parse or serialize, it simply iterates through the data structure in c/c++ directly from memory. at that point there is no separation between plv8 and postgres: it's not using a public API to interface, it's reading the data in the native postgres storage format and dealing with it without an intermediary step.

editing to add: when it returns jsonb back to postgres, it returns it in Postgres's native jsonb format, not as json or text.

JerrySievert commented 5 years ago

and yes, that means that whenever postgres changes its internals plv8 has to change as well - there are #if statements all over the place to deal with postgres changes over the years.

mitar commented 5 years ago

it's simply reading the data in the native postgres storage format

Yes, in #365 I am asking how I could access this data in the native postgres storage format directly, skipping any parsing. Because I would just like to proxy it through.

JerrySievert commented 5 years ago

Yes, in #365 I am asking how I could access this data in the native postgres storage format directly, skipping any parsing. Because I would just like to proxy it through.

by writing your own extension to intercept the data. plv8 cannot do that by design: it can only input and output postgres data types.

that's what a language extension is, and a trusted language extension doubly so.

JerrySievert commented 5 years ago

to expand:

there are 2 ways to break out of that:

  1. directly access postgres internals, which plv8 will not do for you by design, but if you write an extension you can absolutely do it yourself
  2. directly access system calls or anything outside of postgres, which plv8 will not do for you by design, but if you write an extension you can absolutely do it yourself

the difference is that plv8 is a trusted extension - it can do what you want, but then it would be an untrusted extension. that's a postgres distinction. also, I've yet to see a hosting provider offer untrusted extension access, and most DBAs likely won't grant it either. it's essentially like letting anyone write a kernel driver and run it as root.

there's a reason why what you get out of plv8.execute() is what you get: that part uses SPI, which provides a safe and practical way for untrusted code to access the database while still checking the permissions it needs. the place where that doesn't apply is when postgres sends data to or receives data from plv8; whenever an extension goes outside of those bounds, it either turns into an untrusted extension or has to be extremely careful to mitigate any issues (which plv8 does in the few places it goes outside those bounds).

mitar commented 5 years ago

all information leaving a plv8 function comes as a postgres type

Sure, so it could return a BYTEA view of the internal Datum representing the row, no?