isoos / postgresql-dart

Dart PostgreSQL driver: supports extended query format, binary protocol and statement reuse.
https://pub.dev/packages/postgres
BSD 3-Clause "New" or "Revised" License
131 stars 37 forks source link

RangeError in server_messages.dart #387

Open ryanheise opened 1 week ago

ryanheise commented 1 week ago

I just noticed this in my logs today:

Zone error: RangeError: Value not in range: -1603639493
#0      _rangeCheck (dart:typed_data-patch/typed_data_patch.dart:5113)
#1      _ByteBuffer.asUint8List (dart:typed_data-patch/typed_data_patch.dart:1928)
#2      new Uint8List.view (dart:typed_data:903)
#3      ByteDataReader.read (package:buffer/buffer.dart:407)
#4      new DataRowMessage.parse (package:postgres/src/messages/server_messages.dart:186)
#5      _messageTypeMap.<anonymous closure> (package:postgres/src/message_window.dart:23)
#6      MessageFramer.addBytes (package:postgres/src/message_window.dart:80)
#7      _readMessages.<anonymous closure>.<anonymous closure>.handleChunk (package:postgres/src/v3/protocol.dart:73)
#8      _rootRunUnary (dart:async/zone.dart:1407)
#9      _CustomZone.runUnary (dart:async/zone.dart:1308)
#10     _CustomZone.runUnaryGuarded (dart:async/zone.dart:1217)
#11     _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:365)
#12     _BufferingStreamSubscription._add (dart:async/stream_impl.dart:297)
#13     _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:784)
#14     _StreamController._add (dart:async/stream_controller.dart:658)
#15     _StreamController.add (dart:async/stream_controller.dart:606)
#16     _Socket._onData (dart:io-patch/socket_patch.dart:2455)
#17     _rootRunUnary (dart:async/zone.dart:1415)
#18     _CustomZone.runUnary (dart:async/zone.dart:1308)
#19     _CustomZone.runUnaryGuarded (dart:async/zone.dart:1217)
#20     _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:365)
#21     _BufferingStreamSubscription._add (dart:async/stream_impl.dart:297)
#22     _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:784)
#23     _StreamController._add (dart:async/stream_controller.dart:658)
#24     _StreamController.add (dart:async/stream_controller.dart:606)
#25     new _RawSocket.<anonymous closure> (dart:io-patch/socket_patch.dart:1942)
#26     _NativeSocket.issueReadEvent.issue (dart:io-patch/socket_patch.dart:1385)
#27     _microtaskLoop (dart:async/schedule_microtask.dart:40)
#28     _startMicrotaskLoop (dart:async/schedule_microtask.dart:49)
#29     _runPendingImmediateCallback (dart:isolate-patch/isolate_patch.dart:118)
#30     _RawReceivePort._handleMessage (dart:isolate-patch/isolate_patch.dart:185)

After this happens, the connection is essentially hung on the query forever. If I go into psql from the console and select from pg_stat_activity I can see all the queries are in the idle state and nothing is blocking them. This is not recoverable unless I manually kill the idle processes with pg_terminate_backend.

I can only see this exception by listening for zone errors so I can't directly catch it at the point in my code where the exception was triggered. I guess the exception should not happen in the first place, but if it does happen, it would be nice if server_messages.dart handled the exception and forwarded it back to the original call site.

I believe there might also be some other scenarios where the connection can hang, and the only way is to kill the idle processes manually. However, this is the only time when I've been given an actual exception to report. Since those other cases are still unknown, I am hoping there is an explanation in general for why this package might be hanging on idle processes where nothing is blocking them (I'm assuming it's something like the processes are remaining in the idle state because this package hasn't yet closed the cursors or something). As yet, I haven't worked out a way to reproduce it other than letting the application run for long enough until by chance the DB connection hangs. The stacktrace above, however, has only ever happened once, so I'm not sure if I'm likely to ever see it again.

This exception occurred with release 3.4.2.

isoos commented 1 week ago

@ryanheise thanks for reporting this, it is worth to investigate what we can improve here. Why postgresql version are you using? Any special extension or data type that could have been part of the result?

ryanheise commented 1 week ago

Here is the output from within psql:

postgres=# select version();
                                                       version
---------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

postgres=# \dx
                                    List of installed extensions
  Name   | Version |   Schema   |                            Description
---------+---------+------------+-------------------------------------------------------------------
 amcheck | 1.3     | pg_catalog | functions for verifying relation integrity
 pg_trgm | 1.6     | public     | text similarity measurement and index searching based on trigrams
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
 vector  | 0.5.1   | public     | vector data type and ivfflat and hnsw access methods

Although I did not log the specific SQL query, there are queries based on the pgvector extension, both to insert/update embeddings and also to query embeddings. In each of these cases, I am passing the vectors in their string/text representation (I wasn't able to get it to work directly as an array).

ryanheise commented 1 week ago

I actually do have the idle query saved from my psql session. I can't give the entire schema, but mytable has columns with the following types:

......     boolean                                 
myhash     bytea                           not null
......     integer                                 
......     integer                         not null
......     text                                    
......     text                            not null
......     text[]                                  
......     text[]                          not null
......     timestamp(3) with time zone             
......     timestamp(3) with time zone     not null
......     vector(512)                             

And the idle query was long these lines:

SELECT m.*, encode(m.myhash, 'base64') AS myhashBase64
FROM mytable m
WHERE ...
isoos commented 1 week ago

@ryanheise these were helpful, thank you! I did not explore the actual parsing error yet, but I have found a missing part where the protocol-level bug was not bubbling up, and I wrote a test for it #388. I've published a new version, hopefully it will expose the actual protocol-level bug with better details.