Closed thomaschristopherking closed 1 year ago
@thomaschristopherking I've managed to reproduce the example locally, and it contains some interesting findings!
I took your reproducible sample and made a small addition, adding a per-answer performance profiler:
```diff
  start = time.time()
  matches = tx.query().match(match_name)
+ time_diffs = []
  for m in matches:
      name = m.map().get('fn').get_value()
+     time_diffs.append(time.time() - start)
  diff = time.time() - start
+ print(' '.join(str(d) for d in time_diffs))
  print(f'method_execution_time_seconds: {diff}')
```
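As an aside, the batch pattern can be surfaced mechanically from the collected `time_diffs` by looking for unusually large inter-answer gaps. A minimal sketch (the 2 ms threshold is an arbitrary assumption on my part, not a measured value):

```python
def batch_boundaries(time_diffs, gap_threshold=0.002):
    """Return the indices at which a new batch of answers began.

    `time_diffs` holds each answer's arrival time relative to the start
    of the query; a gap larger than `gap_threshold` seconds between
    consecutive answers is treated as a batch boundary.
    """
    return [i for i in range(1, len(time_diffs))
            if time_diffs[i] - time_diffs[i - 1] > gap_threshold]
```

With the default prefetch size of 50, you would expect boundaries at indices 50, 100, 150, and so on.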
What this reveals is that the answers come in "batches" of size 50: the first 50 answers arrive in, say, 0.005s, then there is a 0.005s gap, then another 50 answers arrive, then another 0.005s gap, and so on. This indicates a bug in the implementation of prefetch size - and sure enough, I've tracked it down. It manifests when connecting to localhost, and occurs due to the following logical flaw.
Answers are streamed in batches of size N from the server (where N = prefetch_size, default 50), to prevent the server doing unnecessary work in case the client does not end up consuming all answers. Once the client sees N answers, it should send a "CONTINUE" request to the server to continue streaming.
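A client-side view of this protocol can be sketched in plain Python (`send_continue` here is a hypothetical stand-in for the real CONTINUE RPC, not the actual client API):

```python
def consume_all(first_batch, send_continue):
    """Drain a batched answer stream.

    `first_batch` is the initial list of up to N answers; `send_continue`
    is a stand-in callable that asks the server to resume streaming and
    returns the next batch (an empty list means the stream is exhausted).
    """
    answers = list(first_batch)
    batch = first_batch
    while batch:                 # a non-empty batch may be followed by more
        batch = send_continue()  # request the next batch from the server
        answers.extend(batch)
    return answers
```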
However, while the Nth answer is being sent to the client, and while the server waits to receive the CONTINUE request, streaming should actually continue. If it doesn't, we end up with "wasted time" in which the server waits and sends nothing. Thus, the server must predict, as best it can, when the client's next CONTINUE will arrive; that delay is typically equal to the network latency.
However, when connecting to localhost, the measured network latency is 0 - yet it is physically impossible for the client to respond at the exact moment the server sends the Nth answer. localhost is an edge case that is currently unhandled.
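A back-of-the-envelope model of the wasted time (all numbers below are illustrative assumptions chosen to mirror the ~0.005s batches observed above, not measurements):

```python
def total_query_time(num_answers, prefetch_size=50,
                     per_batch_secs=0.005, continue_gap_secs=0.005,
                     pipelined=False):
    """Rough model of answer-streaming time.

    When `pipelined`, the server correctly predicts the CONTINUE and keeps
    streaming, so no idle gaps occur; otherwise it stalls for
    `continue_gap_secs` after every batch except the last.
    """
    batches = -(-num_answers // prefetch_size)  # ceiling division
    idle = 0 if pipelined else (batches - 1) * continue_gap_secs
    return batches * per_batch_secs + idle
```

Under these assumptions, a 500-answer stream takes nearly twice as long when stalled as when pipelined, which is consistent with the "over 2x" speed-up observed after the fix.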
To mitigate the problem, we should coerce the measured network latency to be at least 1ms.
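The mitigation is essentially a one-line clamp (a sketch; the constant name is mine, and the real fix lives in the server's latency-measurement code):

```python
MIN_NETWORK_LATENCY_SECS = 0.001  # never assume the client can reply in < 1 ms

def effective_latency(measured_secs):
    # On localhost the measured round-trip is ~0, which would make the
    # server stop streaming and wait for CONTINUE; clamping to 1 ms keeps
    # it streaming ahead of the client's next request.
    return max(measured_secs, MIN_NETWORK_LATENCY_SECS)
```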
(Interestingly, this flaw affects all clients, not just Client Python, so the Python-specific slowness must have a different cause; but with this change, we already speed every client up by over 2x when connecting to localhost, which is a massive boost!)
Hi Alex, thanks for looking into this, it makes sense. If it's just for localhost then I guess it's not such a big deal (although I think our network latency is low in the cloud as well). But why does this not occur in Studio?
Why indeed - the issue affecting client-python must be a different issue. So I'm still going to keep this issue open as it remains unresolved; however, I'll push out the latency measurement change to all clients - a valuable boost for anyone developing on their local machine (in any language).
For what it's worth, just in case it helps with narrowing down the Python speed issues, I've noticed that it takes significantly more time to run a single complex query (e.g., starting with 5 Attributes, find all Entities that have those attributes (maybe 3 matches?), then find all Relations that have those Entities as Players (maybe 5 Relation matches?)) than it does to simply dump everything in the database and post-process all the results in Python.
If I've got a batch of 800 or so queries (resetting read tx's after every 25 queries run) and I'm watching the log of each query being run in real-time, I've noticed that when it hangs on getting a result, it always seems to hang on the more complex queries rather than the short ones.
My point is, to a certain extent, I think the complexities of a specific search (or maybe a complex search thrown into a batch of searches?) contributes to the slow-down much more significantly than just the act of retrieving all the data.
Absolutely @TheDr1ver - the impact of a client-server connection's inefficiency will be minimal when the query itself is complex to execute on the server. match $x isa first_name is one of the simplest queries possible in TypeDB.
Having said that, if your query is significantly slower than dumping the whole DB and post-processing the results in Python, that may indicate a query planning issue and may be worth raising in https://github.com/vaticle/typedb. For instance, if retrieving all people with a filter like match $p isa person, has age $age; $age < 40; were to perform significantly worse than filtering by age in Python as follows:

```python
[(cm.get("p"), cm.get("a"))
 for cm in tx.query().match("$p isa person, has age $a;")
 if cm.get("a").get_value() < 40]
```
then that would certainly indicate a bug in the query planner.
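To make the comparison concrete without a running TypeDB server, the client-side variant can be exercised against minimal stand-in objects (FakeAttribute and FakeConceptMap are hypothetical shims that mirror only the .get()/.get_value() calls used above; they are not the real client classes):

```python
class FakeAttribute:
    # Shim for an attribute concept: implements only get_value().
    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value


class FakeConceptMap:
    # Shim for a ConceptMap answer: implements only get(var).
    def __init__(self, bindings):
        self._bindings = bindings

    def get(self, var):
        return self._bindings[var]


def people_under_40(answers):
    # The Python-side filter from the comparison: fetch every
    # (person, age) pair, then discard ages >= 40 locally.
    return [(cm.get("p"), cm.get("a")) for cm in answers
            if cm.get("a").get_value() < 40]
```

If the server-side `$age < 40` filter runs much slower than this post-filter over the full result set, the query planner is doing something wrong.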
@thomaschristopherking, @TheDr1ver : I believe we've found a conclusive answer to the inefficiency of Client Python (versus Clients Java and Node.js):
https://grpc.io/docs/guides/performance/#python
Streaming RPCs create extra threads for receiving and possibly sending the messages, which makes streaming RPCs much slower than unary RPCs in gRPC Python, unlike the other languages supported by gRPC.
This implies that the behaviour we are seeing (where Python is significantly slower at retrieving query answers than Studio, which uses Client Java) is expected in gRPC Python.
The gRPC team propose one potential fix: refactoring the code to use asyncio. However, this is a significant undertaking, and with the development of typedb-client-rust due to be completed in the fairly near future, we will be rewriting Client Python as a thin Python wrapper over an underlying Rust library - which should run optimally and resolve this issue.
In the meantime, if optimal performance is essential for a product, we should recommend using Client Java, Node.js or Julia, via a separate microservice that connects to the user's Python program if need be.
Thank you @alexjpwalker, that makes sense - another Python annoyance. I'm guessing this issue could be closed (or linked to the Rust client implementation?). I think we can wait for the Rust client.
Absolutely - this issue will now form part of the TypeDB Client Rust Rewrite. Thanks again for flagging this issue up!
Closing this issue as it (should) largely dissolve when we rewrite the Python client to call the Rust client through FFI.
Description
When comparing the retrieval of a single attribute using the Python client against Studio, Python is around five times slower.
Environment
Reproducible Steps
Steps to create the smallest reproducible scenario:
```typeql
match $fn isa first_name; get $fn;
```
Expected Output
A similar execution time for both Python and Studio queries.
Actual Output
Python time: ~5.5s
Studio time: ~1.4s to execute, and ~1.972s to output results