googleapis / langchain-google-spanner-python

Apache License 2.0
10 stars 10 forks source link

Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

Open nielm opened 4 weeks ago

nielm commented 4 weeks ago

The code in main/src/langchain_google_spanner/loader.py#L230 is flaky:

Specifically:

    for partition in partitions:
      r = snapshot.process_query_batch(partition)
      results = r.to_dict_list()
      if len(results) == 0:
        break
      # ... process rows in results

That break should be a continue so that the remaining partitions are read.

Reason being that this is a partitioned query (see https://cloud.google.com/spanner/docs/reads#read_data_in_parallel) which requires you to read all the returned partitions.

Some of these partitions may be empty, some may have data, it depends on how much data is in your tables, the query, and some Spanner internals.

This code aborts iterating through the partitions as soon as it finds the first empty partition

An additional (minor) error in this code is that the database snapshot is never closed, which is a resource leak.

Another is that this code will fail if the caller takes too long (more than 1hr) iterating over the returned values, due to some of the partitions not having been read and timing out...

Note that this partitioned query API is not that useful in this use case. It is meant to be used in a multi-threaded or multi-machine/multi-process environment (eg Dataflow/Flume) where partitions of data are processed in parallel. The code here simply processes the partitions in series (aborting if one is empty).

There is a much simpler way of querying the database, which simply returns a single ResultSet: https://cloud.google.com/python/docs/reference/spanner/latest/snapshot-usage#execute-a-sql-select-statement (see the top of the page for using a snapshot with a fixed staleness timestamp)

I recommend removing all partitioned queries from the spanner loader code, and use simple queries.

aperepel commented 4 weeks ago

I confirmed this 1-liner fixes all the flaky load issues we saw. Next step would be to explore a simple API.