toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License

Memory leak when using geo_point #531

Open mbianchihr opened 4 months ago

mbianchihr commented 4 months ago

PGSync version: 3.1.0
Postgres version: 16
Elasticsearch/OpenSearch version: 8.11.4
Redis version: 7.2.4
Python version: 3.11.6

Problem Description:

When fields are marked as geo_point, a memory leak occurs.

[
    {
        "database": "db",
        "index": "delivery",
        "plugins": [
            "Multisource"
        ],
        "nodes": {
            "table": "parcel",
            "schema": "public",
            "columns": [
                "id",
                "externalId",
                "status",
                "weight",
                "statusChangeTime",
                "statusNote",
                "createdAt"
            ],
            "transform": {
                "mapping": {
                    "location": {
                        "type": "geo_point"
                    },
                    "proofLocation": {
                        "type": "geo_point"
                    }
                }
            },
          ...
      }
   }
]
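For context, Elasticsearch accepts a geo_point value as a bare [lon, lat] array, which is exactly the "coordinates" member of a GeoJSON Point. A minimal sketch of that shape (the row contents below are hypothetical; only the field names come from the config above):

```python
# Hypothetical row as it might arrive from Postgres, with the PostGIS
# geometry serialized as a GeoJSON Point under the nested "delivery" key.
row = {
    "id": 1,
    "delivery": {
        "location": {"type": "Point", "coordinates": [16.44, 43.51]},
    },
}

# After the plugin lifts it to the top level, "location" holds a bare
# [lon, lat] pair, which Elasticsearch accepts for a geo_point field.
doc = {"id": row["id"], "location": row["delivery"]["location"]["coordinates"]}
print(doc["location"])  # [16.44, 43.51]
```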

Then, in the plugin, I have the following "rewrite", since the location fields are nested deep in the document and are remapped to the top level.

from pgsync import plugin

class MultisourcePlugin(plugin.Plugin):
    name = 'Multisource'

    def transform(self, doc, **kwargs):

        ...

        doc['location'] = None
        doc['proofLocation'] = None

        ...

        if doc['delivery'].get('location'):
            doc['location'] = doc['delivery']['location']['coordinates']

        if doc['delivery'].get('proofLocation'):
            doc['proofLocation'] = doc['delivery']['proofLocation']['coordinates']

        ...
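One thing worth trying, purely as a hedged sketch and not a confirmed fix: copy the coordinates out and then drop the nested "delivery" payload from the document, so the transform does not keep the large GeoJSON objects reachable. The standalone function below (the name lift_locations and the sample doc are my own, not part of the pgsync API) shows the idea:

```python
def lift_locations(doc):
    """Copy nested GeoJSON coordinates to top-level geo_point fields and
    drop the nested payload so no extra references are retained."""
    delivery = doc.get("delivery") or {}
    for field in ("location", "proofLocation"):
        geojson = delivery.get(field)
        doc[field] = geojson["coordinates"] if geojson else None
    # Remove the nested dict once its data has been copied out; the large
    # GeoJSON objects then become eligible for garbage collection.
    doc.pop("delivery", None)
    return doc

doc = lift_locations({
    "id": 7,
    "delivery": {"location": {"type": "Point", "coordinates": [15.97, 45.81]}},
})
print(doc)  # {'id': 7, 'location': [15.97, 45.81], 'proofLocation': None}
```

Whether dropping the nested key actually changes pgsync's retention behavior depends on what pgsync itself keeps around between batches, so this is only a way to rule the plugin side in or out.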

When not using this, barely any memory is used, and Redis uses around 6-10 MB of RAM in both cases. The real issue is the memory usage, which makes the EC2 instance fail and become unresponsive at around 200,000 rows with 3.7 GB of RAM in use; extrapolating from current usage, our full data set of 1.5 million rows would eat up 32+ GB of RAM.

This happens on the initial sync. What I suspect is happening is that data is not being cleared from memory because the fields do not exist in the schema at the top level.

However, I would love to be pointed to the right place, and yes, I sadly have to have these fields at the top level instead of at parcel.delivery.location.
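To narrow down where the growth comes from, one option is the standard-library tracemalloc module, which attributes live allocations to call sites. A sketch of a helper that could be called periodically during a long sync (the function name and the simulated data are my own):

```python
import tracemalloc

# Start tracing; keeping 10 frames per allocation gives usable tracebacks.
tracemalloc.start(10)

def report_top(limit=5):
    """Return the call sites currently holding the most traced memory.
    Calling this every N documents during the sync shows where the
    growth accumulates."""
    snapshot = tracemalloc.take_snapshot()
    return snapshot.statistics("lineno")[:limit]

# Simulate retained data, then inspect the largest allocation sites.
retained = [b"x" * 10_000 for _ in range(100)]
for stat in report_top():
    print(stat)
```

If the top entries point into pgsync's sync machinery rather than the plugin, that would support the theory that rows are being held because the mapped fields are not in the schema.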

Sync with geo_point - 150k/1.5mil synced [screenshot]

Sync without geo_point - 600k/1.5mil synced [screenshot]