toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License

Unused Function causing Memory Leak? #316

Open markhalonen opened 2 years ago

markhalonen commented 2 years ago

PGSync version: master

Postgres version: 13.7

Elasticsearch version: 8.3.2

Redis version: 7

Python version: 3.8.10

Problem Description:

We have a memory leak we're trying to track down.

Can I ask what the utility of this function is: https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/base.py#L402

It seems to only issue a select statement and return the results, but the callers don't use the results: https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/sync.py#L441

https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/sync.py#L1242

If I just replace the body of this function with return 1, I don't get any errors and it appears to at least reduce the memory leak.

Potentially related? https://github.com/sqlalchemy/sqlalchemy/issues/7875

@zacherkkila @bhalonen

toluaina commented 2 years ago

Hi @markhalonen

markhalonen commented 2 years ago

Ok, will be interested to see. Maybe instead of select * we could try select txid or something?

toluaina commented 2 years ago

We could simply replace this line with self.execute(statement), and that should have the same effect. And yes, we can also limit it to a single column such as txid.
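
A minimal sketch of that idea, assuming a plain SQLAlchemy engine (the connection string and query below are illustrative placeholders, not pgsync's actual code): execute the statement for its side effect without fetching the rows, and query a single column rather than every column.

import sqlalchemy as sa

# Placeholder connection string; substitute your own database.
engine = sa.create_engine("postgresql://localhost/steelhead")

def execute_only(statement) -> None:
    # Run the statement but do not fetch the rows, so no result set is
    # materialised and held in Python memory.
    with engine.connect() as conn:
        conn.execute(statement)

# Illustrative usage: a single-column query instead of "SELECT *".
execute_only(sa.text("SELECT txid_current()"))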

toluaina commented 2 years ago

Ok, so I pushed some changes in commit cec8e13. Can you let me know if that has the same effect on memory usage?

bhalonen commented 2 years ago

There is no effect from this change on memory usage.

toluaina commented 2 years ago

I found the cause of this. The Node class is created multiple times and needs to implement a singleton pattern. I'll have a fix for this soon.
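
For illustration only, a hedged sketch of the kind of singleton-style caching being described, not pgsync's actual Node class: instances are cached by their identifying arguments, so repeated construction returns an existing object instead of allocating a new one each time.

class Node:
    # Cache of already-constructed nodes, keyed by (schema, table).
    _instances: dict = {}

    def __new__(cls, schema: str, table: str):
        key = (schema, table)
        if key not in cls._instances:
            cls._instances[key] = super().__new__(cls)
        return cls._instances[key]

    def __init__(self, schema: str, table: str):
        # __init__ runs on every construction, so guard against
        # re-initialising an instance that came from the cache.
        if getattr(self, "_initialised", False):
            return
        self.schema = schema
        self.table = table
        self._initialised = True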

toluaina commented 2 years ago

This has been addressed in the latest commit to master. I'm still doing some testing, but please try it out and let me know.

bhalonen commented 2 years ago

@toluaina we are still seeing memory leaking with the latest commit. We have a somewhat larger nested schema, so query size or query caching could be the source of the leak.

bhalonen commented 2 years ago

@toluaina we did not see leaking with a flat structure, and we've also seen slow leaks on an inactive db, building up over 24 hours. Not sure if that helps.

toluaina commented 2 years ago

This helps a lot, thanks.

bhalonen commented 2 years ago

@toluaina it is possible it was still leaking with a flat structure, but certainly much much slower.

bhalonen commented 2 years ago

@toluaina https://github.com/bhalonen/pgsync-mwe-leak: this example absolutely leaks memory slowly over time. The spam.sh loading causes memory to ramp, but I'm not sure whether it leaves uncollected memory behind. It isn't as high-quality a reproduction as it could be, but I did see steady-state memory usage climb 20% over several hours.

bhalonen commented 2 years ago

I have also run it with master and also had memory leaks (mounted file system, cloned, ran make install). I think I can get a higher-quality reproduction that isn't so slow... I'll also verify it isn't fixed on this release.

bhalonen commented 2 years ago

You did have some more changes besides the one I tested.

toluaina commented 2 years ago

Correct, I had some other minor changes which should not have much impact.

bhalonen commented 2 years ago

@toluaina After testing today, we still have a leak. All my attempts to attach memray resulted in segmentation faults, with no interesting trace result in gdb. However, I was playing around with async mode and noticed it does --not-- leak memory, but it does not actually update elastic either.

These are example logs for async mode under a write load that sync mode a) updates search results for, and b) leaks memory on. Async does neither... Not sure if this helps. Will try to make the example leak faster.

Async steelhead Xlog brent ver: [0] => Db: [0] => Redis: [total = 0 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [4] => Redis: [total = 4 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [4] => Redis: [total = 4 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [73] => Redis: [total = 72 pending = 1] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [542] => Redis: [total = 542 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [1,081] => Redis: [total = 1,081 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [1,594] => Redis: [total = 1,586 pending = 8] => Elastic: [0] ...

toluaina commented 2 years ago

I am still doing some in-depth analysis to assess the cause of any leaks.

That's a little strange. I just pulled the latest version and all changes were propagated to ES. Is Elasticsearch completely empty? Can you try deleting the state/checkpoint file and re-running? This should be the hidden file in the current directory named after the database and index.

toluaina commented 2 years ago

Actually, I can see the elastic document count is still at 0. That value should be increasing with each CRUD operation on the database.

bhalonen commented 2 years ago

Found a way to drive memory leaks: an incorrect specification in the schema file, where a "one_to_many" relationship is improperly marked as "one_to_one". I found this through random stress testing. This makes the leak go at warp speed, but I do not believe that is currently our problem.

I am currently wondering if "many_to_one" is a problem as well. For example, on a GitHub issue a comment belongs to one issue, an example of "one_to_many", but one commenter can belong to many issues. What is that classified as in pgsync? Is this relationship "legal"? We do many searches like that, and it could explain our leak.

On your questions: ES was not totally empty, as I ran a single-pass sync first to get some results in. It also didn't put any results in when run without the prior sync. Our typical startup procedure deletes all of the checkpoint files and all elastic data for a clean startup during dev.

This is an example of our most boring possible search index:

{
    "database": "steelhead",
    "index": "steelhead_search_part_number",
    "setting": {
      "analysis": {
        "analyzer": {
          "ngram_analyzer": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "ngram_tokenizer"
          }
        },
        "tokenizer": {
          "ngram_tokenizer": {
            "token_chars": [
              "letter",
              "digit",
              "punctuation",
              "symbol"
            ],
            "type": "ngram",
            "min_gram": "2",
            "max_gram": "10"
          }
        }
      },
      "max_ngram_diff": "8"
    },
    "plugins": [
      "IdInDomain"
    ],
    "nodes": {
      "table": "part_number",
      "schema": "steelhead",
      "columns": [
        "name",
        "created_at",
        "description_markdown"
      ],
      "children": [
        {
          "table": "user",
          "schema": "steelhead",
          "columns": [
            "name",
            "id",
            "created_at"
          ],
          "relationship": {
            "variant": "object",
            "type": "one_to_one"
          },
          "children": [
            {
              "table": "domain",
              "schema": "steelhead",
              "columns": [
                "id"
              ],
              "relationship": {
                "variant": "object",
                "type": "one_to_one"
              }
            }
          ]
        }
      ]
    }
}

toluaina commented 2 years ago

I created a working branch to study this in more detail. I also have a script to insert random records. So far memory peaks at about 640MB. I think the relationship you described would effectively be a many-to-many, and many-to-many relationships are not supported. I'll try playing around with the relationship types to see the effect.

toluaina commented 2 years ago

Apologies for the delay. This has required a bit of effort. I have identified the major cause of this. SQLAlchemy stores the SQL compilation cache on the engine (see "SQL Compilation Caching" in the SQLAlchemy docs). This is intended as a performance optimization when subsequent queries are run. I am now clearing this cache on each iteration. This has all been merged into the main branch.
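
For reference, Engine.clear_compiled_cache() is the public SQLAlchemy 1.4+ API for dropping that cache. The loop below is only a hedged sketch of clearing it between iterations, with a trivial query standing in for a sync pass; it is not pgsync's actual sync loop.

import time

import sqlalchemy as sa

# Placeholder connection string; substitute your own database.
engine = sa.create_engine("postgresql://localhost/steelhead")

def poll_forever() -> None:
    while True:
        with engine.connect() as conn:
            conn.execute(sa.text("SELECT 1"))  # stand-in for one sync pass
        # Drop the compiled-SQL cache held on the engine so it cannot
        # grow without bound across iterations.
        engine.clear_compiled_cache()
        time.sleep(1)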

sergiojgm commented 9 months ago

Was this fixed?