gristlabs / grist-core

Grist is the evolution of spreadsheets.
https://www.getgrist.com/
Apache License 2.0
6.97k stars 308 forks source link

Huge sqlite database when writing via API calls #1163

Open MHOOO opened 4 weeks ago

MHOOO commented 4 weeks ago

Describe the current behavior

I have a python backend service that continuously updates tables with new data on a self-hosted grist-core instance via docker. Over the past months, this has apparently resulted in an SQLite DB that is over 900MB large without snapshots. I could now get rid of most history states via the /{did}/states/remove endpoint, resulting in an SQLite DB that is only 500 KB large. Is there some way to disable history writing when using the Python / REST API or have it automatically clean up states every now & then?

Steps to reproduce

  1. Startup any grist core instance via e.g. docker
  2. Spam row updates
  3. See your SQLite DB grow infinitely!

Describe the expected behavior

SQLite DB should not grow infinitely on row updates

Where have you encountered this bug?

Instance information (when self-hosting only)

MHOOO commented 4 weeks ago

Oh, another thing I noticed, is the grist-sessions.db growing upto 900MB as well. However I could just delete it and grist ran as normal. What is stored inside the table for it to grow this large? Am I potentially creating new sessions with every REST API call?

fflorent commented 4 weeks ago

Hello,

Is there some way to disable history writing when using the Python / REST API or have it automatically clean up states every now & then?

Not yet, but I think your request is already tracked here (when you say "have it automatically clean up states every now & then"): https://github.com/gristlabs/grist-core/issues/1121

I tend to think it should be quite easy and straightforward to implement it (not 100% sure), if you want to and have time to take a look. I would be glad to help if so!

MHOOO commented 4 weeks ago

I guess exposing ACTION_HISTORY_MAX_ROWS & ACTION_HISTORY_MAX_BYTES as environment Variables should work, right? I'm amazed I did not reach the 1000 rows, since that would be only 1000s in my case (I have a status table that I update every second).

fflorent commented 4 weeks ago

I would say so. As stated Paul, we should also check the value and ensure it is strictly positive:

 Pruning to 0 would currently leave the document unusable, since in there is currently no way to work with a document with a null "current last action". I'm sure this could be overcome. An alternative could be to add an empty action and then prune everything before it.

I tend to think we may also let the user configure the other variables too, without much opinion.

If you agree that https://github.com/gristlabs/grist-core/issues/1121 would solve your issue, would you agree to close your issue and continue the discussion there?

MHOOO commented 4 weeks ago

Yes, that would work for me. Though I would still like to know what caused grist-sessions.db to grow this large. Am I potentially misusing the system by creating a new session with every request?

fflorent commented 4 weeks ago

No, that's because of something you already pointed out:

I have a python backend service that continuously updates tables with new data on a self-hosted grist-core instance via docker

Every action you make on the document, whether it is through the UI or the API, adds a new history entry, explaining why with how you use the document it went so big.

MHOOO commented 3 weeks ago

Oh, I was under the assumption that the history was stored inside the actual document SQLite DB and not the grist-sessions.db. OK, good to know

paulfitz commented 3 weeks ago

Oh, I was under the assumption that the history was stored inside the actual document SQLite DB and not the grist-sessions.db

That's correct, history is stored inside the SQLite DB, in the _gristsys_ActionHistory and _gristsys_ActionHistoryBranch tables.

The grist-sessions.db table stores session information, when Redis is not available. It does sound like you might be getting a lot of session churn somehow? Are the requests made anonymously or with an api key?

fflorent commented 3 weeks ago

I have missed you talked about grist-sessions.db and not the document, sorry for the confusion @MHOOO

MHOOO commented 1 week ago

The grist-sessions.db table stores session information, when Redis is not available. It does sound like you might be getting a lot of session churn somehow? Are the requests made anonymously or with an api key?

I have set GRIST_API_KEY to a valid key, so these should be made non-anonymously