status-im / infra-status-legacy

Infrastructure for old Status fleet
https://github.com/status-im/nim-waku

Increase store node retention to 30 days #18

Closed felicio closed 2 years ago

felicio commented 2 years ago

Why

The Status products are built around a number of time-based message availability assumptions (a legacy of the Waku v1 30-day message persistence guarantee). As we move to Waku v2, it is vital that Waku v2 store nodes can offer the same time-based guarantees that Waku v1 mailserver nodes offered.

https://docs.google.com/document/d/1it9_HTzOTvumBsUoeY3SEVFTJA-sSYrW89-HsEjriEg/edit?usp=sharing

felicio commented 2 years ago

@jm-clius, apologies. This PR is in the right repo; the other one was in my fork.

felicio commented 2 years ago

@jm-clius please,

will this take effect with the next deployments of

and are any manual db changes necessary?

jm-clius commented 2 years ago

This will take effect with the next deployment of each, yes. There will be no manual changes necessary: the node will simply load with a new retention policy and only delete messages once they're older than 30 days. @LNSD am I missing something here?

LNSD commented 2 years ago

Yes, that's correct. In SQLite-only mode, we apply the time-based retention policy. If a message is older than the configured retention period, it is removed from the database.
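The time-based policy described here can be illustrated with a minimal sketch. The table name `message`, the column `stored_at`, and the function `apply_time_retention` are hypothetical and chosen only for illustration; they are not taken from the nwaku schema or code.

```python
import sqlite3
import time

DAY_SECONDS = 24 * 60 * 60
RETENTION_SECONDS = 30 * DAY_SECONDS  # the 30-day window discussed in this PR


def apply_time_retention(conn, now, retention_seconds=RETENTION_SECONDS):
    """Delete rows older than the retention window and return how many were removed."""
    cutoff = now - retention_seconds
    cur = conn.execute("DELETE FROM message WHERE stored_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Hypothetical schema: one row per stored message, with a Unix timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message ("
    "id INTEGER PRIMARY KEY, stored_at INTEGER NOT NULL, payload BLOB)"
)

now = int(time.time())
conn.executemany(
    "INSERT INTO message (stored_at, payload) VALUES (?, ?)",
    [
        (now - 31 * DAY_SECONDS, b"old"),     # outside the window
        (now - 29 * DAY_SECONDS, b"recent"),  # inside the window
        (now, b"new"),                        # inside the window
    ],
)

removed = apply_time_retention(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM message").fetchone()[0]
print(f"removed={removed} remaining={remaining}")
```

Under this sketch, raising the retention period needs no manual database change: rows that would previously have been deleted simply stay until they cross the new cutoff.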

jakubgs commented 2 years ago

> This will take effect with the next deployment of each, yes. There will be no manual changes necessary

@jm-clius that is not true.

  1. There is no automation that deploys Ansible changes to fleets.
  2. 30 days retention will not work as the storage available on the nodes is not enough.

@felicio Why was this merged without my input?

jakubgs commented 2 years ago

For example, currently 14 days of messages on node-01.ac-cn-hongkong-c.status.test takes up 16 GB:

admin@node-01.ac-cn-hongkong-c.status.test:~ % du -hs /docker/nim-waku/data
16G /docker/nim-waku/data

admin@node-01.ac-cn-hongkong-c.status.test:~ % df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        40G   26G   13G  68% /

And it has only 13 GB disk space left, so clearly not enough for ~32 GB of messages. There's a reason why the limit was lower.
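The disk-space concern can be checked with a rough calculation. This is a sketch that assumes storage grows linearly with the retention period, which the thread does not confirm; the helper name `projected_usage_gb` is made up for illustration. The inputs are the figures from the `du` and `df` output above.

```python
def projected_usage_gb(current_gb, current_days, target_days):
    """Linearly extrapolate storage use to a longer retention period."""
    return current_gb * target_days / current_days


current_gb = 16    # du -hs /docker/nim-waku/data
available_gb = 13  # "Avail" column of df -h /

projected = projected_usage_gb(current_gb, current_days=14, target_days=30)
extra_needed = projected - current_gb
fits = extra_needed <= available_gb

print(f"projected: {projected:.1f} GB, extra needed: {extra_needed:.1f} GB, fits: {fits}")
```

Linear extrapolation lands in the same range as the estimate quoted above, and the additional space required exceeds the 13 GB still free on the node.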

jakubgs commented 2 years ago

I'm going to undo this commit for now. I need to find the time to modify the storage setup on the nodes before this can be applied.

jakubgs commented 2 years ago

Also, one more note. Don't use backticks in commit titles. They are not supposed to be formatted. Just because GitHub can format those titles doesn't mean it's a good idea. Sounds like a great way to have a bad time when automating things in git when titles are involved.

felicio commented 2 years ago

@jakubgs understood, I am sorry and thank you.

LNSD commented 2 years ago

> For example, currently 14 days of messages on node-01.ac-cn-hongkong-c.status.test takes up 16 GB:

This is an issue that we are aiming to solve soon. The SQLite vacuuming was disabled, and the space it should take is smaller. I vacuumed the DB copy that @jakubgs provided me some time ago (12GB), and the size went down almost 50% (7.5GB).
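The effect of vacuuming can be reproduced on a throwaway database. This is a minimal sketch using a hypothetical `message` table with random payloads, not the real store schema, and it assumes SQLite's default settings (auto-vacuum off). Deleting rows leaves free pages inside the file, and `VACUUM` rebuilds the file without them.

```python
import os
import sqlite3
import tempfile

# Throwaway database file; the path and schema are illustrative only.
path = os.path.join(tempfile.mkdtemp(), "store.sqlite3")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, payload BLOB)")
conn.executemany(
    "INSERT INTO message (payload) VALUES (?)",
    ((os.urandom(1024),) for _ in range(2000)),
)
conn.commit()

# Simulate retention cleanup: remove half of the rows.
conn.execute("DELETE FROM message WHERE id % 2 = 0")
conn.commit()
size_before = os.path.getsize(path)  # freed pages are still part of the file

# VACUUM must run outside a transaction, hence the commit above.
conn.execute("VACUUM")
size_after = os.path.getsize(path)
conn.close()

print(f"before VACUUM: {size_before} bytes, after VACUUM: {size_after} bytes")
```

How much space is reclaimed depends on how many rows were deleted since the last vacuum, so the roughly 50% figure reported above applies to that particular database copy.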

cc @jm-clius

jm-clius commented 2 years ago

> @jm-clius that is not true.
>
>   1. There is no automation that deploys Ansible changes to fleets.

@jakubgs, ah, thanks! I thought the config changes were applied via some magic. Now I know. :)

> @felicio Why was this merged without my input?

Apologies, @felicio, my bad. I should have been clearer that, while we can verify config items etc., infra changes generally wait for Jakub's approval (e.g. to verify disk space, as is the case here).

jakubgs commented 2 years ago

@felicio shit happens, now you know what the process is. @LNSD I see, that's good to hear. I wasn't aware the difference would be as big as 50%.

If that's the case, we can try enabling this on the test fleet first and see how it goes. I will try to get to that today or tomorrow.

jm-clius commented 2 years ago

@jakubgs @felicio we had a discussion about this during our weekly Waku Product call. Could we delay increasing the store retention while this investigation is still ongoing? https://github.com/status-im/nwaku/issues/1146 @LNSD has a number of store improvements/fixes in mind, and not increasing the retention (for now) will reduce the number of new unknowns.

jakubgs commented 2 years ago

Sounds good.