I haven't run any serious scalability tests. In terms of "what to worry about," though, the order is roughly: addresses, then transactions, then outputs.
The number of outputs will grow and shrink over time, but addresses and transactions just accumulate forever. So my goal would be to handle 10M addresses, 1M transactions (2KB each), and 100K outputs.
The most pressing problem today is not boltdb, but the API: there is no pagination, so it'll return all of your addresses/transactions/outputs at once. That probably won't scale beyond 100K or so.
Agreed. Addresses are more serious. We've been running Walrus for a day with a single output, and it already has 10512 addresses. If we keep using only one output, we wouldn't exceed 10M addresses for about three years. But if we create, for example, 1000 outputs, we might exceed 10M addresses in a day...
Time to re-open https://gitlab.com/NebulousLabs/Sia/-/issues/1478 ?
That one is about siad's wallet. I'm not sure whether Walrus uses its code, but we can no longer run `./siac wallet addresses`, so they also need to fix that problem.
> We've been running Walrus for a day with a single output, and it already has 10512 addresses
Yikes. I suppose I could add a "less privacy" setting that reuses addresses. Or maybe watch for which addresses appear in the chain, and reuse ones that haven't shown up in the blockchain after some period of time? Not sure.
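A rough sketch of that second idea, just to make it concrete (hypothetical types and names; nothing like this exists in walrus today): track when each address was handed out and whether it has appeared on chain, and hand back an old, never-seen address before deriving a fresh one.

```go
// Hypothetical sketch of the "reuse addresses that never appeared on chain" idea.
// None of these types or functions exist in walrus; names are made up.
package wallet

import "time"

type trackedAddr struct {
	addr        string    // the address (types.UnlockHash in siad/us)
	createdAt   time.Time // when the wallet first handed it out
	seenOnChain bool      // set once the address appears in a block
}

// reusableAddress returns an address that was handed out at least maxAge ago
// and has never appeared on chain. If none exists, the caller should derive
// a fresh address from the seed as usual.
func reusableAddress(tracked []trackedAddr, maxAge time.Duration) (string, bool) {
	cutoff := time.Now().Add(-maxAge)
	for _, a := range tracked {
		if !a.seenOnChain && a.createdAt.Before(cutoff) {
			return a.addr, true
		}
	}
	return "", false
}
```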
> But if we create, for example, 1000 outputs, we might exceed 10M addresses in a day...
To be clear, you get lots of addresses because you create a new one every time you need one. So you can accidentally generate a lot if you're doing integration tests or something like that. Also, when you attempt to form a contract, it'll generate a new address, and that address sticks around even if you fail to form the contract. So the number of addresses isn't strongly correlated with the number of outputs you have.
That's true. We fail to form/renew contracts with the non-existent output error, which increases the number of addresses. So hopefully splitting outputs will reduce such unnecessary addresses.
Regarding reusing addresses, I wouldn't mind if a reused address is used as the output of a contract-forming transaction. If I'm not mistaken, neither siad nor us mixes a contract-forming transaction with a transfer transaction. That means everyone already knows the input and output belong to the same owner, so it might not be a big problem.
I asked Chris for some feedback to understand this problem a bit better. His feedback:
Hm, ok, you probably want to try !2511 first to see if that helps. He might just be running out of memory due to all the encoding overhead. If that works, you could optimise the endpoint a bit by always keeping a sorted slice of keys in memory instead of recomputing it every time the endpoint is called. That blows memory up to 2x, but never more than that, and you also don't need to sort the slice every time.
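A minimal sketch of that suggestion (not Chris's code; the type and field names are invented): keep the address list sorted as addresses are added, so the endpoint only slices out a page instead of rebuilding and sorting the whole list on every request.

```go
// Hypothetical sketch: maintain a sorted slice of addresses alongside the
// address -> key map, so the addresses endpoint can paginate cheaply.
package wallet

import "sort"

type addrStore struct {
	sorted []string // always kept in sorted order
}

// add inserts addr at its sorted position (binary search + shift),
// so the endpoint never has to sort on demand.
func (s *addrStore) add(addr string) {
	i := sort.SearchStrings(s.sorted, addr)
	s.sorted = append(s.sorted, "")
	copy(s.sorted[i+1:], s.sorted[i:])
	s.sorted[i] = addr
}

// page returns addresses [offset, offset+limit) for a paginated endpoint.
func (s *addrStore) page(offset, limit int) []string {
	if offset >= len(s.sorted) {
		return nil
	}
	end := offset + limit
	if end > len(s.sorted) {
		end = len(s.sorted)
	}
	return s.sorted[offset:end]
}
```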
Finally, we might also be able to reduce the in-memory footprint of keys quite a bit by changing how we store them. Right now we precompute them and keep them in a map from address to secret, where the address is 32 bytes and the secret is 64 bytes. It might be good enough to use only the first 8 bytes of the address as the map key and store the 8-byte seed index instead of the full secret, computing the secret on demand from the seed. That could get us from 96+ bytes per address down to a bit more than 16 bytes, so even with the 2x blowup from before, the overall memory footprint would still be about 3x smaller.
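Roughly what that could look like (a sketch only; the prefix type, map, and `deriveSecret` below are invented for illustration, and siad's real seed-based derivation differs in detail, though it follows the same shape):

```go
// Hypothetical sketch of the compact key map described above: keep only an
// 8-byte address prefix -> 8-byte seed index in memory, and re-derive the
// 64-byte secret on demand instead of caching it.
package wallet

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
)

type addrPrefix [8]byte

type compactKeys struct {
	seed    [32]byte              // wallet seed
	indices map[addrPrefix]uint64 // ~16 bytes per address vs. 96+ before
}

// secretFor re-derives the 64-byte secret key for addr on demand, or reports
// that the address does not belong to this wallet.
func (c *compactKeys) secretFor(addr [32]byte) ([64]byte, bool) {
	var p addrPrefix
	copy(p[:], addr[:8])
	idx, ok := c.indices[p]
	if !ok {
		return [64]byte{}, false
	}
	return deriveSecret(c.seed, idx), true
}

// deriveSecret is a placeholder for siad's seed-based key derivation:
// hash the seed together with the index and use the result as an ed25519 seed.
func deriveSecret(seed [32]byte, index uint64) [64]byte {
	buf := make([]byte, 40)
	copy(buf, seed[:])
	binary.LittleEndian.PutUint64(buf[32:], index)
	h := sha256.Sum256(buf)
	priv := ed25519.NewKeyFromSeed(h[:])
	var sk [64]byte
	copy(sk[:], priv)
	return sk
}
```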
siad right now definitely won't scale to a billion addresses. An address uses around 100 bytes in memory, so a billion addresses would require 100 GB of RAM.
Depending on how you want to query them, that's also too many to compute on demand. The only way to scale to that number is to rewrite the wallet to store the addresses in a database on disk, ideally something that lets you query them in some sort of order to allow for easy pagination.
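Since walrus already sits on bbolt, something along these lines could provide the ordered, paginatable on-disk store Chris describes (a sketch only; the "addresses" bucket name and this function are invented). bbolt keeps keys in byte-sorted order, so a cursor gives "start after X, return N" pagination directly:

```go
// Hypothetical sketch of cursor-based address pagination over a bbolt bucket.
package wallet

import (
	"bytes"

	bolt "go.etcd.io/bbolt"
)

// addressPage returns up to limit addresses that sort after `after`
// (pass nil to start from the beginning).
func addressPage(db *bolt.DB, after []byte, limit int) ([][]byte, error) {
	var page [][]byte
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("addresses")) // hypothetical bucket name
		if b == nil {
			return nil
		}
		c := b.Cursor()
		var k []byte
		if after == nil {
			k, _ = c.First()
		} else {
			k, _ = c.Seek(after)
			if k != nil && bytes.Equal(k, after) {
				k, _ = c.Next() // start strictly after the last address seen
			}
		}
		for ; k != nil && len(page) < limit; k, _ = c.Next() {
			addr := append([]byte(nil), k...) // copy: k is only valid inside the tx
			page = append(page, addr)
		}
		return nil
	})
	return page, err
}
```

With something like this, the addresses/transactions/outputs endpoints could take `after` and `limit` parameters instead of returning everything, which also addresses the pagination concern above.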
Theoretically it's not too bad. The wallet has a `keys` field which would need to be replaced with some type that is backed by a database; instead of storing the secret, it would just store the index used to derive the address from the seed. I'm pretty sure something like a Redis cache would do.
TL;DR: Replace `keys` with a disk-backed DB (e.g. Redis), and instead of storing the secret, store the index used to derive the address from the seed.
Can we join forces and make this happen?
We also need to reuse addresses to keep the number low: https://github.com/lukechampine/us/issues/84.
Do you happen to know how many outputs a Walrus instance can handle? If we had, for example, 10000 outputs, would it work? BoltDB seems able to handle data at that scale (https://github.com/etcd-io/bbolt#project-status), so I hope Walrus could also handle a lot of outputs.