RubenKelevra opened this issue 4 years ago
@Stebalien please take a look when you get some time. :)
We need to start much smaller. The first step will be tracking reference counts on-disk instead of having to walk the entire data set each time.
From this proposal, it's unclear how this is all supposed to work in practice, especially how you plan to do this atomically and reliably without losing data. That's the very first part that needs to be fleshed out before we even try to tackle the problem of keeping the most used data.
> From this proposal, it's unclear how this is all supposed to work in practice, especially how you plan to do this atomically and reliably without losing data.
Well, I've got some more notes on that. Let me try to simplify the pipeline here:
The default counter value is 2^25 / 2 and the cache flag gets set; if no read requests happen, the block will age out - depending on the cache size, new blocks, etc. This obviously requires some sniffing of pin/MFS operations, so pin/MFS operations will send a list of blocks to the cache cleaner from time to time.
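To illustrate, here's a minimal Go sketch of that pipeline - a per-block entry whose counter starts at the default value, a cache flag, a probabilistic aging tick, and a hook for pin/MFS operations to report the blocks they reference. All names (tracker, blockState, ReportPinned, ...) are hypothetical, not existing go-ipfs APIs.

```go
package cachesketch

import (
	"math/rand"
	"sync"
)

const counterStart = 1 << 24 // 2^25 / 2, the assumed default counter value

type blockState struct {
	counter uint32 // 25-bit usage/age counter
	cached  bool   // set while the block is only cached, not pinned/MFS-referenced
	refs    uint32 // reference count maintained from pin/MFS reports
}

type tracker struct {
	mu     sync.Mutex
	blocks map[string]*blockState // keyed by CID string
}

func newTracker() *tracker {
	return &tracker{blocks: make(map[string]*blockState)}
}

// OnBlockAdded is called when a new block lands in the blockstore.
func (t *tracker) OnBlockAdded(cid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.blocks[cid] = &blockState{counter: counterStart, cached: true}
}

// ReportPinned is the "sniffing" side: pin/MFS operations periodically send
// the list of blocks they reference, so the cache cleaner can exclude them.
func (t *tracker) ReportPinned(cids []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range cids {
		if b, ok := t.blocks[c]; ok {
			b.refs++
			b.cached = false
		}
	}
}

// AgeTick probabilistically decrements the counters of cached blocks; blocks
// whose counter reaches zero have aged out and become eviction candidates.
func (t *tracker) AgeTick(p float64) (expired []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for c, b := range t.blocks {
		if !b.cached {
			continue
		}
		if rand.Float64() < p && b.counter > 0 {
			b.counter--
		}
		if b.counter == 0 {
			expired = append(expired, c)
		}
	}
	return expired
}
```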
This note (if fully implemented) will likely supersede, implement, or fix the following ideas/tickets:
Currently looking at a planned memory footprint of 12 bytes per block in total - no pointer overhead, no hash-table overhead, and no need to store the CID in memory.
This way a database of 20M blocks would be just 240 MB in memory – however, there's a 12-byte overhead per hash collision. So maybe 1 GB is a conservative, realistic value here.
So the whole database would be written out to disk periodically to recover the cache side of things (age, size, usage counter). However, if the database has no clean-shutdown flag, the whole pin tree would need to be rescanned on startup to update the ref-counter side of things.
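As a back-of-the-envelope sketch of the 12-byte figure - hypothetical names, assuming the record is an 8-byte short hash of the CID plus the 32-bit metadata word described further down:

```go
package cachesketch

import (
	"encoding/binary"
	"hash/fnv"
)

// entry is one per-block record. Serialized it is 12 bytes (8 + 4), so
// 20M blocks come out at roughly 240 MB; in memory a padding-free layout
// (e.g. a flat []byte or parallel arrays) would be needed to actually stay
// at 12 bytes per block.
type entry struct {
	shortHash uint64 // 8 bytes: hash of the CID, so the CID itself never has to be kept in memory
	meta      uint32 // 4 bytes: flags + counters (see "Potential Datastructure" below)
}

// shortHashOf collapses a CID string to the 8-byte key; two CIDs that collide
// here are the "12 byte overhead per hash-collision" case and need an extra record.
func shortHashOf(cid string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(cid))
	return h.Sum64()
}

// encode serializes one record into a 12-byte slot for the periodic on-disk
// flush; without a clean-shutdown marker the pin tree has to be rescanned on startup.
func (e entry) encode(buf []byte) {
	binary.LittleEndian.PutUint64(buf[0:8], e.shortHash)
	binary.LittleEndian.PutUint32(buf[8:12], e.meta)
}
```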
@iand you've asked for this feature for your use case (https://github.com/ipfs/go-ipfs/issues/8870), so here's the current status:
Blockers: After completing https://github.com/ipfs/ipfs-docs/issues/1176 and https://github.com/ipfs/go-ipfs-chunker/pull/31, and finding a workaround or a fix for https://github.com/ipfs/go-ipfs/issues/8694 to get pacman.store back up and running, I will work full time on this.
Caveat: No idea on the timespan to completion, though - I still need to learn a bit of Go along the way.
@RubenKelevra thanks, you might want to look at my in-progress implementation as part of https://github.com/ipfs/go-ipfs/issues/8870, although at the moment I have punted all the harder decisions to focus on getting metrics that can be used to assess viability (phase 1 in the linked issue). See https://github.com/ipfs/go-ipfs/compare/master...iand:block-accounting
> @RubenKelevra thanks, you might want to look at my in-progress implementation as part of https://github.com/ipfs/go-ipfs/issues/8870, although at the moment I have punted all the harder decisions to focus on getting metrics that can be used to assess viability (phase 1 in the linked issue). See https://github.com/ipfs/go-ipfs/compare/master...iand:block-accounting
Had a look at it and have been thinking it over. I think the amount of information you gather is too large.
I mean, if you save the fetch time of the last 3 fetches, do you expect to keep that information indefinitely for all discarded blocks?
The idea of saving the fetch time is quite interesting, but I think the metric will be pretty much useless in most cases. Here's why:
Fetching the first block always takes a lot of time (time to first byte), and this will dominate the metric on the upper end, so you end up storing just the first blocks of all requests if you weight it too heavily.
Also, if you transfer many blocks at the same time, the average receive time will be much higher than the optimal fetch time, as the bandwidth is shared.
So I really don't know if that metric will be of much help.
However, I think it's fair to add a flag if a block was hard to fetch, as long as it's saved - for example, if the dial needed multiple backoffs or the connection was unreliable for all sources, etc.
What do you think about that? :)
Certainly worth thinking about whether we can record certain difficulties with fetches and factor that in. The end result needs to be a score that we can use to remove low-value blocks.
Note that the objective here is to reduce the overall time-to-first-byte metric of the gateway, so recording those first-block fetch times is high signal for us. There are more notes on the actual project in Notion: prodeng project 1
Did you happen to read the caching paper I linked to in https://github.com/ipfs/go-ipfs/issues/8870?
> Note that the objective here is to reduce the overall time-to-first-byte metric of the gateway, so recording those first-block fetch times is high signal for us.
I think the metric is worthless. If you record the time to fetch the first block and retain that block, fetching the second block just becomes the new culprit. Since the first block is basically just references to the data blocks, the time to first byte will stay the same.
If you want to improve the response time I see a couple of options:
1) It could make sense to analyze the access patterns on the object level, not the block level.
This way you can determine which blocks are necessary to fill the first 3 TCP packets of a response with actual data (or something like this) and then mark them with a higher priority than the rest of the blocks. This allows different chunk sizes and data structures like the trickle DAG to be considered properly.
2) Probabilistic tiering, where a node pushes a value into its DHT entry indicating its estimated speed. This allows selecting the nodes which currently have the most free bandwidth instead of the ones with the lowest ping (or a more or less random choice).
3) The distributed, DHT-like cache retention (of a portion of the cache) as outlined in a comment on your request (sketched below). It allows guessing which of the peers the gateway is currently connected to is most likely to have the block in its possession. This is pretty important if you run a large cluster of gateways, which can hold different portions of the cache content this way – without any traffic to organize it.
So you just hold connections between your gateway cluster nodes, and data which is requested multiple times across different cluster nodes can be stored semi-permanently on just one node, while blocks with more distance get dropped more quickly.
So effectively every new node in a cluster of gateways will increase the overall cache performance by increasing the cache size and avoiding duplication of the same blocks on different nodes.
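A minimal sketch of that ownership idea, assuming a Kademlia-style XOR distance between a node ID and the block's hash decides which gateway keeps a block semi-permanently and which ones drop it first. Node IDs, the metric and SHA-256 are my assumptions here, not anything that exists in go-ipfs.

```go
package cachesketch

import (
	"bytes"
	"crypto/sha256"
)

// distance is the XOR metric between a node ID and a block's hash,
// compared lexicographically.
func distance(nodeID, blockHash []byte) []byte {
	d := make([]byte, len(blockHash))
	for i := range d {
		d[i] = nodeID[i%len(nodeID)] ^ blockHash[i]
	}
	return d
}

// likelyOwner returns the cluster node that should retain the block
// semi-permanently; the other nodes treat it as droppable first, so the
// cluster spreads its cache without any coordination traffic.
func likelyOwner(nodeIDs [][]byte, cid string) []byte {
	h := sha256.Sum256([]byte(cid))
	var best, bestDist []byte
	for _, id := range nodeIDs {
		d := distance(id, h[:])
		if bestDist == nil || bytes.Compare(d, bestDist) < 0 {
			best, bestDist = id, d
		}
	}
	return best
}
```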
> Did you happen to read the caching paper I linked to in ipfs/go-ipfs#8870?
I could not access it. Can you share a CID of it?
This may be relevant https://github.com/ipfs/boxo/issues/364
> Did you happen to read the caching paper I linked to in ipfs/go-ipfs#8870?
> I could not access it. Can you share a CID of it?
Added the CID in the original comment https://github.com/ipfs/go-ipfs/issues/8870#issuecomment-1112080636
> This may be relevant ipfs/boxo#364
Not sure I'm gonna need that for my implementation. The idea was to just count the space requirements of newly added blocks in the storage and do a tail drop at the same rate.
So there's no need to explicitly invoke garbage collection, as it is running all the time to keep the storage within the set boundaries.
We will just "flag" blocks which are stale, but not remove them from the storage (yet); the decision whether they should be removed is only made when there's pressure. (stale cache)
So between sweeps of the database, 5% of the storage can be rewritten.
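A rough sketch of that tail-drop / stale-flag behaviour, with hypothetical names; the scoring and sorting of entries is left out:

```go
package cachesketch

type cacheEntry struct {
	cid   string
	size  uint64
	score uint32 // lower = closer to eviction
	stale bool
}

type tailDrop struct {
	entries    []*cacheEntry // kept sorted by score, lowest first (sorting omitted)
	staleBytes uint64
}

// onAdd flags roughly `added` bytes worth of the lowest-scored, non-stale
// blocks as stale, so the cache stays within its bounds without a big GC run.
func (t *tailDrop) onAdd(added uint64) {
	var flagged uint64
	for _, e := range t.entries {
		if flagged >= added {
			break
		}
		if !e.stale {
			e.stale = true
			flagged += e.size
			t.staleBytes += e.size
		}
	}
}

// onHit moves a stale block back into the live section of the cache.
func (t *tailDrop) onHit(e *cacheEntry) {
	if e.stale {
		e.stale = false
		t.staleBytes -= e.size
	}
}

// underPressure actually deletes stale blocks until `need` bytes are freed.
func (t *tailDrop) underPressure(need uint64, remove func(cid string)) {
	var freed uint64
	kept := t.entries[:0]
	for _, e := range t.entries {
		if e.stale && freed < need {
			remove(e.cid)
			freed += e.size
			t.staleBytes -= e.size
			continue
		}
		kept = append(kept, e)
	}
	t.entries = kept
}
```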
Short status-update: I'm digging into go right now to see how to implement this "best", meaning low complexity but also very high performance.
> Did you happen to read the caching paper I linked to in ipfs/kubo#8870?
The described algorithm seems to be tailored specifically to mutable data under a static URI; IPFS has immutable data under a static URI.
So ui and ci don't make any sense in our context. I feel like si is similarly useless: while it makes sense to take the size of a document into account on a webserver, where you may have 1 KB HTML files and multi-gigabyte archives or videos, it doesn't make sense at the block level. Sure, IPFS blocks vary in size (that's why I need to keep some track of the size here as well), but I don't use the size as a metric the way LNC-R-W-U does - for obvious reasons: even a 1 MB block transfers much more quickly than a multi-GB file. Latency, DHT discovery and the slow start of the TCP/QUIC connection have a much greater impact on the transfer time than the size.
What fits IPFS block-level storage better is the Multi-Generational LRU (that's what Google engineers recently built for an upcoming version of the Linux kernel [1][2]), and that's basically what this proposal is doing, just with a very large number of generations while keeping the space requirements as low as possible.
[1] https://lwn.net/Articles/856931/
[2] https://www.phoronix.com/scan.php?page=news_item&px=Multigenerational-LRU-v2
Status
Blockers (to do first):
Todo list
Update draft to include
Introduction
Hey guys,
I'm pretty new to this project, but I see an opportunity to contribute a little and learn some Go along the way. I learned C a long time ago, and I'm currently only fluent in Python and shell scripting - so make sure to thoroughly check my contributions.
Automatic cache cleanup
Current approach
There's a garbage collector that is designed to sweep the whole block storage and remove everything which is not part of the MFS and not part of the pins. The garbage collector can be run manually offline, manually online, or enabled to run when the repo reaches a certain percentage of its configured size.
There's no logging of the number of requests for a block, its age, nor how often a block is referenced. A garbage collector run is pretty slow, since it traverses the whole local database to look for pins and their recursive references, as well as the MFS, to make sure it drops only non-referenced objects.
New approach
A background task will be added which will use a database to keep track of references, the number of requests, age, and size of each block.
On first initialization (or with a flag) it will traverse the whole pin/MFS database to create its database (or make sure it's consistent).
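A rough interface sketch of that background task; the method set and names are my own for illustration, not an existing go-ipfs API:

```go
package cachesketch

import "context"

// BlockTracker is the database the background task maintains for every block:
// reference count, request count, age and size.
type BlockTracker interface {
	// Rebuild traverses the pin/MFS roots to (re)create the reference counts,
	// used on first initialization or when consistency cannot be assumed.
	Rebuild(ctx context.Context) error

	// IncRef / DecRef are driven by pin, unpin and MFS operations.
	IncRef(cid string) error
	DecRef(cid string) error

	// RecordRequest bumps the request counter and refreshes the age of a block.
	RecordRequest(cid string) error

	// RecordAdd registers a newly added block and its size.
	RecordAdd(cid string, size uint64) error

	// Flush writes the tracker state to disk so it survives restarts.
	Flush(ctx context.Context) error
}
```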
Goals
(in priority order)
Limitations
(in no particular order)
2^25 / (10 * (0.1 + 0.01) / 2) / 60 / 60 / 24 = 706 days - if zero hits occur, and given that the storage is large enough to hold the block for this timeframe. Objects which exceed this value will be placed in the stale section of the cache, which is capped at 10% relative size by default. Blocks in the stale section will be moved back to the main section when requests hit them - or dropped at any time under cache pressure. (The time limit can be raised by switching to 27 bit or by changing the clock interval - the default is 10 seconds.)
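For reference, the same arithmetic as a tiny Go program; it just evaluates the expression above for the default 25-bit counter and the 27-bit variant mentioned as a knob:

```go
package main

import "fmt"

// lifetimeDays evaluates the expression from the text:
// 2^bits / (10 * (0.1 + 0.01) / 2) / 60 / 60 / 24
func lifetimeDays(bits uint) float64 {
	avg := (0.1 + 0.01) / 2
	return float64(uint64(1)<<bits) / (10 * avg) / 60 / 60 / 24
}

func main() {
	fmt.Printf("25-bit counter: ~%.0f days\n", lifetimeDays(25)) // ~706 days, as stated above
	fmt.Printf("27-bit counter: ~%.0f days\n", lifetimeDays(27)) // four times as long
}
```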
Proposed Features
(in no particular order)
Potential configuration knobs
(all sizes are estimates) (relative: excludes pinned/new blocks)
Potential Datastructure
32-bit integer value:
= 0: New block (no reference information yet)
The dirty flag marks incomplete writes, to recover from them without analyzing the whole data store.
The seldom flag marks if a cache entry is part of the seldom cache. If the reference bit is set, this bit has no function.
The precaching flag marks if a cache entry is part of the precache. If the reference bit is set, this bit has no function.
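One way the 32-bit word could be packed - purely illustrative, since the draft doesn't pin down bit positions. It assumes the 25-bit counter from the aging math above, the four flag bits described here, and 3 bits for the rounded size class:

```go
package cachesketch

const (
	flagReference  = 1 << 31 // block is referenced by a pin or MFS
	flagDirty      = 1 << 30 // incomplete write; allows recovery without a full rescan
	flagSeldom     = 1 << 29 // entry belongs to the seldom cache (ignored if reference is set)
	flagPrecaching = 1 << 28 // entry belongs to the precache (ignored if reference is set)

	counterBits = 25
	counterMask = (1 << counterBits) - 1 // low 25 bits: usage/age counter
	sizeShift   = counterBits
	sizeMask    = 0x7 // 3 bits left over for a rounded size class
)

// counter extracts the 25-bit counter; a metadata word of 0 means
// "new block, no reference information yet".
func counter(meta uint32) uint32 { return meta & counterMask }

// withSizeClass stores a 3-bit size-class index next to the counter.
func withSizeClass(meta uint32, class uint32) uint32 {
	return (meta &^ (sizeMask << sizeShift)) | ((class & sizeMask) << sizeShift)
}
```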
Size estimate values
The size bits will be rounded to the nearest value in this list:
Caching algorithm
A newly added block will be added to the new-block list and gets a timestamp. If the timestamp is older than the latest completed pinning/writing operation and the block wasn't referenced, it will be moved to the cache with the default value. The timestamp is then discarded.
Probability curve
0.1; 0.01 at 0; 0.2 at 2^25 / 2 + 2^25 / 4 (25165824). Exact curve TBD.
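Since the exact curve is still TBD, a generic piecewise-linear interpolation over whatever control points end up being chosen would be enough; the two example points below are just the values recoverable from the draft above:

```go
package cachesketch

// curvePoint maps a counter value to a decrement probability.
type curvePoint struct {
	counter uint32
	prob    float64
}

// decrementProb linearly interpolates between control points; points must be
// sorted by counter. Values outside the range are clamped to the end points.
func decrementProb(points []curvePoint, counter uint32) float64 {
	if len(points) == 0 {
		return 0
	}
	if counter <= points[0].counter {
		return points[0].prob
	}
	for i := 1; i < len(points); i++ {
		if counter <= points[i].counter {
			lo, hi := points[i-1], points[i]
			t := float64(counter-lo.counter) / float64(hi.counter-lo.counter)
			return lo.prob + t*(hi.prob-lo.prob)
		}
	}
	return points[len(points)-1].prob
}

// Example control points taken from the draft (exact curve TBD):
// 0.01 at counter 0, 0.2 at 2^25/2 + 2^25/4 (25165824).
var draftCurve = []curvePoint{
	{counter: 0, prob: 0.01},
	{counter: 25165824, prob: 0.2},
}
```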