RubenKelevra opened this issue 4 years ago
@Stebalien please take a look when you get some time. :)
We need to start much smaller. The first step will be tracking reference counts on-disk instead of having to walk the entire data set each time.
From this proposal, it's unclear how this is all supposed to work in practice, especially how you plan to do this atomically and reliably without losing data. That's the very first part that needs to be fleshed out before we even try to tackle the problem of keeping the most used data.
> From this proposal, it's unclear how this is all supposed to work in practice, especially how you plan to do this atomically and reliably without losing data.
Well, I've got some more notes on that. Let me try to simplify the pipeline here:
The default counter value is 2^25 / 2 and the cache flag gets set; if no read requests happen, the block will age out - depending on the cache size, new blocks, etc. This obviously requires some sniffing of pin/MFS operations, so pin/MFS operations will send a list of blocks to the cache cleaner from time to time.
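To illustrate, here's a minimal Go sketch of that pipeline - a per-block entry whose counter starts at the default value, a cache flag, a probabilistic aging tick, and a hook for pin/MFS operations to report the blocks they reference. All names (tracker, blockState, ReportPinned, ...) are hypothetical, not existing go-ipfs APIs.

```go
package cachesketch

import (
	"math/rand"
	"sync"
)

const counterStart = 1 << 24 // 2^25 / 2, the assumed default counter value

type blockState struct {
	counter uint32 // 25-bit usage/age counter
	cached  bool   // set while the block is only cached, not pinned/MFS-referenced
	refs    uint32 // reference count maintained from pin/MFS reports
}

type tracker struct {
	mu     sync.Mutex
	blocks map[string]*blockState // keyed by CID string
}

func newTracker() *tracker {
	return &tracker{blocks: make(map[string]*blockState)}
}

// OnBlockAdded is called when a new block lands in the blockstore.
func (t *tracker) OnBlockAdded(cid string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.blocks[cid] = &blockState{counter: counterStart, cached: true}
}

// ReportPinned is the "sniffing" side: pin/MFS operations periodically send
// the list of blocks they reference, so the cache cleaner can exclude them.
func (t *tracker) ReportPinned(cids []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range cids {
		if b, ok := t.blocks[c]; ok {
			b.refs++
			b.cached = false
		}
	}
}

// AgeTick probabilistically decrements the counters of cached blocks; blocks
// whose counter reaches zero have aged out and become eviction candidates.
func (t *tracker) AgeTick(p float64) (expired []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for c, b := range t.blocks {
		if !b.cached {
			continue
		}
		if rand.Float64() < p && b.counter > 0 {
			b.counter--
		}
		if b.counter == 0 {
			expired = append(expired, c)
		}
	}
	return expired
}
```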
This note (if fully implemented) will likely supersede, implement, or fix the following ideas/tickets:
Currently looking at a planned memory footprint of 12 bytes per block in total - no pointer overhead, no hash-table overhead, and no need to store the CID in memory.
This way a database of 20M blocks would be just 240 MB in memory – however, there's a 12-byte overhead per hash collision. So maybe 1 GB is a conservative, realistic value here.
So the whole database would be written out to disk periodically to recover the cache side of things (age, size, usage counter). However, if the database has no clean-shutdown flag, the whole pin tree would need to be rescanned on startup to update the ref-counter side of things.
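As a back-of-the-envelope sketch of the 12-byte figure - hypothetical names, assuming the record is an 8-byte short hash of the CID plus the 32-bit metadata word described further down:

```go
package cachesketch

import (
	"encoding/binary"
	"hash/fnv"
)

// entry is one per-block record. Serialized it is 12 bytes (8 + 4), so
// 20M blocks come out at roughly 240 MB; in memory a padding-free layout
// (e.g. a flat []byte or parallel arrays) would be needed to actually stay
// at 12 bytes per block.
type entry struct {
	shortHash uint64 // 8 bytes: hash of the CID, so the CID itself never has to be kept in memory
	meta      uint32 // 4 bytes: flags + counters (see "Potential Datastructure" below)
}

// shortHashOf collapses a CID string to the 8-byte key; two CIDs that collide
// here are the "12 byte overhead per hash-collision" case and need an extra record.
func shortHashOf(cid string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(cid))
	return h.Sum64()
}

// encode serializes one record into a 12-byte slot for the periodic on-disk
// flush; without a clean-shutdown marker the pin tree has to be rescanned on startup.
func (e entry) encode(buf []byte) {
	binary.LittleEndian.PutUint64(buf[0:8], e.shortHash)
	binary.LittleEndian.PutUint32(buf[8:12], e.meta)
}
```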
@iand you've asked for this feature for your use case (https://github.com/ipfs/go-ipfs/issues/8870), so here's the current status:
Blockers: After completing https://github.com/ipfs/ipfs-docs/issues/1176 and https://github.com/ipfs/go-ipfs-chunker/pull/31, and finding a workaround or a fix for https://github.com/ipfs/go-ipfs/issues/8694 to get pacman.store back up and running, I will work full time on this.
Caveat: No idea on the timespan to completion, though - I still need to learn a bit of Go along the way.
@RubenKelevra thanks, you might want to look at my in-progress implementation as part of https://github.com/ipfs/go-ipfs/issues/8870, although at the moment I have punted all the harder decisions to focus on getting metrics that can be used to assess viability (phase 1 in the linked issue). See https://github.com/ipfs/go-ipfs/compare/master...iand:block-accounting
> @RubenKelevra thanks, you might want to look at my in-progress implementation as part of https://github.com/ipfs/go-ipfs/issues/8870, although at the moment I have punted all the harder decisions to focus on getting metrics that can be used to assess viability (phase 1 in the linked issue). See https://github.com/ipfs/go-ipfs/compare/master...iand:block-accounting
Had a look at it and have been thinking it over. I think the amount of information you gather is too large.
I mean, if you save the fetch time of the last 3 fetches, do you expect to keep that information indefinitely for all discarded blocks?
The idea of saving the fetch time is quite interesting, but I think the metric will be pretty much useless in most cases. Here's why:
Fetching the first block always takes a lot of time (time to first byte), and this will dominate the metric on the upper end, so you end up storing just the first blocks of all requests if you weight it too heavily.
Also, if you transfer many blocks at the same time, the average receive time will be much higher than the optimal fetch time, as the bandwidth is shared.
So I really don't know if that metric will be of much help.
However, I think it's fair to add a flag if a block was hard to fetch, as long as it's saved - for example, if the dial needed multiple backoffs or the connection was unreliable for all sources, etc.
What do you think about that? :)
Certainly worth thinking about whether we can record certain difficulties with fetches and factor that in. The end result needs to be a score that we can use to remove low-value blocks.
Note that the objective here is to reduce the overall time-to-first-byte metric of the gateway, so recording those first-block fetch times is high signal for us. There are more notes on the actual project in Notion: prodeng project 1
Did you happen to read the caching paper I linked to in https://github.com/ipfs/go-ipfs/issues/8870?
> Note that the objective here is to reduce the overall time-to-first-byte metric of the gateway, so recording those first-block fetch times is high signal for us.
I think the metric is worthless. If you record the time to fetch the first block and retain that block, fetching the second block just becomes the new culprit. Since the first block is basically just references to the data blocks, the time to first byte will stay the same.
If you want to improve the response time I see a couple of options:
1) It could make sense to analyze the access patterns on the object level, not the block level.
This way you can determine which blocks are necessary to fill the first 3 TCP packets of a response with actual data (or something like this) and then mark them with a higher priority than the rest of the blocks. This allows different chunk sizes and data structures like the trickle DAG to be considered properly.
2) Probabilistic tiering, where a node pushes a value into its DHT entry indicating its estimated speed. This allows selecting the nodes which currently have the most free bandwidth instead of the ones with the lowest ping (or a more or less random choice).
3) The distributed, DHT-like cache retention (of a portion of the cache) as outlined in a comment on your request (sketched below). It allows guessing which of the peers the gateway is currently connected to is most likely to have the block in its possession. This is pretty important if you run a large cluster of gateways, which can hold different portions of the cache content this way – without any traffic to organize it.
So you just hold connections between your gateway cluster nodes, and data which is requested multiple times across different cluster nodes can be stored semi-permanently on just one node, while blocks with more distance get dropped more quickly.
So effectively every new node in a cluster of gateways will increase the overall cache performance by increasing the cache size and avoiding duplication of the same blocks on different nodes.
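A minimal sketch of that ownership idea, assuming a Kademlia-style XOR distance between a node ID and the block's hash decides which gateway keeps a block semi-permanently and which ones drop it first. Node IDs, the metric and SHA-256 are my assumptions here, not anything that exists in go-ipfs.

```go
package cachesketch

import (
	"bytes"
	"crypto/sha256"
)

// distance is the XOR metric between a node ID and a block's hash,
// compared lexicographically.
func distance(nodeID, blockHash []byte) []byte {
	d := make([]byte, len(blockHash))
	for i := range d {
		d[i] = nodeID[i%len(nodeID)] ^ blockHash[i]
	}
	return d
}

// likelyOwner returns the cluster node that should retain the block
// semi-permanently; the other nodes treat it as droppable first, so the
// cluster spreads its cache without any coordination traffic.
func likelyOwner(nodeIDs [][]byte, cid string) []byte {
	h := sha256.Sum256([]byte(cid))
	var best, bestDist []byte
	for _, id := range nodeIDs {
		d := distance(id, h[:])
		if bestDist == nil || bytes.Compare(d, bestDist) < 0 {
			best, bestDist = id, d
		}
	}
	return best
}
```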
> Did you happen to read the caching paper I linked to in ipfs/go-ipfs#8870?
I could not access it. Can you share a CID of it?
This may be relevant https://github.com/ipfs/boxo/issues/364
> Did you happen to read the caching paper I linked to in ipfs/go-ipfs#8870?
> I could not access it. Can you share a CID of it?
Added the CID in the original comment https://github.com/ipfs/go-ipfs/issues/8870#issuecomment-1112080636
> This may be relevant ipfs/boxo#364
Not sure I'm gonna need that for my implementation. The idea was to just count the space requirements of newly added blocks in the storage and do a tail drop at the same rate.
So there's no need to explicitly invoke garbage collection, as it is running all the time to keep the storage within the set boundaries.
We will just "flag" blocks which are stale, but not remove them from the storage (yet); the decision whether they should be removed is only made when there's pressure. (stale cache)
So between sweeps of the database, 5% of the storage can be rewritten.
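A rough sketch of that tail-drop / stale-flag behaviour, with hypothetical names; the scoring and sorting of entries is left out:

```go
package cachesketch

type cacheEntry struct {
	cid   string
	size  uint64
	score uint32 // lower = closer to eviction
	stale bool
}

type tailDrop struct {
	entries    []*cacheEntry // kept sorted by score, lowest first (sorting omitted)
	staleBytes uint64
}

// onAdd flags roughly `added` bytes worth of the lowest-scored, non-stale
// blocks as stale, so the cache stays within its bounds without a big GC run.
func (t *tailDrop) onAdd(added uint64) {
	var flagged uint64
	for _, e := range t.entries {
		if flagged >= added {
			break
		}
		if !e.stale {
			e.stale = true
			flagged += e.size
			t.staleBytes += e.size
		}
	}
}

// onHit moves a stale block back into the live section of the cache.
func (t *tailDrop) onHit(e *cacheEntry) {
	if e.stale {
		e.stale = false
		t.staleBytes -= e.size
	}
}

// underPressure actually deletes stale blocks until `need` bytes are freed.
func (t *tailDrop) underPressure(need uint64, remove func(cid string)) {
	var freed uint64
	kept := t.entries[:0]
	for _, e := range t.entries {
		if e.stale && freed < need {
			remove(e.cid)
			freed += e.size
			t.staleBytes -= e.size
			continue
		}
		kept = append(kept, e)
	}
	t.entries = kept
}
```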
Short status-update: I'm digging into go right now to see how to implement this "best", meaning low complexity but also very high performance.
> Did you happen to read the caching paper I linked to in ipfs/kubo#8870?
The described algorithm seems to be tailored specifically to mutable data under a static URI; IPFS has immutable data under a static URI.
So ui and ci don't make any sense in our context. I feel like si is similarly useless: while it makes sense to take the size of a document into account on a webserver, where you may have 1 KB HTML files and multi-gigabyte archives or videos, it doesn't make sense at the block level. Sure, IPFS blocks vary in size (that's why I need to keep some track of the size here as well), but I don't use the size as a metric the way LNC-R-W-U does - for obvious reasons: even a 1 MB block transfers much more quickly than a multi-GB file. Latency, DHT discovery and the slow start of the TCP/QUIC connection have a much greater impact on the transfer time than the size.
What fits IPFS block-level storage better is the Multi-Generational LRU (that's what Google engineers recently built for an upcoming version of the Linux kernel [1][2]), and that's basically what this proposal is doing, just with a very large number of generations while keeping the space requirements as low as possible.
[1] https://lwn.net/Articles/856931/
[2] https://www.phoronix.com/scan.php?page=news_item&px=Multigenerational-LRU-v2
Status
Blockers (to do first):
Todo list
Update draft to include
Introduction
Hey guys,
I'm pretty new to this project, but I see an opportunity to contribute a little and learn some Go along the way. I learned C a long time ago, and I'm currently only fluent in Python and shell scripting - so make sure to thoroughly check my contributions.
Automatic cache cleanup
Current approach
There's a garbage collector that is designed to sweep the whole block storage and remove everything which is not part of the MFS and not part of the pins. The garbage collector can be run manually offline, manually online, or enabled to run when the repo reaches a certain percentage of its configured size.
There's no logging of the number of requests for a block, its age, nor how often a block is referenced. A garbage collector run is pretty slow, since it traverses the whole local database to look for pins and their recursive references, as well as the MFS, to make sure it drops only non-referenced objects.
New approach
A background task will be added which will use a database to keep track of references, the number of requests, age, and size of each block.
On first initialization (or with a flag) it will traverse the whole pin/MFS database to create its database (or make sure it's consistent).
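A rough interface sketch of that background task; the method set and names are my own for illustration, not an existing go-ipfs API:

```go
package cachesketch

import "context"

// BlockTracker is the database the background task maintains for every block:
// reference count, request count, age and size.
type BlockTracker interface {
	// Rebuild traverses the pin/MFS roots to (re)create the reference counts,
	// used on first initialization or when consistency cannot be assumed.
	Rebuild(ctx context.Context) error

	// IncRef / DecRef are driven by pin, unpin and MFS operations.
	IncRef(cid string) error
	DecRef(cid string) error

	// RecordRequest bumps the request counter and refreshes the age of a block.
	RecordRequest(cid string) error

	// RecordAdd registers a newly added block and its size.
	RecordAdd(cid string, size uint64) error

	// Flush writes the tracker state to disk so it survives restarts.
	Flush(ctx context.Context) error
}
```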
Goals
(in priority order)
Limitations
(in no particular order)
2^25 / (10 * (0.1 + 0.01) / 2) / 60 / 60 / 24 = 706 days - if zero hits occur, and given that the storage is large enough to hold the block for this timeframe. Objects which exceed this value will be placed in the stale section of the cache, which is capped at 10% relative size by default. Blocks in the stale section will be moved back to the main section when requests hit them - or dropped at any time under cache pressure. (The time limit can be raised by switching to 27 bit or by changing the clock interval - the default is 10 seconds.)
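For reference, the same arithmetic as a tiny Go program; it just evaluates the expression above for the default 25-bit counter and the 27-bit variant mentioned as a knob:

```go
package main

import "fmt"

// lifetimeDays evaluates the expression from the text:
// 2^bits / (10 * (0.1 + 0.01) / 2) / 60 / 60 / 24
func lifetimeDays(bits uint) float64 {
	avg := (0.1 + 0.01) / 2
	return float64(uint64(1)<<bits) / (10 * avg) / 60 / 60 / 24
}

func main() {
	fmt.Printf("25-bit counter: ~%.0f days\n", lifetimeDays(25)) // ~706 days, as stated above
	fmt.Printf("27-bit counter: ~%.0f days\n", lifetimeDays(27)) // four times as long
}
```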
Proposed Features
(in no particular order)
Potential configuration knobs
(all sizes are estimates) (relative: excludes pinned/new blocks)
Potential Datastructure
32-bit integer value:
= 0: New block (no reference information yet)
The dirty flag marks incomplete writes, to recover from them without analyzing the whole data store.
The seldom flag marks if a cache entry is part of the seldom cache. If the reference bit is set, this bit has no function.
The precaching flag marks if a cache entry is part of the precache. If the reference bit is set, this bit has no function.
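One way the 32-bit word could be packed - purely illustrative, since the draft doesn't pin down bit positions. It assumes the 25-bit counter from the aging math above, the four flag bits described here, and 3 bits for the rounded size class:

```go
package cachesketch

const (
	flagReference  = 1 << 31 // block is referenced by a pin or MFS
	flagDirty      = 1 << 30 // incomplete write; allows recovery without a full rescan
	flagSeldom     = 1 << 29 // entry belongs to the seldom cache (ignored if reference is set)
	flagPrecaching = 1 << 28 // entry belongs to the precache (ignored if reference is set)

	counterBits = 25
	counterMask = (1 << counterBits) - 1 // low 25 bits: usage/age counter
	sizeShift   = counterBits
	sizeMask    = 0x7 // 3 bits left over for a rounded size class
)

// counter extracts the 25-bit counter; a metadata word of 0 means
// "new block, no reference information yet".
func counter(meta uint32) uint32 { return meta & counterMask }

// withSizeClass stores a 3-bit size-class index next to the counter.
func withSizeClass(meta uint32, class uint32) uint32 {
	return (meta &^ (sizeMask << sizeShift)) | ((class & sizeMask) << sizeShift)
}
```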
Size estimate values
The size bits will be rounded to the nearest value in this list:
Caching algorithm
A newly added block will be added to the new-block list and gets a timestamp. If the timestamp is older than the latest completed pinning/writing operation and the block wasn't referenced, it will be moved to the cache with the default value. The timestamp is then discarded.
Probability curve
0.1; 0.01 at 0; 0.2 at 2^25 / 2 + 2^25 / 4 (25165824). Exact curve TBD.
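Since the exact curve is still TBD, a generic piecewise-linear interpolation over whatever control points end up being chosen would be enough; the two example points below are just the values recoverable from the draft above:

```go
package cachesketch

// curvePoint maps a counter value to a decrement probability.
type curvePoint struct {
	counter uint32
	prob    float64
}

// decrementProb linearly interpolates between control points; points must be
// sorted by counter. Values outside the range are clamped to the end points.
func decrementProb(points []curvePoint, counter uint32) float64 {
	if len(points) == 0 {
		return 0
	}
	if counter <= points[0].counter {
		return points[0].prob
	}
	for i := 1; i < len(points); i++ {
		if counter <= points[i].counter {
			lo, hi := points[i-1], points[i]
			t := float64(counter-lo.counter) / float64(hi.counter-lo.counter)
			return lo.prob + t*(hi.prob-lo.prob)
		}
	}
	return points[len(points)-1].prob
}

// Example control points taken from the draft (exact curve TBD):
// 0.01 at counter 0, 0.2 at 2^25/2 + 2^25/4 (25165824).
var draftCurve = []curvePoint{
	{counter: 0, prob: 0.01},
	{counter: 25165824, prob: 0.2},
}
```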