ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Sharded directory fetching is unusably slow #4908

Closed ajbouh closed 5 years ago

ajbouh commented 6 years ago

Version information:

0.4.15-dev

Type:

Bug/performance issue

Description:

More context is available over in https://github.com/tesserai/iptf/issues/2

I'm trying to get reasonable performance for just listing the names of entries in a sharded directory that's not yet cached locally. This operation currently takes hours. With @Stebalien's help I've been able to determine that it's only requesting one hash at a time (as indicated by ipfs bitswap wantlist).

It seems like IPFS should be requesting more than one block at a time in this scenario. I'm creating a separate issue to track this performance problem separately from others.

child of #5487

kevina commented 6 years ago

This seems like an easy enough fix, so I will look into it. If someone beats me to it, please remove my assignment.

Stebalien commented 6 years ago

@kevina have fun :smile:. Unfortunately, it's actually a bit frustrating. Parallelizing the fetching of all the children of a single node is simple; however, many of the nodes deep in sharded directory trees only have a few children, so the speedup is a bit depressing.

At the end of the day, it becomes a memory/parallelism + throughput/latency tradeoff.

ajbouh commented 6 years ago

But parallelization of children nodes should address the scenario with 1e6 children at a single level, right?

Stebalien commented 6 years ago

@ajbouh due to sharding, we have at most 256 children at each level. Fetching 256 at a time is great; however, many of the deeper (partially filled) nodes in the tree end up with 5-10 children.

ajbouh commented 6 years ago

But right now we fetch only one at a time, so isn't that a 5-256x improvement? Initial fetch of ImageNet took hours...

Stebalien commented 6 years ago

@ajbouh in practice, more like 4x. Definitely an improvement but we can do much better.

kevina commented 6 years ago

Yeah, we need to be reading the blocks as they come in from the network and then fetching any other needed blocks in parallel. This should be possible, but I have not looked into the code yet. However, we would need to limit the number of requests fetched in parallel somehow.

@Stebalien do you have some good test hashes?

Stebalien commented 6 years ago

@kevina I just created a large directory with tiny files locally and tested with iptb. I find that's generally the best way to make a reproducible test.

kevina commented 6 years ago

@Stebalien where did the 4x number come from?

Stebalien commented 6 years ago

@kevina most of the shards had few directories and we'd wait until we'd downloaded all of them before moving on. This gives us a sawtooth pattern where we were often only downloading a few stragglers.

kevina commented 6 years ago

@ajbouh #4979 should help significantly

ajbouh commented 6 years ago

@kevina excellent! Have you tried to ls the ImageNet CID with this change?

kevina commented 6 years ago

Yes. That directory is huge, and even with batching there are still a huge number of network requests, so I have not let it complete. I am doing that now and will report back, but I encourage you to try it out also.

ajbouh commented 6 years ago

Is someone tracking the optimization work needed to get this ls operation to work in a reasonable amount of time? I'm not talking about doing a get... Just an ls...?

kevina commented 6 years ago

@ajbouh it just finished; it completed in around 30 minutes, which is not great but better. There are around 1281167 entries consisting of around 112220 blocks. That's a lot of blocks to retrieve, so I am not sure how much better we can do. The PR retrieves the blocks in batch sizes up to 320 (see the code for the reason for this number), and it seemed to be taxing the resources on my machine, so I am not sure how much larger I want to make this number.

ajbouh commented 6 years ago

@kevina I'm not sure we're talking about the same CID here. I'm talking about one with ~10^6 entries? Is it easy to determine how many bytes are required to represent the sharded directory?

It seems we should expect it to go as fast as an ipfs get of a file of the same size, no?

kevina commented 6 years ago

I am testing:

ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1

My initial numbers were wrong, so I updated the count.

It is not the size that is important but the number of blocks that need to be retrieved. With HAMT sharding of a directory object, the block size is likely to be smaller than with normal sharding of a file, which is broken up into equal-size segments (I forget the exact number, but I think it's around 43k).

ajbouh commented 6 years ago

I see, so perhaps sharded directories just aren't designed for this use case and we should be thinking about using something else?

We need to be able to quickly enumerate all entries so we can decide which to fetch next. Perhaps a single manifest file with a known name is the easiest way to accomplish this?

kevina commented 6 years ago

@ajbouh perhaps, however the number of blocks required is also really high.

@whyrusleeping @Stebalien thoughts?

whyrusleeping commented 6 years ago

Investigating...

whyrusleeping commented 6 years ago

@kevina's code looks reasonable. We probably want to combine that with bitswap sessions and a higher bitswap activeWants count. Once fetch concurrency is no longer the issue, there are other optimizations to look at, namely requester-side batching of blocks that we receive. Right now every block we get through bitswap gets put to the datastore individually; batching those together could yield significant improvements.

In any case, @ajbouh do you need the entire list of names for your operation? Listing 10 million directory entries is going to be slow (on the order of tens of seconds) unless we work some fancy caching magic. Maybe there's a better way we can query this information?

ajbouh commented 6 years ago

Yes, I need to stream through all entries in a directory, batching, sampling and shuffling them in a consistent and user-specifiable manner.

@whyrusleeping what are you thinking the primary bottleneck is?

If we're talking about 1M entries that each need about 100 bytes, that's only a 100MB total download. This seems like something that we should be able to do in 10 seconds or less on a fast connection. If it's already on the local disk it should be even faster.

What am I missing here?

Stebalien commented 6 years ago

@kevina were you using iptb on a separate network when you tested that (i.e., would bitswap sessions have affected it)?

kevina commented 6 years ago

@Stebalien I was not even using iptb, just testing it from my computer.

Stebalien commented 6 years ago

@kevina could you run a quick test with iptb? That'll tell us how much bitswap sessions would help and how much, e.g., network latency/bandwidth affect it.

kevina commented 6 years ago

@Stebalien can you be a little more specific on what different combinations you want to test?

kevina commented 6 years ago

@Stebalien okay, I tested commit 3f79eab844eaeb1a05332fec59166c3e6e51cb39 in PR #4979, as I think you wanted. I started an iptb testbed with just 2 nodes, connected them, and then ran

time ./iptb run 0 ipfs ls --resolve-type=false QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1 > /dev/null

It took 14m40s.

The second ipfs node in the cluster already has the hash and all the parent shards.

whyrusleeping commented 6 years ago

@kevina how long does it take to run that when you already have all the blocks?

Also, are these nodes using badger or flatfs?

kevina commented 6 years ago

@whyrusleeping it takes 28s when the node already has all the blocks.

It is using the default, which is flatfs. I will look into using badger instead.

kevina commented 6 years ago

Using badger instead: 1m36s.

Using flatfs with sync set to false: 1m20s.

So it's not badger vs flatfs; it's the sync option in flatfs that is doing it, since it forces a call to fsync after each key is added.

whyrusleeping commented 6 years ago

@kevina and the operation on badger when all blocks are local?

kevina commented 6 years ago

It takes 26s on badger when all blocks are local.

kevina commented 6 years ago

Note that with all these tests, only the node where ipfs ls was called was switched between badger and flatfs; the other node stayed with flatfs. I doubt it made much of a difference (maybe a few seconds), but if it's important enough I can test this tomorrow.

Also, as a side note: ipfs repo gc seems to be an order of magnitude slower with badger than with flatfs, so it makes cleaning up and redoing the tests time-consuming.

ajbouh commented 5 years ago

@stebalien @whyrusleeping @kevina what's the latest on this thread? Thanks for all your efforts so far!

whyrusleeping commented 5 years ago

@ajbouh sorry for dropping this thread, thanks for the ping.

If we're talking about 1M entries that each need about 100 bytes, that's only a 100MB total download. This seems like something that we should be able to do in 10 seconds or less on a fast connection. If it's already on the local disk it should be even faster.

What am I missing here?

In terms of raw data, yes, that's all that's needed, and if each of these entries was precomputed, formatted, and stored in a single file, it would be trivial to stream that over at line rate. But with ipfs (and other filesystems) this data is stored in a graph, with its own format. Each object has to be parsed and read, and from there other objects get loaded and parsed, and so on. So it's not quite as simple as just having 100MB of data to stream back. That said, we should definitely still be able to get some really fast speeds on this command; it will just take tweaking.

One thing to make sure of here, you don't expect the entries to be sorted, do you?

whyrusleeping commented 5 years ago

@ajbouh it would also be very helpful if you could give us some target to hit that would be acceptable, like "be able to ls this directory over LAN in under 10 seconds" or something. Something we can easily measure progress against, ya know?

ajbouh commented 5 years ago

@whyrusleeping Good point about making things easy to measure.

Being able to ls the directory (with 10^6 files in it) over LAN in under 10 seconds is a great starting point. As is < 1 second to see the first entry in the directory.

What else can I provide to help?

whyrusleeping commented 5 years ago

@ajbouh Any other nicely measurable perf requirements you can think of are definitely appreciated, but I think this is enough to go on.

Things to note: it may be easiest to make a separate ipfs fast-ls command, or add a streaming option to ipfs ls. It currently blocks until it has all the entries, and then prints them out.

ajbouh commented 5 years ago

Yeah, I think I'm using the streaming API under the hood.

For context: this is part of a larger goal to train a state of the art machine learning model from your laptop with Google's TPUs.

I would much rather use IPFS for this, as using cloud storage makes working with open source folks very difficult. It also makes working from your laptop much harder.

TPUs are approximately $1 for 10 minutes of use. Getting the overhead of data loading/fetching to be just a few seconds is absolutely critical. For clarity, cloud storage has essentially zero up-front overhead for already-hosted datasets.

Looking forward to getting this figured out!

whyrusleeping commented 5 years ago

@ajbouh that's really cool! Let's get this train moving then :)

Yeah, I think I'm using the streaming API under the hood.

Unless you're running custom ipfs code, I don't think you're getting what you think you are. In ls here: https://github.com/ipfs/go-ipfs/blob/master/core/commands/ls.go#L170 it collects all the results and then outputs them all at once. I threw together a quick PoC of a fully streaming ls command here: https://github.com/ipfs/go-ipfs/compare/hack/fastls?expand=1 We should think about how to integrate that properly.

ajbouh commented 5 years ago

Correction, not using the streaming API just yet, but we are using custom code.

https://github.com/tesserai/iptf/issues/3#L143

That said, TensorFlow's own directory listing logic is not streaming, so some creativity will be required on my part for some operations: https://github.com/tensorflow/tensorflow/blob/e7f158858479400f17a1b6351e9827e3aa83e7ff/tensorflow/core/platform/file_system.h#L116

Agreed on getting the train moving!

Based on other threads, it seems like badger isn't a short term option. Who has the baton for this right now?

eingenito commented 5 years ago

@ajbouh @hannahhoward has picked this back up. See the linked issue for more details.

hannahhoward commented 5 years ago

@ajbouh I am not sure if we've cut a new release since https://github.com/ipfs/go-unixfs/pull/19 was merged but I'd be curious to hear how this affects your performance

ajbouh commented 5 years ago

Thanks for the ping, @hannahhoward

I am also curious about the performance but have not tried a recent build myself. Have you tried the operations I referenced in https://github.com/tesserai/iptf/issues/2?

They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1

Stebalien commented 5 years ago

@hannahhoward we haven't.

eingenito commented 5 years ago

I believe this has been addressed in 0.4.18. Please reopen if needed.

ajbouh commented 5 years ago

Hi @eingenito! Have you tried the operations I referenced in tesserai/iptf#2?

They were with the CID QmXNHWdf9qr7A67FZQTFVb6Nr1Vfp4Ct3HXLgthGG61qy1

Stebalien commented 5 years ago

The first part was addressed in 0.4.18 but the second part is the sessions improvements that'll land in 0.4.19. Hopefully that'll be sufficient.

ajbouh commented 5 years ago

Will the session stuff address the duplicated blocks issue?

Stebalien commented 5 years ago

Significantly but it's still far from perfect.