In the mixed file size scenario bandwidth is very underutilized - enterprise setup

mrow4a commented 7 years ago

Setup information

Target server is enterprise scale setup, meaning so called "1 Byte PUT duration" takes relatively long compared to home scale server. This is also server on which "100MB PUT duration" on high-bandwidth client takes relatively faster compared to home scale server.

Scenario 1

Imagine you are just doing your sync after a time of longer inactivity on one of your devices. You used your web interface, mobile and other devices - e.g. work computer - to add your files. You have many small files below 1MB to be uploaded/downloaded, many between 1MB and 5MB, and some over 5MB. Your folder is usualy structured that folders contain usualy files of similar file sizes, with few exceptions. Number of the folders containing changes, is small. Generalizing, typical user case.

Scenario 2

Imagine you are just doing your initial sync on one of the sync clients. Imagine, that your initial sync folder contains 55000 files and 50GB of data. You have around 20000 files under 1MB, 35000 files over 1MB. This is contained in around 500 folders

Actual behaviour

If sync client finds folder with many >1MB files, everything is fine. However, when a sync client visits a folder with many <1MB files, small files PUT takes relatively long, bandwidth is underutilized, since in the same time "raw bytes data" transfer could be performed of other bigger files.

Solution

The solution is a continuation of Request Scheduler PRs on the client: Folder items scheduler - evaluation attribute sorted tree

In the implementation, Request Scheduler will construct 2(3) separate queues (robins) in each of the folders to synchronise:

list of all files under 1MB
list of all files over 1MB
(non files (MKDIR, DELETE.. ) ?)

Propagator, which is class responsible for scheduling items, will contain new value - queueRobin. Directory, which is class responsible for dispatching items within folder that contain changes, will contain new value - minCurrentFilesSize.

Request scheduler will perform round robin on each of the queues, visiting them sequentially within one Directory class:

if scheduler finds in a given directory a file in a queue pointed by robin and the file is not currently in flight, file is processed and robin switched to next robin.
if scheduler finds in a given directory a file in a queue pointed by robin and the file is currently in flight, robin is switched and scheduler will check another queue.
if scheduler does not find in a given directory a file in the following robin pointed queue, it calls scheduleNext, to ask another folder if it has the file
if scheduler does not find a folder which satisfies its robin (e.g. >1MB) because their minCurrentFilesSize's currently are smaller, it will exclude robin from further search based on it. (Only <1MB are left)

Advantages

There would be no difference syncing 1 or 10 files at once.
There would be no difference syncing significant number of files of the same filesize.
Bandwidth will be nicely utilized and sync time reduced, thus during the "server bookkeeping time", some Bytes of another file will be transferred (in-flight), if number of files is significant and their file sizes are mixed.

Above situation will never happen and other bigger files will be transferred (if there are any of course satisfying the robin).
We can close this issue, https://github.com/owncloud/client/issues/5390, because 1MB files will already show nice approximation if they are available or not. (if they are available, transfer of 1MB will show bandwidth of client plus bookkeeping, if they are not, it will show bookeeping only, since 1MB is under 1s for most of the link bandwidths out there on desktop)
This can be very powerful combining with Bundling https://github.com/owncloud/client/pull/5319
This can be very powerful combining with Dynamic Chunking https://github.com/owncloud/client/pull/5368
This can be very powerful combining with Priority Scheduling https://github.com/owncloud/client/pull/5349

Disadvantages

if sync breaks in the middle, and connection is not available for longer time, there could be many folder s with many unfinished jobs on our remote server. Sync is not "Directory wise" synced, meaning directories are not synced one by another, sequentially, but items are synced in general.
can we really sync items from many folders? Do we need to create another queue for not PUT/GET files like MKDIR, DELETE etc, so that it will be executed in all folders first? This of course can be resolved also with help of priority scheduling.

Assumption

Why do you think that this exact strategy would improve bandwidth util. and sync speed? Did you >compare with some performance tests? Please share.

Lets do some maths. This will be our folder to sync: 100 files - average 100kB file size -> total 10MB to be transfered 10 files - average 10MB file size -> total 100MB to be transfered

Assume, that your network is 5MB/s and one "1 Byte PUT" takes 1s (we are not in ideal home case scenario with empty server now guys, it is not that easy, it could take much much more)

The request "latency" consists of 2 components, time it takes for bookkeeping on server and time it takes for data transfer.

Current case

If you do 100 small request in the row, your data transfer is neglible ~0s and bookkeeping time is 100s. Parallelising that maybe you can achieve 33s having 3 parallel flows.

If you do 10 biger files request in a row, your data transfer is 10s bookkeeping. Parallelised say 4s if you are lucky with 3 parallel flows. However, you cannot omit 20s coming from transfer of 100MB, does not matter how many requests you have in parallel 1000 or 1. Your 5MB/s office net bounds you.

Total you need around 33s plus 24s from big, having 57s.

Optimized case

If you do 100 small request in the row, your data transfer is neglible ~0s and bookkeeping time is 100s. Parallelising that maybe you can achieve 50s reserving 2 flow slots. In this 50s, you used neglible bandwidth. If you use the 3rd flow to pump there 30s comming from big files, you just synced your files in 50s, since your 30s are done in parallel filling the bandwidth.

57s vs 50s is like 13% of the time orginal time? In this example it is around 7s. 100 files is not a big deal but shows bigger picture

More big files and more small files, the different is the percentage. Do the math for 55000 files and 55GB. Where you have 40000 of small files in 10GB(avg. size 250kB) and 15000 in the rest 45GB (avg size 3MB). I think we wont be talking about even minutes there :>

Why 1MB?

Look above, tried to find a definition of small file

@hodyroff @ogoffart @jturcotte @guruz @butonic @felixboehm @DeepDiver1975 @cdamken

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/40190729-in-the-mixed-file-size-scenario-bandwidth-is-very-underutilized-enterprise-setup?utm_campaign=plugin&utm_content=tracker%2F216457&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F216457&utm_medium=issues&utm_source=github).

mrow4a commented 7 years ago

First questions I would love to ask:

Is there any advantage of having sync which completes directories one by one? Only I can see right now it is just simple algorithm out there.
@jnweiger Is rsync doing that also "directory" wise or do they take into account other constraints also? As far as I know they are interested only in blocks.
@pmaier1 Do you know how important for the user could be to have sync being completed directory by directory? Does he really care? Taking into account 5s sync and 5h sync. I know this will be very rare case possible from the implementation above, just ask by curiosity.

jnweiger commented 7 years ago

Regarding rsync: There are many options how rsync can be run.

The very common rsync -va only does a trivial size, name and timestamp comparison.
The more expensive rsync -vac would compare checksums of 1k blocks in the file.
This can be expanded to do a fuzzy match with rsync -vacy so that it looks around in the same directory for files that are 'similarily named' in the hope to find needed blocks there.
there is also a --compare-dest=OTHERDIR option which is a hint that helpful blocks are found in files of the OTHERDIR. I've never seen that in action.

mrow4a commented 7 years ago

@hodyroff I understand your thumb down as it is not important if this is directory ordered or not :>

@jnweiger Ok, so it is cross-directory basically, and it is interested in blocks.

felixboehm commented 7 years ago

Explained in simple words what you do: Sync big and small files alternated. Right? Adding complexity and IMHO loss of stability by not syncing folder by folder (see disadvantages).

Why do you think that this exact strategy would improve bandwidth util. and sync speed? Did you compare with some performance tests? Please share.
Why 1MB?
"very powerful with bundling", Why? I thought you transfer always chunks of same size with bundling. So this concept would not be needed at all.

Does Bundling work folder by folder, or did you extend bundling to work across folders?

mrow4a commented 7 years ago

@felixboehm @hodyroff

Explained in simple words what you do: Sync big and small files alternated. Right? Adding complexity and IMHO loss of stability by not syncing folder by folder (see disadvantages).

This is the trade-off yes.

Assumption

Why do you think that this exact strategy would improve bandwidth util. and sync speed? Did you >compare with some performance tests? Please share.

Lets do some maths. This will be our folder to sync: 100 files - average 100kB file size -> total 10MB to be transfered 10 files - average 10MB file size -> total 100MB to be transfered

Assume, that your network is 5MB/s and one "1 Byte PUT" takes 1s (we are not in ideal home case scenario with empty server now guys, it is not that easy, it could take much much more)

The request "latency" consists of 2 components, time it takes for bookkeeping on server and time it takes for data transfer.

Current case

If you do 100 small request in the row, your data transfer is neglible ~0s and bookkeeping time is 100s. Parallelising that maybe you can achieve 33s having 3 parallel flows.

If you do 10 biger files request in a row, your data transfer is 10s bookkeeping. Parallelised say 4s if you are lucky with 3 parallel flows. However, you cannot omit 20s coming from transfer of 100MB, does not matter how many requests you have in parallel 1000 or 1. Your 5MB/s office net bounds you.

Total you need around 33s plus 24s from big, having 57s.

Optimized case

If you do 100 small request in the row, your data transfer is neglible ~0s and bookkeeping time is 100s. Parallelising that maybe you can achieve 50s reserving 2 flow slots. In this 50s, you used neglible bandwidth. If you use the 3rd flow to pump there 30s comming from big files, you just synced your files in 50s, since your 30s are done in parallel filling the bandwidth.

57s vs 50s is like 13% of the time orginal time? In this example it is around 7s. 100 files is not a big deal but shows bigger picture

More big files and more small files, the different is the percentage. Do the math for 55000 files and 55GB. Where you have 40000 of small files in 10GB(avg. size 250kB) and 15000 in the rest 45GB (avg size 3MB). I think we wont be talking about even minutes there :>

Why 1MB?

Look above, tried to find a definition of small file

"very powerful with bundling", Why? I thought you transfer always chunks of same size with bundling. >So this concept would not be needed at all. Does Bundling work folder by folder, or did you extend bundling to work across folders?

Bundling works folder wise. So it will take all files under 10MB (chunk size) and try to bundle them within the specific folder. However, bundling can be in 3 parallel flows. So you will not leave the folder until you finished all files there. Still needs a mechanism to allow bigger files to be pumped in one of the flows, to ensure we cover the bandwidth nicely.

felixboehm commented 7 years ago

Using Bundling, I assume exact same sync times with your scenario

Lets do some maths. This will be our folder to sync: 100 files - average 100kB file size -> total 10MB to be transfered 10 files - average 10MB file size -> total 100MB to be transfered

Please don't mix things here, this is not about parallel flows, keep them out.

mrow4a commented 7 years ago

@felixboehm Not sure what you mean, bundling still needs a bookkeeping, for 100 files it will be still 30s parallelised in 3 flows there, and they will be on the server in 2s, waiting next 30s for response. You just lost 30s doing nothing thinking about bandwidth.

Bundling will help with latency though on enterprise scale

felixboehm commented 7 years ago

Bundling would make the server think once (1s) then sending a bundle of 10MB including 100 files in one request. Thats what I thought.

I would assume that bundling will lead to very much requests being of size CHUNK_SIZE, and thus a dynamic chunk size will be able to perfectly utilize the network.

mrow4a commented 7 years ago

@felixboehm Yes, you just found a case in which Bundling will work like that. This is a very idealised case where all files are inside one folder, not distributed in 10 folders, and that our bundle will fit all 100 files. Woboqs suggest to start bundling with only 10 files.

Furthermore, this is why I said with Bundling it will be even more powerful.

felixboehm commented 7 years ago

Not even more powerful - In my opinion with bundling you don't need to alternate file sizes ... A really good approach would be to bundle files folder by folder, without "folder borders".

Bundling with 10 files? Reason? I expect to bundle always to the chunk size - also with dynamic chunk sizes.

jnweiger commented 7 years ago

Files and path names are an ilusion. only inodes and blocks exist. If you call them chunks or bundles that does not matter.

mrow4a commented 7 years ago

@felixboehm @jnweiger Just giving an alternative. I am just afraid of very long running requests with bundling. Each implementation has its pros and cons. With just bundling you would not however achieve full utilization in some cases. Combining two you are nearly sure.

jnweiger commented 7 years ago

@felixboehm as far as I remember, the idea about alternating file sizes was to avoid starvation of small files, while one big video fills all the chunks. What exact strategy is best, only time will tell. The important part is that we implement an API, where we can apply tuning parameters. E.g. Allow priorization based on X.Y.Z. Maybe it turns out, that a shortest job first strategy wins, maybe other users prefer randomized queuing. Maybe it is youngest file first. We should not finialize the strategy now.

Same for dynamic chunk sizes. Maybe there is a clever algorithm so that client and server negotiate the perfect chunk size, maybe the client simply ramps up the size, until the server chokes, then stays clear below that limit, with occasional tests, to see if smaller or bigger sizes result in improvement. Maybe a sysadmin observes a system under load, and simply defines the sizes that work best on monday morning. We cannot tell.

pmaier1 commented 7 years ago

@pmaier1 Do you know how important for the user could be to have sync being completed directory by directory? Does he really care? Taking into account 5s sync and 5h sync. I know this will be very rare case possible from the implementation above, just ask by curiosity.

Well, I think it is the natural expectation that file sync works directory-wise and not in a way that is perceived somehow arbitrary. Anyway in most cases the user probably won't even notice or care. Would the 'connection loss' case cause actual issues or would it just continue syncing when the connection is available again (maybe with some overhead)? Can anyone imagine another scenario where it would be really bad behavior from a UX POV not to sync directory-wise?

What I really like is what @jnweiger said. Would be great to have 'buttons' to play with and run performance tests to see which parameter combinations deliver best results. This could in the end lead to an intelligent algorithm that adjusts parameters to always have the best possible performance for specific scenarios. Anyway, I think real performance tests are needed for this.

felixboehm commented 7 years ago

Ok, so I consider this and the related PR: https://github.com/owncloud/client/pull/5349 as alternative strategy and not needed. Agreed?

We need to focus to the existing strategy of bundling (also across folder borders) and dynamic chunk sizes for network utilization, ... From there, we can only improve based on load / performance tests showing issues. @mrow4a @guruz

mrow4a commented 7 years ago

@felixboehm Why this and #5349 is an alternative strategy for Bundling? These are completely separate concepts which could live separately.

I see a conceptual difference between Prioritizing Items, Trying to fill bandwidth using other files, and bundling which reduces latency influence.

I said giving an alternative, but bundling does not solve all the problems, and introduces new. Its trade-off everywhere. Bundling indeed can sometimes fill the bandwidth, but very offen will not because of bookkeeping. It is lottery.

mrow4a commented 7 years ago

BTW complexity in the code is increased because it is now monolitic around Propagator and Directory classes, I can propose new PR which will separate syncing jobs from data transfer jobs. I will create 2 black boxes and lets discuss.

mrow4a commented 7 years ago

BTW, do you want performance/load test showing a context from this issue? Please ask @davidjericho for an account, sync all your private files, and observe. And lets back to the topic here.

ckamm commented 7 years ago

@mrow4a @felixboehm

The question here seems to be whether the time needed for bundled transfers is still server-processing dominated (like individual small files are) or whether their runtime is dominated by data-transfer time.

If it still is dominated by server processing (and @mrow4a seems to assume it is), having a large-file operation running at the same time could reduce total sync time.

Do we have data about this? How long do 100x 1-byte file uploads take compared to a bundled upload of 100 1-byte files?

davidjericho commented 7 years ago

@ckamm we've been experimenting with this on AARNet some time, and we're very aware of it due to the nature of our continental network and our spread user base. Elimating any file over 1MB (as we do have a large number of ADSL users in Australia), as of this instant the mean is 2.6s for any given file assuming time to transfer is not the issue.

In further experiments, using our own bundling with disassembly of the bundle on the server in our own php code path, we're averaging 30 - 40 MByte/s of 10kb files over a gigabit link and using 100MB bundles, and then calling ownCloud's files scan function to update the filecache table after the files have been placed on disk.

We had a huge ingest from a user about an hour ago, that bumped the mean time for any given file to 5.2 seconds not including transfer-over-the-wire time.

In contrast, I routinely upload large files into the system at over 3.2Gbps, which is the present per thread TLS capabilty of our layer 7 offload servers.

Add to that the cost of TLS round trips over continental scale latencies, and as far as I'm concerned, awareness of latency, server response time, and anything that can be done to make the service respond quicker to these sorts of queries is incredibly important. It's enough that I have an employee researching this full time so we can figure out how we address it ourselves.

mrow4a commented 7 years ago

@ckamm @felixboehm

ANY performance test you do will never be precise, it just can give engineers understanding how it works.

If you start selling that it improves something in general showing fancy graphs, you are crazy. If you however test it on someones setup, and show-case it, then you clearly win.

However, please mind that every performance test is bound to its test setup!

The below graph shows how it will behave on home-styled service without enterprise improvements, on 1ms latency. In server PR I have shown what you can squeze there thinking about server resources https://github.com/owncloud/client/pull/5319

I did a test for ownCloud conference about that, testing for upload.

jnweiger commented 7 years ago

Number of files per second is meaningless on its own. (Because files are just an illusion...) What really matters is bytes/sec over the wire. The above graph has some relevance, as it was done with a constant file size of 100 bytes per file. 60 files/sec equals 6000 bytes/sec in this case.

Do we know the average file size (plus standard deviation of file size) in different user scenarios? Does bundling have the same performance gain with 1000 bytes per file or 10.000 bytes per file?

mrow4a commented 7 years ago

I think we are now offtopic from the issue in the parent post.

felixboehm commented 7 years ago

@mrow4a Please add details on your test results.

Which apps?
How did you test, give me chance to reproduce this.
- files per filesize, link to test, ...
You interpretation: Bundling improves s.th. ? Or is there more?

If you start selling that it improves something in general showing fancy graphs, you are crazy.

Don't sell me anything without testing! Actually you proposed to do find design decisions based on testing on the conference, right?

jnweiger commented 7 years ago

I did not study the initial setup and scenarios in full. If that counts as off-topic, please split the issue. I would not think that is needed, but may help to reduce complexity here. It is an abstract case anyway and everybody wants to make it more concrete, it seems.

mrow4a commented 7 years ago

Which apps? How did you test, give me chance to reproduce this.

files per filesize, link to test, ...

You interpretation: Bundling improves s.th. ? Or is there more?

I think this discussion belongs to discussion here: https://github.com/owncloud/core/pull/25760.

Don't sell me anything without testing! Actually you proposed to do find design decisions based on >testing on the conference, right?

This is right! Further details about bundling I have in presentations which I can provide, also show-case it on our local setup internally. However for the conclusive decision we cannot rely only on the data based on our local server setup in the data-center in the office. I need to compare it with http2 in real enterprise scenario to get some more conclusions. This work is in progress. The results above just gave me an insight to identify where I could search for bottlenecks.

I did not study the initial setup and scenarios in full. If that counts as off-topic, please split the issue. I >would not think that is needed, but may help to reduce complexity here. It is an abstract case anyway >and everybody wants to make it more concrete, it seems.

As above. However, this issue topic is to help pump in in between request without payload requests which have relatively bigger payload, in order to levarage bandwidth fully. Feature called bundling is only partly related here.

This is why I mentioned that to reproduce the issue above, you should just take any enterprise setup with milions of files/shares and many many users, and sync a series of small files -> 100B-5kB, and some bigger files in other folders, or even within the same folder. You should see in any network analyser what I tried to explain here. @ckamm gave a good short version of the problem

mrow4a commented 7 years ago

@felixboehm Have been looking in the code, also doing https://github.com/owncloud/client/pull/5406, and I think this is possible that using Bundling we can achieve similar effect as discussed in first post. Bundling just cannot by default try to parallelise itself to cover all available flows.

jturcotte commented 7 years ago

Bundling would already help making the PHP/WebDAV overhead smaller. What I don't understand is why we accept that 5.2 seconds is a normal request processing delay for one file, basically just to write metadata into a database while Google can search the whole Internet in 0.5 seconds.

Is there really nothing we can do there on the server configuration or code? It sounds like writing the metadata in a json file per user would be faster than the database at that point.

mrow4a commented 7 years ago

This is totaly true, but this is how database is currently designed. This is also why EOS is blazing fast, it is namespace based and stores metadata in the key-value datastore.

BTW: for many files/shares/external storages per user your JSON will take ages

felixboehm commented 7 years ago

I don't see anything near 5.2s per file on the server. This is clearly not an assumption to calculate with or optimize for...

I strongly dislike alternating file sizes. First I want to see bundling in action, performance with bundling in detail. Then take the next step to solve remaining issues.

mrow4a commented 7 years ago

I don't see anything near 5.2s per file on the server. This is clearly not an assumption to calculate with or >optimize for...

My above example is for 1s per file.

I strongly dislike alternating file sizes. First I want to see bundling in action, performance with bundling in detail. Then take the next step to solve remaining issues.

Ok! It will be together with http2 probably.

felixboehm commented 7 years ago

But I very like your work, ideas and prototypes!!

pmaier1 commented 7 years ago

Let's get bundling into the client first and see then if alternating file sizes has a performance impact that can be proven. Then scope the possibilities to have intelligent algorithms to dynamically set parameters for chunk sizes and/or alternating file sizes.

owncloud / client