gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Reduce the thin client memory and thread consumption #325

Closed poornimag closed 4 years ago

poornimag commented 7 years ago

As of today, bringing up any gfapi client pre-allocates ~25MB at the minimum. This is too large when gfapi is loaded into samba processes/qemu/block etc. We need to reduce the consumption to < 2-3MB.

poornimag commented 7 years ago

Fuse/gfapi client usually allocates ~25-30MB of space (as calculated from the internal memory accounting counters). The memory breakdown is as follows: 12.2MB for iobufs, 9MB allocated by io-stats, 0.8MB by the inode table, etc.

But 25-30MB is not what a user sees when we do a simple top -p of any gfapi/FUSE process. We see the following:

| No. of glfs_init | VIRT  | RSS    |
|------------------|-------|--------|
| 0                | 35MB  | 1.1MB  |
| 1                | 540MB | 10.5MB |
| 2                | 811MB | 15.7MB |
| 3                | 1.2GB | 22.9MB |

To analyse the huge VIRT memory consumed by FUSE/gfapi, let us first look at the memory consumed by a sample multithreaded program [1]. This program just creates 4 threads and allocates 1K bytes of memory in each. The top -p output shows that the VIRT size of this process, when all 4 threads are running, is:

| VIRT  | RSS   | SHR   |
|-------|-------|-------|
| 294MB | 0.4MB | 0.3MB |

294MB is a remarkably high memory consumption for a program that just creates 4 threads and allocates 1K in each. The pmap output (in the attached file) breaks down this VIRT consumption. What we can see is that [anon] mappings account for most of the memory, of which the 65404K (64MB) and 8192K (8MB) [anon] allocations constitute the bulk. After playing around with the thread count and some googling, it turns out that each thread allocates ~72MB of virtual memory at the minimum: a 64MB glibc per-thread malloc arena plus an 8MB default thread stack.
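
The attached test program is not reproduced here, but a minimal program of this kind (my own sketch, not the attachment; the thread count and sleep duration are chosen arbitrarily) is enough to reproduce the observation:

```c
/* Minimal sketch of the kind of test program described above:
 * 4 threads, each allocating 1K, then sleeping so that
 * top -p / pmap can be run against the printed PID. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    (void)arg;
    char *buf = malloc(1024);   /* the only explicit allocation per thread */
    if (buf)
        memset(buf, 0, 1024);
    sleep(60);                  /* keep the thread alive for inspection */
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    int i;

    printf("pid = %d\n", getpid());
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```

Compile with `gcc -pthread` and point `top -p <pid>` and `pmap <pid>` at the printed PID while the threads are sleeping.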

Coming back to the gfapi memory consumption, attached is a sample program that just performs glfs_init()/mount once [2]; here is the top output for the same:

| VIRT  | RSS    | SHR    |
|-------|--------|--------|
| 523MB | 10.3MB | 3.75MB |

The pmap output for the same is in the attached file. As discussed earlier, the memory breakdown of the gfapi process follows the same pattern, with most of the VIRT coming from per-thread [anon] mappings rather than from allocations made by gfapi itself.
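
For reference, a minimal gfapi client along the lines of the attached sample [2] might look like the following; the volume name, server, and log settings here are placeholders, not taken from the attachment:

```c
/* One glfs_init()/mount, then pause so top -p / pmap can inspect it. */
#include <glusterfs/api/glfs.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    glfs_t *fs = glfs_new("testvol");            /* placeholder volume name */
    if (!fs)
        return 1;

    glfs_set_volfile_server(fs, "tcp", "localhost", 24007);
    glfs_set_logging(fs, "/dev/null", 1);

    if (glfs_init(fs) != 0) {                    /* the ~25-30MB is allocated here */
        perror("glfs_init");
        return 1;
    }

    printf("pid = %d, mounted; run top -p / pmap now\n", getpid());
    sleep(60);

    glfs_fini(fs);
    return 0;
}
```

Build it against gfapi with `-lgfapi` and inspect the process with `top -p`/`pmap` while it sleeps.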

Conclusion: We need not really bother much about the high VIRT consumption, as it does not necessarily reflect actual allocations made in the process. But 25-30MB is still a high number when we want to scale to 10000s of clients (meaning ~300GB of RAM), thus the goal with the thin client is to reduce the memory consumption from 25-30MB to 3-5MB (which will need ~50GB for 10000 clients).

poornimag commented 7 years ago

[1] sample_multithread_pgm_and_pamp.txt
[2] glfs_client_and_pmap.txt

Some reference links:
https://utcc.utoronto.ca/~cks/space/blog/linux/LinuxMemoryStats
https://siddhesh.in/posts/malloc-per-thread-arenas-in-glibc.html

gluster-ant commented 6 years ago

A patch https://review.gluster.org/20362 has been posted that references this issue. Commit message: iobuf: Get rid of pre allocated iobuf_pool and use per thread mem pool

poornimag commented 6 years ago

The spec for this is the gfproxy spec [1]. This is one of the sub-issues of gfproxy. There are no user-visible changes or tunables, hence no doc is required? Can the SpecApproved and DocApproved flags be provided for this issue?

[1] https://review.gluster.org/#/c/19735/

ShyamsundarR commented 6 years ago

> The spec for this is the gfproxy spec [1]. This is one of the sub-issues of gfproxy. There are no user-visible changes or tunables, hence no doc is required? Can the SpecApproved and DocApproved flags be provided for this issue?

Providing DocApproved.

SpecApproved means we have the technical details that help us understand how this change is going to get done. Comments 1 and 2 are quite useful in this regard, but I would like more details on how we are going to accomplish the changes, and possibly a list of tasks (as each identified area is going to be tackled in isolation), before I can provide SpecApproved for this issue.

Tagging @amarts as well for his comments.

mykaul commented 6 years ago

Hi @poornimag - do we have any numbers with this patch to compare to the original? Do we have any perf. numbers to ensure we do not have any issues?

poornimag commented 5 years ago

Tests were run by @KritikaDhananjay as mentioned in comment 1; those confirmed there were improvements w.r.t. eliminating lock contention. I also ran iozone and the small_file.py tool locally, and there were no noticeable differences in the numbers with this patch.

poornimag commented 5 years ago

As mentioned in the previous comment:

> Fuse/gfapi client usually allocates ~25-30MB of space (as calculated from the internal memory accounting counters). The memory breakdown is as follows: 12.2MB for iobufs, 9MB allocated by io-stats, 0.8MB by the inode table, etc.

The bigger chunks, like iobuf and io-stats, will be addressed in this issue; the other memory allocations need to be addressed on a per-xlator basis.

iobuf: Instead of pre-allocating the iobuf pool, we use the per-thread mem pool. This mem pool does no pre-allocation, but it should not have a significant perf impact, because the last allocated memory is kept alive for reuse for some time. The worst case is when the requested iobufs are of random sizes each time; the best case is when the iobuf requests are always of the same size.
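
As a rough illustration of the idea (this is my own simplified sketch, not the actual mem-pool code in libglusterfs):

```c
/* Simplified sketch of a per-thread pool: nothing is pre-allocated; each
 * thread keeps its most recently freed buffer around, so repeated requests
 * of the same size are served without going back to malloc. */
#include <stdlib.h>

struct tl_cache {
    void   *buf;
    size_t  size;
};

static __thread struct tl_cache cache;  /* one cache slot per thread */

void *pool_get(size_t size)
{
    if (cache.buf && cache.size >= size) {  /* best case: same-size reuse */
        void *buf = cache.buf;
        cache.buf = NULL;
        return buf;
    }
    return malloc(size);                    /* worst case: sizes keep changing */
}

void pool_put(void *buf, size_t size)
{
    /* Keep the buffer alive for the next request instead of freeing it
     * immediately. */
    free(cache.buf);
    cache.buf  = buf;
    cache.size = size;
}
```

A real implementation would also age out cached buffers after a timeout so that idle threads do not pin memory, which is why the retained memory is only kept alive "for some time".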

io-stats: We need not allocate memory if stats collection is not enabled. And given that per-xlator stats collection is already implemented, I am wondering whether we need to retain this at all.
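
A hypothetical sketch of what lazy allocation could look like here (the structure and field names below are illustrative, not the actual io-stats code):

```c
/* Only allocate the (large) stats block when stats collection is actually
 * turned on, instead of unconditionally at init time. */
#include <stdlib.h>

struct io_stats_block {
    unsigned long long read_ops;
    unsigned long long write_ops;
    /* per-fop latency counters etc. would go here */
};

struct io_stats_private {
    int                    stats_enabled;   /* the enable/disable option */
    struct io_stats_block *stats;           /* NULL until first needed */
};

static struct io_stats_block *get_stats(struct io_stats_private *priv)
{
    if (!priv->stats_enabled)
        return NULL;                        /* disabled: never allocate */
    if (!priv->stats)
        priv->stats = calloc(1, sizeof(*priv->stats));
    return priv->stats;
}
```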

amarts commented 5 years ago

Tagging @amarts as well for his comments.

Going a bit lenient on this one. There are multiple ways of achieving it, and there is no single improvement that would achieve all the required things to resolve the issue.

Considering this issue is going to track improvements on different components across the codebase, going with the SpecApproved flag. Each patch will need to be checked on its merits, and, as usual, each patch is expected to contain a commit message explaining all the changes done in that patch.

stale[bot] commented 4 years ago

Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 4 years ago

Closing this issue as there has been no update since my last update on the issue. If this issue is still valid, feel free to reopen it.