stackhpc / reductionist-rs

S3 Active Storage server
Apache License 2.0

PyActiveStorage and Reductionist: parallel performance, general aspects #89

Open - valeriupredoi opened this issue 6 months ago

valeriupredoi commented 6 months ago

Hi @markgoddard, how's things, mate? Got a very quick question, and thought I'd open an issue since it may benefit from a discussion if the answer is not a simple "no". Can we run Reductionist on multiple threads, with the maximum number of threads specified in the request_data dict? Funny thing is, I actually added num_threads: 100 to the request_data, and our chunks are being analyzed on average 2.5x faster than in normal situations, but I get this message right at the very end of the run, before getting the result:

Traceback (most recent call last):
  File "/home/valeriu/run_gold_test_bigfile.py", line 46, in <module>
    gold_test()
  File "/home/valeriu/run_gold_test_bigfile.py", line 40, in gold_test
    result = active[:]
             ~~~~~~^^^
  File "/home/valeriu/PyActiveStorage/activestorage/active.py", line 158, in __getitem__
    return self._via_kerchunk(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valeriu/PyActiveStorage/activestorage/active.py", line 300, in _via_kerchunk
    return self._get_selection(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valeriu/PyActiveStorage/activestorage/active.py", line 348, in _get_selection
    return self._from_storage(stripped_indexer, drop_axes, out_shape,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valeriu/PyActiveStorage/activestorage/active.py", line 397, in _from_storage
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/home/valeriu/miniconda3/envs/pyactive/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/valeriu/miniconda3/envs/pyactive/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/valeriu/miniconda3/envs/pyactive/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valeriu/PyActiveStorage/activestorage/active.py", line 520, in _process_chunk
    tmp, count = reductionist.reduce_chunk(session,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/valeriu/PyActiveStorage/activestorage/reductionist.py", line 75, in reduce_chunk
    decode_and_raise_error(response)
  File "/home/valeriu/PyActiveStorage/activestorage/reductionist.py", line 203, in decode_and_raise_error
    raise ReductionistError(response.status_code, error)
activestorage.reductionist.ReductionistError: Reductionist error: HTTP 400: {"error": {"message": "request data is not valid", "caused_by": ["Failed to deserialize the JSON body into the target type", "num_threads: unknown field `num_threads`, expected one of `source`, `bucket`, `object`, `dtype`, `byte_order`, `offset`, `size`, `shape`, `order`, `selection`, `compression`, `filters`, `missing` at line 1 column 214"]}}
Command exited with non-zero status 1

Cheers very much in advance :beer:

markgoddard commented 6 months ago

Hi @valeriupredoi, nice to see you're trying to drive the performance up.

The short answer is that the exception is correct. You can't pass num_threads to Reductionist and the request will fail if you do. The API is described here. If you're still getting the correct result (and in less time), then I can only assume you're falling back to a local calculation.
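
For reference, a request body restricted to the fields the error message lists as valid might look roughly like this (a sketch with illustrative values, not taken from your run):

# Sketch of a request_data dict using only the fields the API accepts,
# per the field list in the HTTP 400 error above. All values are illustrative.
request_data = {
    "source": "https://s3.example.org",  # hypothetical S3 endpoint
    "bucket": "my-bucket",               # hypothetical bucket and object names
    "object": "data/chunk-0001",
    "dtype": "float32",
    "byte_order": "little",
    "offset": 0,
    "size": 71680,                       # number of bytes to read
    "shape": [20, 32, 28],
    "order": "C",
    "selection": [[0, 20, 1], [0, 32, 1], [0, 28, 1]],
    # "compression", "filters" and "missing" are the remaining optional fields
}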

The long answer is:

Reductionist uses the Tokio asynchronous runtime rather than threads. Each individual request is processed sequentially, but many requests may be processed concurrently. So it's important that the client is able to send many requests to the server concurrently in order to see good performance. I added parallel request support to PyActiveStorage some time ago, but it would be good to confirm that it's working.
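
As a rough illustration of the client side of that, the fan-out can be as simple as a thread pool issuing one HTTP request per chunk, which is broadly what the traceback above shows PyActiveStorage doing (a minimal sketch; reduce_chunk is a stand-in for the real call in activestorage/reductionist.py):

from concurrent.futures import ThreadPoolExecutor

def reduce_chunk(session, url, chunk):
    """Stand-in for the real per-chunk request in activestorage/reductionist.py."""
    ...

def reduce_all(session, url, chunks, max_workers=32):
    # One task per chunk; the pool keeps up to max_workers requests in flight.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(reduce_chunk, session, url, chunk)
                   for chunk in chunks]
        return [future.result() for future in futures]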

In general with async runtimes, a problem can be too much concurrency: the server may accept too many requests and exhaust resources (e.g. memory) trying to fulfill them. I did some work on limiting the concurrency in a controlled way, but it is not enabled by default.
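
On the client side, one simple way to keep that concurrency bounded is a semaphore around each per-chunk request (again just a sketch - the server-side resource management mentioned above lives in Reductionist itself, and MAX_IN_FLIGHT is an illustrative limit, not a Reductionist setting):

import threading

MAX_IN_FLIGHT = 16  # illustrative limit, not a Reductionist setting
_in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def reduce_chunk_limited(session, url, chunk):
    # Blocks here once MAX_IN_FLIGHT requests are already outstanding.
    with _in_flight:
        return reduce_chunk(session, url, chunk)  # stand-in for the real call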

I'm happy to help you look into the performance if you think it's not working well.

bnlawrence commented 5 months ago

Well, the one thing I would say is that we can't be falling back to a local calculation. I don't know what caused these results, but it wasn't that!

valeriupredoi commented 5 months ago

Hi @markgoddard - great explanation, very many thanks! Since I am here, let me pick your brains when you have a bit of time to spare. Consider two cases: one extreme with respect to HDF5 chunking, and another fairly standard - let me explain (also have a look at the detailed test results we put together at https://github.com/valeriupredoi/testing_PyActiveStorage)

Extreme HDF5 chunking: 3400 tiny chunks (about 70 kB each)

We (PyActiveStorage/Active with Reductionist) use 150 threads to select, process, and eventually send all these 3400 tiny chunks to Reductionist; here is the timeseries of how long it takes Reductionist to process each of these chunks (note, of course, that time is measured via Active, i.e. the time for response = request(session, url, request_data) to arrive, so it includes network time to and from the remote Reductionist server, but that should be tiny - order milliseconds at worst):

[figure: ReductionistTimeseries3400Chunks150Threads]

and the same timeseries, but with Active running just one thread:

[figure: ReductionistTimeseries3400Chunks1Thread]

So there is a factor of 5-10 variability in Reductionist's time to process a tiny 70-odd kB chunk, depending on how much load it receives (and this is just from one client!). In fairness, a file with 3400 such tiny chunks is silly, but it serves as an exercise in testing the limits of our pipeline.
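
For clarity, the per-chunk times above come from timing the full request/response round trip, along the lines of this sketch (request stands in for the session.post call inside activestorage/reductionist.py):

import time

def timed_request(session, url, request_data):
    # Wall-clock time for the full round trip, so it includes the (small)
    # network time to and from the remote Reductionist server.
    start = time.perf_counter()
    response = request(session, url, request_data)  # stand-in for the real call
    elapsed = time.perf_counter() - start
    return response, elapsed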

(I'll post the stuff about the other file, the 64-chunk one, in the next comment so we don't overextend comments) :beer:

valeriupredoi commented 5 months ago

Decent chunking: 64 chunks (about 4 MB each)

Here things are relatively better in terms of how long it takes Reductionist to analyze one chunk, i.e. there is no such distinct separation between populations of times; but even so, running Active with 150 threads is very bad for Reductionist. Here is the plot of Reductionist time to process one chunk vs chunk size, when running with 150 threads:

[figure: ReductionistTimesvsChunkSize150Threads]

and the same sort of plot, but with 1 thread, i.e. a serial run:

[figure: ReductionistTimesvsChunkSize1Thread]

Here too one sees roughly a factor of 6 variability in the average Reductionist time per chunk, depending on how much load we send it at any given time.

Note that in the case of a serial Active run, it takes about the same 0.2-0.3 s for Reductionist to process either a 70 kB chunk or a 4 MB chunk - which is very cool - but we need to think of ways to optimize its parallel processing since, as it is now, I don't think it's ready for a serious thrashing by a number of clients, each running silly small-chunked data (OK, not 70 kB, but 1 MB default HDF5 chunks haha). What do you think? :beer:

valeriupredoi commented 5 months ago

just to be clear: we deployed Reductionist stock (the version from a month ago) - no tweaks or customization, so we (I) don't mess up anything I don't know how to fix :grin:

markgoddard commented 5 months ago

Hi @valeriupredoi, I've read through the milestone report and the data you've shared here. It's certainly interesting to see it used and the performance analysed. The bimodal chunk processing time distribution is very interesting, and definitely worth digging into further.

I think overall though it's definitely worth considering what we're trying to optimise for here. My assumption was something like this:

Consider a simplified setup where the server processes requests sequentially. If the client concurrently sends 100 requests, the server will process them one by one, returning responses when ready. In this case the request latency seen on the server is proportional to the position of the request in the queue, giving a roughly linear plot. This is actually what we see in this graph.

Adding Reductionist's async processing into the setup makes it harder to reason about, especially without the resource management enabled. However, we might simplify it as adding some parallelism into the queue processing. This would result in another linear plot with a different gradient.
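
As a toy illustration of both cases (purely a model, not Reductionist code): with a per-chunk service time t and k-way parallelism, the i-th of N simultaneous requests completes after roughly ceil(i / k) * t, so the observed latencies fall on a line with gradient t / k:

import math

def latencies(n_requests, t_chunk, parallelism):
    # The i-th response arrives once all earlier "slots" have been used:
    # latency grows linearly with queue position, gradient t_chunk / parallelism.
    return [math.ceil(i / parallelism) * t_chunk
            for i in range(1, n_requests + 1)]

print(latencies(100, 0.25, 1)[-1])   # sequential: last response after ~25 s
print(latencies(100, 0.25, 8)[-1])   # 8-way parallel: ~3.25 s, shallower line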

If we consider the 3400 chunk plot, one key difference here is that the number of chunks exceeds the number of threads. So there is a new behaviour here that we did not see before - Python threads finishing their chunk, then picking up another. I wonder whether this accounts for some of the bimodal nature?

valeriupredoi commented 5 months ago

Hi @markgoddard - hope you had a good Easter break! Sorry for the delay in responding, we all took a few days off around the time J. H. Christ decided to get on an American Airlines flight and take off on his annual trip :grin:

Very good points! We are now almost ready to test the following setup: @bnlawrence has created two more virtual machines (accessible via the main VM that we've been testing with the current remote Reductionist; each of these new VMs has a backend IP address, but the front-end/world-facing IP is the same as the older VM's), so we are planning on deploying Reductionist on those two new ones, via a slightly adapted Ansible deployment. That way, we are hoping the chunks sent by the client will be distributed among the now three Reductionists. I am not Ansible-literate myself (Bryan is, though), and wanted to ask you: how can I modify the inventory configuration so the deployment happens on two or more machines at the same time, please? I.e. this file: https://github.com/stackhpc/reductionist-rs/blob/main/deployment/inventory

Cheers muchos :beers:

bnlawrence commented 5 months ago

To be specific, the configuration is that we have three nodes, one of which has a public IP address. So active is the public-facing node (with two IP addresses), with active2 and active3 only having one each. They can all see the S3 server, but we want the public IP address to be the haproxy address.

markgoddard commented 5 months ago

Hi @valeriupredoi & @bnlawrence. Please see the example in the deployment docs. In your case it might look like this:

[haproxy]
active

[jaeger]
active

[prometheus]
active

[reductionist]
active
active[2:3]

[step:children]
reductionist

[step-ca]
active

[docker:children]
haproxy
jaeger
minio
prometheus
reductionist
step-ca

bnlawrence commented 5 months ago

Thanks. But I think the issue is that we weren't sure what was going on with the entries that look like:

[haproxy]
localhost ansible_connection=local

what is the ansible_connection doing there?

markgoddard commented 5 months ago

The example inventory is for the simple case where you are running ansible on the same host that you want to deploy reductionist on. ansible_connection=local is used to force use of a local shell rather than SSH for simplicity.

If you're using multiple hosts then you shouldn't use ansible_connection=local for any remote hosts. If the hostnames are not resolvable you can set ansible_host= for them in the inventory. You'd need to specify each host on its own line in order to do that, rather than using the active[2:3] shorthand.

valeriupredoi commented 5 months ago

Hi @markgoddard, many thanks for the pointers :beer: I am hitting an issue I cannot shake off, with either a new inventory or the original inventory file:

[vpredoi@activeh ~]$ ansible-playbook -i reductionist-rs/deployment/inventory reductionist-rs/deployment/site.yml -e 'ansible_python_interpreter=/usr/bin/python3' -vvv
ansible-playbook [core 2.15.10]
  config file = None
  configured module search path = ['/home/vpredoi/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/vpredoi/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /home/vpredoi/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/vpredoi/.local/bin/ansible-playbook
  python version = 3.9.16 (main, May 29 2023, 00:00:00) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] (/usr/bin/python3)
  jinja version = 3.1.3
  libyaml = True
No config file found; using defaults
host_list declined parsing /home/vpredoi/reductionist-rs/deployment/inventory as it did not pass its verify_file() method
script declined parsing /home/vpredoi/reductionist-rs/deployment/inventory as it did not pass its verify_file() method
auto declined parsing /home/vpredoi/reductionist-rs/deployment/inventory as it did not pass its verify_file() method
[WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details
Parsed /home/vpredoi/reductionist-rs/deployment/inventory inventory source with ini plugin
ERROR! couldn't resolve module/action 'community.docker.docker_container'. This often indicates a misspelling, missing collection, or incorrect module path.

The error appears to be in '/home/vpredoi/reductionist-rs/deployment/site.yml': line 72, column 7, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

    - name: Ensure step-ca container is running
      ^ here
[vpredoi@activeh ~]$ which python3
/usr/bin/python3

Any clues as to what's the hap here? Also, to be 100% sure: if I deploy on multiple machines, do I use this inventory?

[haproxy]
activeh
active2
active3

[jaeger]
localhost ansible_connection=local

[minio]
localhost ansible_connection=local

[prometheus]
localhost ansible_connection=local

[reductionist]
activeh
active2
active3

[step:children]
reductionist

[step-ca]
activeh
active2
active3

[docker:children]
haproxy
jaeger
minio
prometheus
reductionist
step-ca

or should I write the IP addresses instead of activeh, active2, active3? Our new head machine is now called activeh - that's why the 'h', no typo; active2 and active3 are backend nodes. Many thanks again, sorry for the n00b questions :beers:

valeriupredoi commented 5 months ago

hang on - I think I know what's the hap here - turns out I don't have docker on this new activeh machine; wish it'd told me in a more straightforward way :rofl:

markgoddard commented 5 months ago

Hi @valeriupredoi. Looks like you missed this step when setting up the new host as an Ansible control host:

ansible-galaxy collection install -r deployment/requirements.yml

Make sure you've run all the steps from https://stackhpc.github.io/reductionist-rs/deployment/#ansible-control-host-setup too.

markgoddard commented 5 months ago

The inventory looks fine, although to be clear I might specify activeh rather than localhost for these groups:

[jaeger]
activeh

[minio]
activeh

[prometheus]
activeh

Regarding hostnames vs IPs, it doesn't matter, as long as Ansible can reach the host. If you have DNS, great. Otherwise you could add entries to /etc/hosts on activeh for active2 and active3. Or you can set ansible_host=<IP> in the inventory, e.g.

[haproxy]
activeh
active2
active3

[jaeger]
activeh

[minio]
activeh

[prometheus]
activeh

[reductionist]
activeh ansible_host=1.2.3.4
active2 ansible_host=1.2.3.5
active3 ansible_host=1.2.3.6

[step:children]
reductionist

[step-ca]
activeh
active2
active3

[docker:children]
haproxy
jaeger
minio
prometheus
reductionist
step-ca

valeriupredoi commented 5 months ago

many thanks @markgoddard - I had to do some pip magic (I replaced pip3 with pip) and now all goes fine - well, all apart from deployment :laughing: I can only get it to attempt deployment on one of the three machines, usually active3, but it's flaky - sometimes it finds a reachable connection to active2 too - and in either case it fails with

TASK [Gathering Facts] *****************************************************************************************************************************
ok: [active3]

TASK [Ensure /etc/haproxy directory exists] ********************************************************************************************************
changed: [active3]

TASK [Ensure haproxy.cfg is templated] *************************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'dict object' has no attribute 'default_ipv4'. 'dict object' has no attribute 'default_ipv4'
fatal: [active3]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'default_ipv4'. 'dict object' has no attribute 'default_ipv4'"}

Note that I had trimmed the [step-ca] section to only one machine, otherwise I kept getting

TASK [Gathering Facts] *****************************************************************************************************************************
ok: [active3]

TASK [Assert that there is only one CA server] *****************************************************************************************************
fatal: [active3]: FAILED! => {
    "assertion": "groups['step-ca'] | length == 1",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"
}

Any ideas? Also @bnlawrence

valeriupredoi commented 5 months ago

after yet another try it connected only to activeh; this run went further than the others, but still died at the end:

TASK [Assert that there is only one Prometheus server] *********************************************************************************************
ok: [activeh] => {
    "changed": false,
    "msg": "All assertions passed"
}

TASK [Ensure /etc/prometheus directory exists] *****************************************************************************************************
changed: [activeh]

TASK [Ensure CA certificate is copied] *************************************************************************************************************
changed: [activeh]

TASK [Ensure prometheus.yml is templated] **********************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'dict object' has no attribute 'default_ipv4'. 'dict object' has no attribute 'default_ipv4'
fatal: [activeh]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'default_ipv4'. 'dict object' has no attribute 'default_ipv4'"}

PLAY RECAP *****************************************************************************************************************************************
active2                    : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   
active3                    : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   
activeh                    : ok=33   changed=8    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0

My inventory file:

[haproxy]
activeh
active2
active3

[jaeger]
activeh

[minio]
activeh

[prometheus]
activeh

[reductionist]
activeh
active2
active3

[step:children]
reductionist

[step-ca]
activeh

[docker:children]
haproxy
jaeger
minio
prometheus
reductionist
step-ca

and my /etc/hosts file:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.171.169.zzz activeh
192.168.3.yyy active2
192.168.3.xxx active3

(masked off the last three digits for security purposes)

valeriupredoi commented 5 months ago

Never mind me! I managed to get the connections working :partying_face: - a bit of jiggery-pokery with the ssh configuration file and that was that: I have activeh, active2 and active3 all seen by Ansible now, and the thing deploys, though I got this now:

TASK [Bootstrap CA] ********************************************************************************************************************************
skipping: [activeh]
fatal: [active2]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'\n\nThe error appears to be in '/home/vpredoi/reductionist-rs/deployment/site.yml': line 191, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n    - name: Bootstrap CA\n      ^ here\n"}
fatal: [active3]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'\n\nThe error appears to be in '/home/vpredoi/reductionist-rs/deployment/site.yml': line 191, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n    - name: Bootstrap CA\n      ^ here\n"}

What le grand crappe? :grin:

valeriupredoi commented 5 months ago

some update: I managed to step over that error by making a couple of changes to site.yml:

diff --git a/deployment/site.yml b/deployment/site.yml
index e7d9b37..24dd776 100644
--- a/deployment/site.yml
+++ b/deployment/site.yml
@@ -185,14 +185,19 @@
       register: ca_fingerprint
       changed_when: false
       delegate_to: "{{ groups['step-ca'][0] }}"
-      run_once: true
+        #run_once: true
       when: not step_stat.stat.exists

+    - name:
+      debug:
+        var: ca_fingerprint
+
     - name: Bootstrap CA
       ansible.builtin.command: >
         step ca bootstrap
         --ca-url https://{{ hostvars[groups['step-ca'][0]].ansible_facts.default_ipv4.address }}:9999
         --fingerprint {{ ca_fingerprint.stdout }} --install
+      run_once: true
       changed_when: true
       when: not step_stat.stat.exists

@@ -343,6 +348,7 @@
         {{ reductionist_host }}
         {{ reductionist_remote_certs_path }}/cert.pem
         {{ reductionist_remote_certs_path }}/key.pem
+      run_once: true
       changed_when: true
       when: not reductionist_cert_stat.stat.exists

and, of course, by creating a couple of root CA certificates on active2 and active3 - but now Reductionist is not responding correctly:

TASK [Wait for reductionist server to be accessible via HAProxy] ***********************************************************************************
FAILED - RETRYING: [active2]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
FAILED - RETRYING: [active3]: Wait for reductionist server to be accessible via HAProxy (3 retries left).
ok: [activeh]
FAILED - RETRYING: [active3]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [active2]: Wait for reductionist server to be accessible via HAProxy (2 retries left).
FAILED - RETRYING: [active3]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
FAILED - RETRYING: [active2]: Wait for reductionist server to be accessible via HAProxy (1 retries left).
fatal: [active2]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error TLS/SSL connection has been closed (EOF) (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.234:8080/.well-known/reductionist-schema"}
fatal: [active3]: FAILED! => {"attempts": 3, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error TLS/SSL connection has been closed (EOF) (_ssl.c:1129)>", "redirected": false, "status": -1, "url": "https://192.168.3.19:8080/.well-known/reductionist-schema"}
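
For reference, the failing check can be reproduced by hand against the same URL, e.g. (the CA bundle path below is a placeholder for wherever step-ca put the root certificate on the host):

import requests

# Reproduce the playbook's health check by hand. The verify path is a
# placeholder; point it at the root CA certificate that step-ca generated.
resp = requests.get(
    "https://192.168.3.234:8080/.well-known/reductionist-schema",
    verify="/path/to/root_ca.crt",
)
print(resp.status_code)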

I did open the 8080 ports so am not sure what to do next - will need some help from @bnlawrence methinks :beer:

markgoddard commented 5 months ago

I would say in general that the Ansible playbook is designed to be used with all hosts in the inventory. There's a certain amount of referring to facts of other hosts that are assumed to exist.

I think you should trim down the haproxy group to just activeh. HAProxy is not deployed with any support for failover (e.g. using keepalived), and the certificate is generated for the IP of only one of the HAProxy servers. I'll add a check for the group size similar to step-ca.

Regarding the step CLI bootstrap issues, I think it's a side effect of having configured the hosts separately. I'll update the condition to always grab the CA fingerprint so that it's available if any hosts require it. I'm a bit concerned about your addition of run_once to the bootstrap CA task - it should run against all hosts.

markgoddard commented 5 months ago

Some improvements: https://github.com/stackhpc/reductionist-rs/pull/90

valeriupredoi commented 5 months ago

Hi @markgoddard - legend, I'll grab that and run with it, cheers very much :beer:

I'm a bit concerned about your addition of run_once to the bootstrap CA task - it should run against all hosts.

That may explain the final error I reported, which smelled like an SSL issue, but lemme stash the changes I made and run with #90