scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

When perftune.py is executed on an im4gn.8xlarge (AWS ARM) instance, the results of `--get-irq-cpu-mask` and `--dump-options-file` conflict #1783

Closed ShlomiBalalis closed 1 year ago

ShlomiBalalis commented 1 year ago

Issue description

When examining the results of perftune.py across all supported instance types in AWS, we noticed that on an im4gn.8xlarge instance the value of irq_cpu_mask in the output of --dump-options-file does not match the output of --get-irq-cpu-mask, although the two should be identical:

< t:2023-07-23 23:45:08,012 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylladb/scripts/perftune.py --tune net --nic eth0 --get-cpu-mask"...
< t:2023-07-23 23:45:08,452 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 0xfffffffc

< t:2023-07-23 23:45:08,953 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylladb/scripts/perftune.py --tune net --nic eth0 --get-irq-cpu-mask"...
< t:2023-07-23 23:45:09,358 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 0x00000003

< t:2023-07-23 23:45:09,858 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylladb/scripts/perftune.py --tune net --nic eth0 --dump-options-file --mode sq_split"...
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > cpu_mask: '0xffffffff'
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > irq_core_auto_detection_ratio: 16
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > irq_cpu_mask: '0x00000001'
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > nic:
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > - eth0
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > tune:
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > - net
< t:2023-07-23 23:45:10,177 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 
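
To make the masks in the log above easier to compare, here is a minimal sketch that expands each hex mask into a list of CPU indices. It assumes the usual Linux convention that bit i of the mask stands for CPU i, and it only handles single-word masks like the ones logged here. The helper name `mask_to_cpus` is hypothetical and is not part of perftune.py.

```python
def mask_to_cpus(mask: str) -> list:
    """Expand a hex CPU mask such as '0x00000003' into CPU indices.

    Assumption: bit i set means CPU i is included (single-word mask only).
    """
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Values copied from the log above.
default_compute_mask = "0xfffffffc"   # --get-cpu-mask, no --mode given
default_irq_mask = "0x00000003"       # --get-irq-cpu-mask, no --mode given
sq_split_irq_mask = "0x00000001"      # irq_cpu_mask from --dump-options-file --mode sq_split

print(mask_to_cpus(default_irq_mask))    # [0, 1]
print(mask_to_cpus(sq_split_irq_mask))   # [0]

# Without --mode, the compute and IRQ masks do not overlap.
overlap = int(default_compute_mask, 16) & int(default_irq_mask, 16)
print(overlap == 0)                      # True
```

Read this way, the run without `--mode` reports two IRQ CPUs (0 and 1), while the `--mode sq_split` dump reports one (CPU 0).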

Installation details

Kernel Version: 5.15.0-1039-aws

Scylla version (or git commit hash): 5.4.0~dev-20230717.567b4536892f with build-id aeddfeffed882ccadcef1106c18736e2200efba4

Cluster size: 1 nodes (im4gn.8xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0b76b3c882188e46a (aws: us-east-1)

Test: artifacts-ami-arm-print_perftune

Test id: b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83

Test name: scylla-staging/Shlomo/artifacts-ami-arm-print_perftune

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83)
- Show all stored logs command: `$ hydra investigate show-logs b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83`

## Logs:

- **db-cluster-b051cb11.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83/20230723_234548/db-cluster-b051cb11.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83/20230723_234548/db-cluster-b051cb11.tar.gz)
- **sct-runner-events-b051cb11.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83/20230723_234548/sct-runner-events-b051cb11.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83/20230723_234548/sct-runner-events-b051cb11.tar.gz)
- **sct-b051cb11.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83/20230723_234548/sct-b051cb11.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83/20230723_234548/sct-b051cb11.log.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/Shlomo/job/artifacts-ami-arm-print_perftune/5/)

[Argus](https://argus.scylladb.com/test/a4edc76e-1ad2-432a-b116-36e0123836a5/runs?additionalRuns[]=b051cb11-fe8e-4d0c-8a0f-78cf91bf0a83)

mykaul commented 1 year ago

Generally, I think we lack ARM support in the script. Random example - https://github.com/scylladb/seastar/blob/6c544e02d700b7d4702764cb0477b885727d10be/scripts/perftune.py#L1087

vladzcloudius commented 1 year ago

And how exactly is this a conflict, @ShlomiBalalis? You are running one command without --mode sq_split and then the second command with it.
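
A like-for-like comparison would pass the same `--mode` to both invocations and only then compare the two masks. The sketch below illustrates that idea under stated assumptions: the helper names `perftune_cmd` and `parse_irq_cpu_mask` are hypothetical, the script path and flags are copied from the log in the issue description, and combining `--mode` with `--get-irq-cpu-mask` is taken from this comment rather than verified against the script. The options text is parsed with a regular expression instead of a YAML library to keep the example self-contained.

```python
import re
from typing import List, Optional

PERFTUNE = "/opt/scylladb/scripts/perftune.py"  # path as it appears in the log


def perftune_cmd(action: str, mode: Optional[str] = None) -> List[str]:
    """Build a perftune.py command line; append --mode only when one is given."""
    cmd = [PERFTUNE, "--tune", "net", "--nic", "eth0", action]
    if mode is not None:
        cmd += ["--mode", mode]
    return cmd


def parse_irq_cpu_mask(options_text: str) -> int:
    """Extract irq_cpu_mask from --dump-options-file output as an integer."""
    match = re.search(r"^irq_cpu_mask:\s*'?(0x[0-9a-fA-F]+)'?\s*$",
                      options_text, re.MULTILINE)
    if match is None:
        raise ValueError("irq_cpu_mask not found in options text")
    return int(match.group(1), 16)


# Options text copied from the log (produced with --mode sq_split).
DUMP = """cpu_mask: '0xffffffff'
irq_core_auto_detection_ratio: 16
irq_cpu_mask: '0x00000001'
nic:
- eth0
tune:
- net
"""

# Mask reported by --get-irq-cpu-mask in the log (run without --mode).
reported = int("0x00000003", 16)
print(reported == parse_irq_cpu_mask(DUMP))  # False: the two runs used different modes

# Commands that would use the same mode on both sides of the comparison.
print(perftune_cmd("--get-irq-cpu-mask", "sq_split"))
print(perftune_cmd("--dump-options-file", "sq_split"))
```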

mykaul commented 1 year ago

@ShlomiBalalis - is the YAML you've attached correct? It has 'instance_type_db: 'i3.large'' in it.

vladzcloudius commented 1 year ago

> Generally, I think we lack ARM support in the script. Random example - https://github.com/scylladb/seastar/blob/6c544e02d700b7d4702764cb0477b885727d10be/scripts/perftune.py#L1087

Not true, @mykaul. The line above only has the combination where we know we want to change the clock settings. For ARM we simply don't want to do that at the moment.

To the best of my knowledge, the scripts support ARM just fine. If you have different information, please report it.

I suggest closing this issue since the reported output is expected given the provided arguments.

@ShlomiBalalis keep in mind that ARM doesn't have HT support, hence every CPU is a physical core.
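
Whether a given host exposes hyperthreads can be checked from sysfs. The following is a minimal sketch, assuming a Linux host that provides `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the function names are hypothetical and unrelated to perftune.py. If every sibling group contains a single CPU, every logical CPU is its own physical core.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list:
    """Parse a kernel CPU list such as '0,16' or '0-3' into CPU indices."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_sibling_groups(sysfs_root: str = "/sys/devices/system/cpu") -> list:
    """Collect the distinct hardware-thread sibling groups reported by sysfs."""
    groups = set()
    for path in Path(sysfs_root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


def has_smt(groups) -> bool:
    """True if any physical core exposes more than one hardware thread."""
    return any(len(group) > 1 for group in groups)


if __name__ == "__main__":
    groups = smt_sibling_groups()
    print("sibling groups:", groups)
    print("SMT/HT present:", has_smt(groups))
```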

ShlomiBalalis commented 1 year ago

> Generally, I think we lack ARM support in the script. Random example - https://github.com/scylladb/seastar/blob/6c544e02d700b7d4702764cb0477b885727d10be/scripts/perftune.py#L1087

If that's true, I was not told about it. All of the other ARM instance types that were tested gave consistent results.

> And how exactly is this a conflict, @ShlomiBalalis? You are running one command without --mode sq_split and then the second command with it.

You mean the --get-irq-cpu-mask command? https://docs.google.com/document/d/1D0aVkeMDHqeJdzB5tw5s-34Ihjaf2D9Z8G-qJUv-WQY/edit?pli=1 Your instructions never stated that I should use the mode parameter with it (which would not make sense for the larger machines, since I need to use the result of --get-irq-cpu-mask to execute --dump-options-file).

> @ShlomiBalalis - is the YAML you've attached correct? It has 'instance_type_db: 'i3.large'' in it.

The instance type in the YAML is irrelevant, since it was overridden in the Jenkins job parameters.

vladzcloudius commented 1 year ago

Let's move this discussion to the correct context, @ShlomiBalalis

mykaul commented 1 year ago

> To the best of my knowledge, the scripts support ARM just fine. If you have different information, please report it.

I don't - I wasn't sure if this was intentional or not. BTW, would be nice to see if it works well on https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-c7gd-m7gd-and-r7gd-powered-by-aws-graviton3-processor-with-local-nvme-based-ssd-storage/

vladzcloudius commented 1 year ago

> > To the best of my knowledge, the scripts support ARM just fine. If you have different information, please report it.
>
> I don't - I wasn't sure if this was intentional or not. BTW, would be nice to see if it works well on https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-c7gd-m7gd-and-r7gd-powered-by-aws-graviton3-processor-with-local-nvme-based-ssd-storage/

To the best of my knowledge, the configuration mode will be as advertised. Whether it would be a good mode for this particular platform, only extensive performance benchmarking can tell.