scylladb / scylla-machine-image

Apache License 2.0
18 stars 25 forks source link

Optimize cloud detection #483

Closed syuu1228 closed 6 months ago

syuu1228 commented 9 months ago

Currently get_cloud_instance() is very slow on GCE / Azure, since it fails to detect AWS metadata server, and retrying multiple times in curl().

Actually we don't really need to do detection, we can specify target cloud name statically when we build machine-image.

To do so, adding target cloud name to machine_image_info.yaml while building machine-image, and reference it from scylla_cloud.py.

Note that we still need to fallback to metadata server based detection, since there are users who manually install scylla-machine-image package , not using our machine-image.

Fixes #481

syuu1228 commented 9 months ago

Tested at https://jenkins.scylladb.com/view/master/job/scylla-master/job/releng-testing/job/next-machine-image/196/

mykaul commented 9 months ago

nit: you should be able to reliably detect cloud based on DMI data - from https://coroot.com/blog/cloud-metadata :

AWS Xen instances: the uuid from /sys/hypervisor/uuid has the "ec2" or "EC2" prefix
AWS Nitro instances: the content of /sys/class/dmi/id/board_vendor is "Amazon EC2"
GCP: the content of /sys/class/dmi/id/board_vendor is "Google"
Azure: the content of /sys/class/dmi/id/board_vendor is "Microsoft Corporation"
syuu1228 commented 9 months ago

nit: you should be able to reliably detect cloud based on DMI data - from https://coroot.com/blog/cloud-metadata :

AWS Xen instances: the uuid from /sys/hypervisor/uuid has the "ec2" or "EC2" prefix
AWS Nitro instances: the content of /sys/class/dmi/id/board_vendor is "Amazon EC2"
GCP: the content of /sys/class/dmi/id/board_vendor is "Google"
Azure: the content of /sys/class/dmi/id/board_vendor is "Microsoft Corporation"

I also considered that, it would works well on AWS / GCP, but on Azure DMI conteints are exactly same as baremetal Hyper-V. To cause correct error message "Unknown cloud provider! Only AWS/GCP/Azure supported" on baremetal Hyper-V, we still need to keep metadata server access on azure_instance.is_azure_instance(), so we cannot drop network access entirely from cloud detection.

see: https://stackoverflow.com/questions/11570965/how-to-detect-azure-amazon-vm

Maybe we can wait metadata server access just for Azure, since it will likely success because in that case we already know it's not AWS nor GCP.

Or, maybe we can ignore about non-cloud environment, since scylla-machine-image script will does not work on such environment at some point. But it will not clealer message I mentioned above, it will cause Traceback at some point instead.

mykaul commented 9 months ago

nit: you should be able to reliably detect cloud based on DMI data - from https://coroot.com/blog/cloud-metadata :

AWS Xen instances: the uuid from /sys/hypervisor/uuid has the "ec2" or "EC2" prefix
AWS Nitro instances: the content of /sys/class/dmi/id/board_vendor is "Amazon EC2"
GCP: the content of /sys/class/dmi/id/board_vendor is "Google"
Azure: the content of /sys/class/dmi/id/board_vendor is "Microsoft Corporation"

I also considered that, it would works well on AWS / GCP, but on Azure DMI conteints are exactly same as baremetal Hyper-V. To cause correct error message "Unknown cloud provider! Only AWS/GCP/Azure supported" on baremetal Hyper-V, we still need to keep metadata server access on azure_instance.is_azure_instance(), so we cannot drop network access entirely from cloud detection.

see: https://stackoverflow.com/questions/11570965/how-to-detect-azure-amazon-vm

Maybe we can wait metadata server access just for Azure, since it will likely success because in that case we already know it's not AWS nor GCP.

Or, maybe we can ignore about non-cloud environment, since scylla-machine-image script will does not work on such environment at some point. But it will not clealer message I mentioned above, it will cause Traceback at some point instead.

We can ignore Hyper-V, or treat it just like Azure - either path is fine, IMHO.

fruch commented 9 months ago

We could do something similar to

https://github.com/dgzlopes/cloud-detect/blob/master/cloud_detect/providers/azure_provider.py

mykaul commented 9 months ago

We could do something similar to

https://github.com/dgzlopes/cloud-detect/blob/master/cloud_detect/providers/azure_provider.py

This needs the networking stack to be up (and IPv4 configured, what about IPv6 only?) - we'd like to avoid that.

avikivity commented 9 months ago

There are also metal instances, which might be different.

mykaul commented 9 months ago

There are also metal instances, which might be different.

That could be the fallback if above detection fails.

syuu1228 commented 9 months ago

Pushed new implementation which uses DMI for the detection, and async/await for metadata server access (when DMI probe failed). The implementation referenced https://github.com/dgzlopes/cloud-detect, borrowed some ideas:

syuu1228 commented 9 months ago

Ah, it breaking the unittest, fixing...

syuu1228 commented 9 months ago

Testcase fixed.

yaronkaikov commented 9 months ago

@syuu1228 Please rebase and re-test the verification job

syuu1228 commented 9 months ago

LGTM

even that I would prefer using https://github.com/dgzlopes/cloud-detect directly and not copying it's implementation (even it's MIT license, and we cloud do it legally)

We actually can package cloud-detect package into scylla-python3 by updating install-dependencies.sh in scylla repo, but I did not prefer to update that just for scylla-machine-image so I haven't added any package for scylla-machine-image scripts. That's the reason why I haven't used cloud-detect in this PR, and also other reason is we already have metadata based detection and DMI based one should be very small change.

But if we copy scylla-cqlsh implementation which adds zipped pip package into relocatable package, then we can use pip packages on scylla-machine-image too, that would be alternative way to implement this.

fruch commented 9 months ago

LGTM

even that I would prefer using https://github.com/dgzlopes/cloud-detect directly and not copying it's implementation (even it's MIT license, and we cloud do it legally)

We actually can package cloud-detect package into scylla-python3 by updating install-dependencies.sh in scylla repo, but I did not prefer to update that just for scylla-machine-image so I haven't added any package for scylla-machine-image scripts. That's the reason why I haven't used cloud-detect in this PR, and also other reason is we already have metadata based detection and DMI based one should be very small change.

But if we copy scylla-cqlsh implementation which adds zipped pip package into relocatable package, then we can use pip packages on scylla-machine-image too, that would be alternative way to implement this.

A simple vendoring of the code as a submodule, might also work in this case.

syuu1228 commented 7 months ago

even that I would prefer using https://github.com/dgzlopes/cloud-detect directly and not copying it's implementation (even it's MIT license, and we cloud do it legally)

We actually can package cloud-detect package into scylla-python3 by updating install-dependencies.sh in scylla repo, but I did not prefer to update that just for scylla-machine-image so I haven't added any package for scylla-machine-image scripts. That's the reason why I haven't used cloud-detect in this PR, and also other reason is we already have metadata based detection and DMI based one should be very small change. But if we copy scylla-cqlsh implementation which adds zipped pip package into relocatable package, then we can use pip packages on scylla-machine-image too, that would be alternative way to implement this.

A simple vendoring of the code as a submodule, might also work in this case.

Tried to add cloud-detect as submodule and load from scylla-machine-image, but it seems difficult because the module is hevily depend with aiohttp, witch contains C extension. Probably we should just update scylla-python3 dependencies to add the module since it's simplest solution.

avikivity commented 7 months ago

Is there no simple way to do this without adding complicated packages?

syuu1228 commented 7 months ago

Is there no simple way to do this without adding complicated packages?

I think current patch on this PR does this. It referenced cloud-detect, borrowed some ideas, but not adding additional packages: https://github.com/scylladb/scylla-machine-image/pull/483#issuecomment-1742664119

benipeled commented 7 months ago

@syuu1228 if there is nothing else to add for this change - please verify it and make sure the slowness issue resolved so we can merge it

syuu1228 commented 6 months ago

@yaronkaikov Tested on https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/next-machine-image/264/, please merge this.