hep-gc / shoal

A squid cache publishing and advertising tool designed to work in fast-changing environments
Apache License 2.0

shoal-client add configuration option to prioritize default squids #171

Open colsond opened 3 years ago

colsond commented 3 years ago

Over the lifetime of shoal we've always struggled with specific clouds' VMs connecting to squids that were not the expected ones. With the advent of csv2 we have the ability to use yaml to update config files with explicit strings provided by cloudscheduler. For a lot of clouds we know specifically which squid(s) we'd like the VMs to connect to.

After a discussion with Randy and Colin this is what I propose: we introduce a single new configuration option called something like use_static_squids or prioritize_default_squids. If this flag is set then, instead of appending the defaults to the end of the list of squids, they are placed at the beginning. We could even take it a step further and skip contacting the shoal server unless the VM is unable to contact any of the defaults.
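As a rough illustration, the proposed ordering logic in shoal-client could look like this (a sketch only, not shoal's actual code: prioritize_default_squids is the option name discussed above, while the function and argument names are invented here):

```python
def order_squids(default_squids, server_squids, prioritize_default_squids=False):
    """Return the list of squid URIs the client should try, in order.

    Current behaviour: squids from the shoal server first, defaults appended.
    Proposed behaviour: with the flag set, the defaults are tried first.
    Duplicates are dropped so a squid is only tried once.
    """
    if prioritize_default_squids:
        return default_squids + [s for s in server_squids if s not in default_squids]
    return server_squids + [s for s in default_squids if s not in server_squids]
```

With the flag unset the behaviour is unchanged, so existing deployments would not notice the new option.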

This should allow anyone using shoal to continue using it as-is with no change in behaviour, while in csv2 we could specify yaml for the specific clouds (any or all) where we want to control the squid usage. Having this yaml update the 2 configuration values for the client (the default squids and the new "prioritize_default_squids" variable) would provide a much more static experience for clouds that have not been using the ideal squids.

crlb commented 3 years ago

Hi Colson, I would like to make it a little simpler: modify shoal-client to check for an assigned-squids configuration file, say /etc/shoal/assigned_squids.yaml. If it doesn't exist, the client continues just like it does now. If the new file exists, it contains URIs for the squids to use, in priority order. No need for a flag: the file either exists or it doesn't. Then, for clouds where you want to manually configure squids (so they use the right ones), you just create a cloud yaml in csv2 that writes /etc/shoal/assigned_squids.yaml on the VM. Colin.
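A minimal sketch of this file-based variant (the path comes from the comment above; the file layout and parsing are assumptions for illustration, with a deliberately simple one-URI-per-line parse instead of a full YAML dependency):

```python
import os

ASSIGNED_SQUIDS_FILE = "/etc/shoal/assigned_squids.yaml"

def get_assigned_squids(path=ASSIGNED_SQUIDS_FILE):
    """Return assigned squid URIs in priority order, or None if no file exists.

    None means "no file": the client should behave exactly as it does today.
    An entry is any line of the form "- <uri>"; a real client would use a
    proper YAML parser instead of this line-based sketch.
    """
    if not os.path.exists(path):
        return None
    squids = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("- "):
                squids.append(line[2:].strip())
    return squids
```

The nice property of this design is that the entire behaviour switch is the presence or absence of one file, which csv2's cloud yaml can create or omit per cloud.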


MarcusEbert commented 3 years ago

This should allow anyone using shoal to continue using it as-is with no change in behaviour, while in csv2 we could specify yaml for the specific clouds (any or all) where we want to control the squid usage. Having this yaml update the 2 configuration values for the client (the default squids and the new "prioritize_default_squids" variable) would provide a much more static experience for clouds that have not been using the ideal squids.

At least over the last year or so, when there were issues with squid selection it turned out that the problem was not the selection in shoal but the accessibility of squids, e.g. local squids were not accessible for some reason: there was just no "ideal" squid available and the client failed over to others. This issue is also completely gone with the new shoal version. In the new version, when it seems a wrong squid is selected, it has so far always turned out that no other, better squids are available.
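Failover of this kind comes down to a reachability check: try the preferred squids first and only fall back when none respond. A sketch of that idea (all function names are illustrative; the probe is a plain TCP connect to the squid's host:port, and the fallback lookup stands in for a query to the shoal server):

```python
import socket
from urllib.parse import urlparse

def squid_reachable(uri, timeout=2.0):
    """Best-effort TCP probe of a squid URI such as http://squid-a:3128."""
    parsed = urlparse(uri)
    try:
        with socket.create_connection((parsed.hostname, parsed.port or 3128), timeout):
            return True
    except OSError:
        return False

def pick_squid(configured, fallback_lookup, reachable=squid_reachable):
    """Return the first reachable configured squid, else consult fallback_lookup."""
    for uri in configured:
        if reachable(uri):
            return uri
    return fallback_lookup()
```

Injecting the probe as a parameter keeps the selection logic testable without a live squid; a real client would also want retries and caching of probe results.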

While we could add a new file for general usage and make changes to the client, it will not make anything better until we have more squids available, and in that case the current shoal would give the correct one too. We should also keep in mind that the new shoal-client can find a squid running the new shoal-agent in the same network by itself, without needing to contact the shoal server.
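The in-network discovery mentioned here isn't detailed in this thread, so purely as a generic illustration of the pattern: an agent answering a discovery datagram with the squid URI it advertises. Everything below (probe payload, message format, use of UDP) is invented for the sketch and is not shoal's actual protocol.

```python
import socket

def agent_socket(host="127.0.0.1"):
    """Bind the agent's discovery socket; returns (socket, port).

    Port 0 lets the OS pick a free port for the sketch; a real agent
    would listen on a fixed, well-known port on the local network.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))
    return sock, sock.getsockname()[1]

def answer_one_probe(sock, squid_uri):
    """Agent side: reply to a single discovery probe with the advertised URI."""
    _, addr = sock.recvfrom(1024)
    sock.sendto(squid_uri.encode(), addr)

def discover(host, port, timeout=2.0):
    """Client side: send a probe and return the squid URI the agent advertises."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(b"SHOAL_DISCOVER", (host, port))
        data, _ = sock.recvfrom(1024)
        return data.decode()
    finally:
        sock.close()
```

The point of the sketch is Marcus's observation: this only works when agent and client share a network segment, which is exactly the case the later comments say often does not hold.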

For those reasons, having an additional hard-coded list would most likely only help with squids that are outside of the network/cloud and not registered in shoal. Using a squid outside of a cloud also sometimes seems to cause problems with the general squids at sites, since the traffic from all VMs appears to come from a single IP. We got such reports from ECDF, where ATLAS jobs failed over to CERN for that reason; only putting a dedicated squid in the cloud project solved that issue. That issue is not visible in the cvmfs config but only in the job logs, since the jobs use the frontier option.

DrDaveD commented 3 years ago

I think that cvmfs and frontier clients should never be configured to fail over to squids in other networks/clouds unless those are intended as emergency backups only and are closely watched for failovers, as is the case for the default cvmfs & frontier backup squids at Fermilab & CERN. Whenever there are large numbers of clients in a network or cloud, the clients need their own local squid or squids. When there are only a few clients, they may be configured with no squids and directly use the Cloudflare aliases, which is much better than any fallback squids we could provide for long-term use.

The default configuration for CernVM either is now, or will soon be, to use WLCG Web Proxy Auto Discovery via the cernvm-wpad.{fnal.gov|cern.ch} aliases. Those will discover squids registered in the shoal server or registered statically by WLCG (that is, by CMS or ATLAS). That's the same behaviour as the other WLCG WPAD aliases; what's different is that if no squids are found for a geoip organization, they keep track of the number of requests over time, and as long as the count from a single geoip organization stays small they return no squids, so connections go directly through to Cloudflare. If the number from one organization per period of time gets high, they redirect connections to the Fermilab or CERN backup squids, so the WLCG squid operations team can quickly see where new squids need to be provided and can try to get in touch with the users.
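The decision logic described here can be summarized in a few lines (a hedged sketch of the behaviour as described in this comment only: the real WLCG WPAD service runs server-side, and the names and threshold handling below are invented for illustration):

```python
def wpad_response(org_squids, requests_this_period, threshold, backup_squids):
    """What a WPAD server returns for one geoip organization, per this thread.

    org_squids: squids known for the organization (shoal or WLCG static).
    requests_this_period: recent request count seen from that organization.
    """
    if org_squids:
        return org_squids            # registered squids exist: hand them out
    if requests_this_period <= threshold:
        return []                    # few clients: no squids, go direct (Cloudflare)
    return backup_squids             # heavy direct use: route via Fermilab/CERN
                                     # backups so operators see where squids are needed
```

Returning the backup squids is thus not a performance feature but a signal: the failover traffic makes under-provisioned organizations visible to the squid operations team.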

MarcusEbert commented 3 years ago

I think that cvmfs and frontier clients should never be configured to fail over to squids in other networks/clouds

Squids in other networks may be used, depending on how you define "same network". One example: a Grid site that has traditional worker nodes and squids and wants to extend itself dynamically with VMs on an OpenStack instance at the university/lab. Those VMs are in a different network than the rest of the site, at least inside OpenStack, but it should be fine for them to use the squids outside of the OpenStack system. Or, if you have two different OpenStack projects and only one has squids running, those would also look like two different networks. So squids in different networks within the same site should be fine to use, whether that is inside a cloud, outside a cloud, or between a cloud and the rest of the site. But in these cases the shoal-agent's broadcast will not be seen by the clients. And unfortunately, not all sites that provide OpenStack access also have a shoal-agent running on the bare-metal squids used for their other worker nodes. These squids could only be used if we make them available to the client in some way other than shoal: WPAD may be one way, if the squids are registered there, or a static file that contains such squids.

In general, I don't like a static hard-coded approach and would prefer a dynamic process taking care of this. In any case, a solution here should be usable outside of ATLAS/CMS too: not only for our HEP use case, but in general by anyone who needs cvmfs.

DrDaveD commented 3 years ago

In general, I don't like a static hard-coded approach and would prefer a dynamic process taking care of this. In any case, a solution here should be usable outside of ATLAS/CMS too: not only for our HEP use case, but in general by anyone who needs cvmfs.

That's why we use the dynamically registered squids in shoal. The reason we also use the ones known to ATLAS & CMS is that those are the two huge projects that use Frontier, and since Frontier puts a high load on squid, we can be assured that those squids are configured with high capacity. Other projects running at the same sites can also generally use those squids.