microsoft / o365-moodle

Office 365 and Azure Active Directory plugins for Moodle
GNU General Public License v3.0

\local_o365\task\groupcreate soft fails frequently #2067

Open thebenkahn opened 2 years ago

thebenkahn commented 2 years ago

Hello, I’d like to bring up an issue we have been seeing frequently, although unfortunately I do not have a lot of solid information to go on. I am wondering if others may have similar issues. Basically, we are seeing two of the main scheduled tasks that the MS365 integration uses “hang” frequently.

The two tasks are the user sync, and group/Teams creation:

\local_o365\task\usersync
\local_o365\task\groupcreate

It happens with both tasks, but it does seem to happen with groupcreate more often (perhaps just because it runs much more frequently by default).

They do not hard fail, meaning there is no logging or error message on the Moodle front end. Tracing the Apache process associated with the scheduled task, it just seems to try to poll the Microsoft API over and over with no response. Usually we catch this within a day or three but I have seen cases where it has been hung for weeks or even months. We have to go in and kill the PID, then the task restarts and runs normally.
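In case it helps anyone else catch these sooner, here is a minimal Python sketch of a watchdog check. It assumes a Linux host where `ps -eo pid,etimes,args` is available and that some identifying string for the task appears in the process arguments; the function name, the `needle` parameter, and the threshold are hypothetical, not part of Moodle or the plugin.

```python
def find_stuck_processes(ps_output, needle, max_seconds):
    """Return (pid, elapsed_seconds) for matching processes older than max_seconds.

    ps_output is the text produced by `ps -eo pid,etimes,args`, header included.
    needle is an assumed identifying substring of the task's command line.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, elapsed seconds, full command line
        if len(parts) < 3:
            continue
        pid, etimes, args = parts
        if not (pid.isdigit() and etimes.isdigit()):
            continue
        if needle in args and int(etimes) > max_seconds:
            stuck.append((int(pid), int(etimes)))
    return stuck
```

This only reports candidates; whether to kill the PID automatically or just raise an alert is a separate decision.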

We have not been able to observe any patterns here. It happens on several of our clients' sites across different infrastructure/environments/Moodle versions. It just seems to come up frequently enough that it is worth raising, to see if anyone else is experiencing the same thing or if there is anything else we could be doing that might help debug it. We have dozens of clients that use the plugins, and we have to kill a hung task, I'd say, an average of 1-3 times per week.

Note: I will say I have not seen this occur on the latest plugin release as of yet, but I am not aware of any changes that would prevent the issue.

Thanks in advance for any thoughts anyone has.

weilai-irl commented 2 years ago

Hi @thebenkahn,

First of all, thank you for reporting the issue.

Could you confirm whether the issue has been seen on any particularly large sites, or on sites with a particularly large number of Graph API calls? I suspect Graph API throttling may have something to do with this; see https://docs.microsoft.com/en-us/graph/throttling for the official documentation on throttling.

When implementing new Graph API calls or updating relatively old ones in the plugins, one of the main considerations has always been to reduce the number of Graph API calls as much as possible, so that throttling isn't triggered. However, this is not always possible. For example, when running a full user sync, each Graph API call returns roughly 200 users along with their basic profiles, so if there are 100K users in the tenant, it will require ~500 calls; however, if there is a profile mapping setting in place for the "manager" or "groups" remote field, this requires a separate call for each user, so 100K Graph API calls. The course sync task is in a similar situation: syncing each course may require multiple Graph API calls. This is how quickly the number of Graph API calls can build up.
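To make the arithmetic concrete, here is a rough Python sketch of that estimate. The function and parameter names are hypothetical, the page size of 200 is the approximate figure mentioned above, and the assumption that each mapped per-user field costs one extra call per user is a simplification.

```python
import math


def estimate_graph_calls(user_count, page_size=200, per_user_fields=0):
    """Rough estimate of Graph API calls for one full user sync.

    page_size: users returned per paged call (roughly 200, per the text above).
    per_user_fields: assumed number of mapped fields that each need one
    separate call per user (e.g. 1 if only "manager" is mapped).
    """
    paged_calls = math.ceil(user_count / page_size)
    per_user_calls = user_count * per_user_fields
    return paged_calls + per_user_calls


print(estimate_graph_calls(100_000))                     # 500
print(estimate_graph_calls(100_000, per_user_fields=1))  # 100500
```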

When throttling happens, Graph API may not respond to the request in time, which may leave the plugin waiting a long time for a response. There is some handling of throttling in the plugin's implementation, but I'm not confident that it still works (it was done by the previous maintainer of the plugins and we haven't had the need to touch it yet).
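For illustration only: according to the throttling documentation, a throttled request gets an HTTP 429 response with a `Retry-After` header, and the usual client-side pattern is to wait that long and to give up after a bounded number of attempts rather than retrying indefinitely. A minimal Python sketch of that pattern follows. It is not the plugin's actual code; `send` is a hypothetical callable returning `(status, headers, body)`, and the attempt limit and default wait are assumed values.

```python
import time


def call_with_throttle_handling(send, max_attempts=5, default_wait=5.0,
                                sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is a hypothetical callable returning (status, headers, body);
    it is expected to enforce its own network timeout.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts:
            break  # do not sleep after the final attempt
        try:
            # Retry-After is read as a number of seconds here.
            delay = float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            delay = default_wait  # header missing or not numeric
        sleep(delay)
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```

The point of the bounded loop is that a throttled task fails visibly, so it can be logged and rescheduled, instead of hanging silently.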

My suggestion would be to keep monitoring this and try to identify the situations in which it is more likely to happen. For example, if the sites where you saw this happen didn't have a large number of users/courses to sync, then throttling may be irrelevant and we should look at other possible causes.

Please keep me updated on this.

Regards, Lai

thebenkahn commented 2 years ago

Hi Lai, many thanks for the information. To my knowledge we do not have any instances where the manager or groups field is mapped. From a quick review, the common threads on sites where this happens are 40K-plus users and/or use of the full sync. Knowing that throttling could be the issue, we will also check the Health status page if we see this happening. We will likely tweak settings on a few of these sites to see if that improves the situation, and will post back here whether or not we notice a change. Thanks!