Open arlake228 opened 3 years ago
So far, I haven't been successful in recreating this behavior on my development host. Is there any more information about the servers that are exhibiting this issue [e.g., host OS & version, version of PWA installed, installation method (docker or RPM)]?
Hi @grigutis, I've seen this problem when I updated our PWA to the latest version here at RNP. We have a machine with CentOS 7.9 where we run the docker containers using this docker-compose file. I thought it was a bug related to the JSON file size, so I removed most tests and reduced the number of hosts in the mesh, but I still had the problem from time to time. I ended up rolling back to the previous version that was working.
@DanielNeto I'm still not able to reproduce this. Would you be able to get some logs for me? This should do it:
docker logs -f --since 0m docker_pwa-pub1_1 > ~/pub.log & \
docker logs -f --since 0m docker_pwa-admin1_1 > ~/admin.log & \
docker logs -f --since 0m docker_mongo_1 > ~/mongo.log & \
docker logs -f --since 0m docker_nginx_1 > ~/nginx.log &
Run that, reproduce the problem, then you can kill those jobs and attach the logs to this issue.
@DanielNeto Actually, maybe logs won't be necessary after all. I finally was able to reproduce this. It is only appearing when configs have tests that use a disjoint topology.
Thanks to a user in Slack, I now know how to reliably reproduce this problem. It apparently only occurs when the app is under load. For example:
$ ab -n 100 -c 2 https://psconfig.opensciencegrid.org/pub/config/opn-all
and while that is going on, do
$ for i in `seq 1 10` ; do curl -s https://psconfig.opensciencegrid.org/pub/config/opn-all | wc -c ; done
If it's working correctly, you should see the same byte count for all 10 iterations. If it's not, you won't.
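The same check can be scripted so the result is a tally rather than something to eyeball. This is only a rough sketch: `countBodySizes` and `fetchText` are hypothetical helper names, and the global `fetch()` assumes Node.js 18 or newer. It still needs `ab` (or some other load) running alongside it to trigger the problem.

```javascript
// Fetch the same URL several times and tally how many responses had each
// body size. The fetcher is passed in so the tally logic can be exercised
// against a stub without touching the network.
async function countBodySizes(fetchBody, url, iterations) {
  const sizes = new Map();
  for (let i = 0; i < iterations; i++) {
    const body = await fetchBody(url);
    const bytes = Buffer.byteLength(body, "utf8");
    sizes.set(bytes, (sizes.get(bytes) || 0) + 1);
  }
  return sizes;
}

// Default fetcher. Assumption: Node.js 18+, where fetch() is a global.
async function fetchText(url) {
  const res = await fetch(url);
  return res.text();
}

// Example (needs network access, run while `ab` is generating load):
// countBodySizes(fetchText, "https://psconfig.opensciencegrid.org/pub/config/opn-all", 10)
//   .then((sizes) => console.log(sizes.size === 1 ? "consistent" : "INCONSISTENT", sizes));
```

A healthy server should yield a map with a single key; more than one key means the byte count varied between requests.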
I've also been reading a book about Node.js Design Patterns and came across something that sounds like it might be what is causing this issue.
One of the most dangerous situations is to have an API that behaves synchronously under certain conditions and asynchronously under others. … The bug that you've just seen can be extremely complicated to identify and reproduce in a real application. Imagine using a similar function in a web server, where there can be multiple concurrent requests. Imagine seeing some of those requests hanging, without any apparent reason and without any error being logged. This can definitely be considered a nasty defect.
I think the problem lies somewhere here, but that's just a hunch. I see that promise is being overridden in Mongoose, but I'm not sure yet whether that has anything to do with it.
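For anyone who hasn't run into the pattern the book excerpt describes, here is a minimal sketch of it. Nothing below is taken from PWA: `inconsistentLookup`, `consistentLookup`, `createLookup`, and `demo` are hypothetical names, and the cache stands in for whatever makes a real function sometimes skip its asynchronous path.

```javascript
const cache = new Map();

// Asynchronous on a cache miss, synchronous on a cache hit: the dangerous mix.
function inconsistentLookup(key, callback) {
  if (cache.has(key)) {
    callback(cache.get(key)); // runs before the caller can attach listeners
    return;
  }
  setTimeout(() => {
    cache.set(key, `value-for-${key}`);
    callback(cache.get(key));
  }, 10);
}

// Same lookup, but the cache-hit path is deferred so it is always asynchronous.
function consistentLookup(key, callback) {
  if (cache.has(key)) {
    process.nextTick(() => callback(cache.get(key)));
    return;
  }
  setTimeout(() => {
    cache.set(key, `value-for-${key}`);
    callback(cache.get(key));
  }, 10);
}

// Starts a lookup and lets the caller register listeners afterwards.
function createLookup(lookup, key) {
  const listeners = [];
  lookup(key, (value) => listeners.forEach((listener) => listener(value)));
  return { onData: (listener) => listeners.push(listener) };
}

// Runs two lookups for the same key, the second one from inside the first
// listener, and reports which listeners actually fired.
function demo(lookup, key, done) {
  const seen = [];
  createLookup(lookup, key).onData((first) => {
    seen.push(`first:${first}`);
    createLookup(lookup, key).onData((second) => seen.push(`second:${second}`));
  });
  setTimeout(() => done(seen), 50);
}

// Usage (use a different key per run so each starts with a cold cache):
// demo(inconsistentLookup, "a", console.log); // second listener never fires
// demo(consistentLookup, "b", console.log);   // both listeners fire
```

With the inconsistent version, the second lookup hits the cache and invokes its callback before `onData` has registered anything, so the second listener is silently lost with no error logged.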
Just to give some more details about this …
A colleague and I took a deeper look at this and when the issue appears, the host_groups_details variable is not being fully populated before the psconfig JSON object is returned.
We're not sure exactly where the error is happening, because the nested async functions and anonymous callbacks make the code very confusing to follow, but in general the flow goes like this (all in meshconfig.js):
exports.generate -> exports._process_published_config -> async.eachSeries -> async.parallel -> generate_group_members -> resolve_hostgroup
We made several attempts to fix the problem, but none were successful, and we came to the conclusion that rewriting the "/config/:url" route from scratch was probably the best way forward.
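To illustrate the general class of bug, where a result object is handed back before the nested async work that fills it has finished, here is a simplified sketch. It is not the meshconfig.js code and the actual cause in PWA is not confirmed; `resolveGroup`, `collectDetailsBuggy`, and `collectDetailsFixed` are hypothetical names, and plain callbacks are used instead of the `async` library to keep it self-contained.

```javascript
// Hypothetical async resolver standing in for a database lookup.
function resolveGroup(name, callback) {
  setTimeout(() => callback(null, { name, hosts: [`${name}-host`] }), 5);
}

// Buggy: the final callback runs without waiting for the lookups, so the
// caller receives an object that is still being filled in.
function collectDetailsBuggy(names, callback) {
  const details = {};
  names.forEach((name) => {
    resolveGroup(name, (err, group) => {
      if (!err) details[name] = group;
    });
  });
  callback(null, details); // fires before any lookup has completed
}

// Fixed: count completions and call back only after the last lookup returns.
function collectDetailsFixed(names, callback) {
  const details = {};
  let pending = names.length;
  let failed = false;
  if (pending === 0) return callback(null, details);
  names.forEach((name) => {
    resolveGroup(name, (err, group) => {
      if (failed) return;
      if (err) {
        failed = true;
        return callback(err);
      }
      details[name] = group;
      pending -= 1;
      if (pending === 0) callback(null, details);
    });
  });
}
```

In the buggy variant the object keeps being mutated after it has been handed back, so how much of it is populated when it finally gets serialized depends on timing. That would be consistent with output that only varies under load.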
Just wondering about the status on this. For OSG/WLCG we are worried that "variable" configs coming from PWA could be part of the problems we are seeing. We can track how often this is occurring using our CheckMK monitoring. For psconfig-itb see https://psetf.aglt2.org/etf/check_mk/index.py?start_url=%2Fetf%2Fpnp4nagios%2Findex.php%2Fgraph%3Fhost%3Dpsconfig-itb%26srv%3Dpsconfig-itb_stats%26theme%3Dmultisite%26baseurl%3D%2Fetf%2Fcheck_mk%2F%26view%3D4 and for psconfig see https://psetf.aglt2.org/etf/check_mk/index.py?start_url=%2Fetf%2Fpnp4nagios%2Findex.php%2Fgraph%3Fhost%3Dpsconfig%26srv%3Dpsconfig_stats%26source%3D0%26theme%3Dmultisite%26baseurl%3D%2Fetf%2Fcheck_mk%2F%26view%3D4
I'm still working on it, but I would appreciate any help.
I'm working in the issue-214 branch, and the problem seems to be in meshconfig.js. I suspect it is in either the exports._process_published_config or the generate_group_members function. The problem might be caused by how the generate_group_members function is being called asynchronously.
I'm trying to rewrite the callbacks into promises (async/await) to make the code flow clearer, but this is proving to be a real pain.
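For reference, the general shape of that kind of conversion looks something like the sketch below. It assumes the callback-style functions follow the usual Node.js `(err, result)` convention, which is what `util.promisify` expects; `lookupGroup` and `buildGroupDetails` are hypothetical stand-ins, not the real functions in meshconfig.js.

```javascript
const { promisify } = require("util");

// Hypothetical callback-style function using the (err, result) convention.
function lookupGroup(id, callback) {
  setTimeout(() => callback(null, { id, hosts: [`host-${id}`] }), 5);
}

// Wrap it once so it returns a promise instead of taking a callback.
const lookupGroupAsync = promisify(lookupGroup);

// The lookups still run concurrently, but nothing is returned until every
// one of them has resolved. A rejection from any lookup rejects the whole call.
async function buildGroupDetails(ids) {
  const groups = await Promise.all(ids.map((id) => lookupGroupAsync(id)));
  const details = {};
  for (const group of groups) {
    details[group.id] = group;
  }
  return details;
}

// Usage:
// buildGroupDetails(["a", "b"]).then((details) => console.log(details));
```

The main benefit is that the point where the result is complete becomes explicit: the `await` on `Promise.all` cannot be passed until all lookups have finished.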
If you run the psconfig validate URL command against a PWA URL, you will get a validation error roughly 1 time in 10. For some reason, PWA occasionally returns different JSON for the same call. We have seen this in multiple different contexts. We need to identify what is causing the JSON to change and become invalid, and then determine the best way to fix it.