perfsonar / psconfig-web

pSConfig Web Admin
Apache License 2.0

PWA periodically returns invalid JSON for same query #214

Open arlake228 opened 3 years ago

arlake228 commented 3 years ago

If you run the psconfig validate URL command against a PWA URL, it seems you get a validation error roughly 1 time in 10. For some reason, PWA occasionally returns different JSON for the same call. We have seen this in multiple contexts:

  1. WLCG first noticed it because it was causing their hosts to recreate tests. A workaround has been added to the client that reads the URL, but the server-side problem should still be fixed.
  2. RNP independently saw the same issue during their own testing.

We need to identify what is causing the JSON to change and become invalid, and then determine the best way to fix it.

grigutis commented 3 years ago

So far, I haven't been successful in recreating this behavior on my development host. Is there any more information about the servers that are exhibiting this issue [e.g., host OS & version, version of PWA installed, installation method (docker or RPM)]?

DanielNeto commented 3 years ago

Hi @grigutis, I've seen this problem when I updated our PWA to the latest version here at RNP. We have a machine with CentOS 7.9 where we run the docker containers using this docker-compose file. I thought it was a bug related to the JSON file size, so I removed most tests and reduced the number of hosts in the mesh, but I still had the problem from time to time. I ended up rolling back to the previous version that was working.

grigutis commented 3 years ago

@DanielNeto I'm still not able to reproduce this. Would you be able to get some logs for me? This should do it:

docker logs -f --since 0m docker_pwa-pub1_1 > ~/pub.log & \
docker logs -f --since 0m docker_pwa-admin1_1 > ~/admin.log & \
docker logs -f --since 0m docker_mongo_1 > ~/mongo.log & \
docker logs -f --since 0m docker_nginx_1 > ~/nginx.log &

Run that, reproduce the problem, then you can kill those jobs and attach the logs to this issue.

grigutis commented 3 years ago

@DanielNeto Actually, maybe logs won't be necessary after all. I finally was able to reproduce this. It is only appearing when configs have tests that use a disjoint topology.

grigutis commented 3 years ago

Thanks to a user in Slack, I now know how to reliably reproduce this problem. It apparently only occurs when the app is under load. For example:

$ ab -n 100 -c 2 https://psconfig.opensciencegrid.org/pub/config/opn-all

and while that is going on, do

$ for i in `seq 1 10` ; do curl -s https://psconfig.opensciencegrid.org/pub/config/opn-all | wc -c ; done

If it's working correctly, you should see the same byte count for all 10 iterations. If it's not, you won't.
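The byte-count check above can miss two different responses that happen to have the same length. As a rough sketch, the same comparison could be done on content hashes from Node.js. The helper name checkConsistency and its fetchBody parameter are hypothetical and not part of PWA; fetchBody is assumed to be any async function that returns the response body as a string.

```javascript
const crypto = require('crypto');

// Hypothetical helper: call `fetchBody` n times in sequence and group the
// responses by SHA-256 hash, so differing payloads show up as extra variants.
async function checkConsistency(fetchBody, n) {
  const seen = new Map(); // hash -> { bytes, count }
  for (let i = 0; i < n; i++) {
    const body = await fetchBody();
    const hash = crypto.createHash('sha256').update(body).digest('hex');
    const entry = seen.get(hash) || { bytes: Buffer.byteLength(body), count: 0 };
    entry.count += 1;
    seen.set(hash, entry);
  }
  // One variant means every response was byte-for-byte identical.
  return { consistent: seen.size === 1, variants: [...seen.values()] };
}
```

On Node 18 or later, where a global fetch is available, fetchBody could be something like () => fetch(url).then((r) => r.text()), run while the ab load test is in progress.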

I've also been reading a book about Node.js Design Patterns and came across something that sounds like it might be what is causing this issue.

One of the most dangerous situations is to have an API that behaves synchronously under certain conditions and asynchronously under others. … The bug that you've just seen can be extremely complicated to identify and reproduce in a real application. Imagine using a similar function in a web server, where there can be multiple concurrent requests. Imagine seeing some of those requests hanging, without any apparent reason and without any error being logged. This can definitely be considered a nasty defect.

I think the problem lies somewhere here, but that's just a hunch. I see that promise is being overridden in Mongoose, but not sure if that has anything to do with it yet.
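For readers unfamiliar with the hazard the quoted passage describes, here is a minimal sketch of a function that calls back synchronously on one path and asynchronously on another, next to a variant that always defers. This is not taken from the PWA code, and nothing here confirms it is the cause of this issue. The names inconsistentLookup, consistentLookup, and load are made up for illustration, and load is assumed to be asynchronous.

```javascript
const cache = new Map();

// Unsafe shape: on a cache hit the callback runs before this function
// returns; on a miss it runs later, once `load` (assumed async) completes.
function inconsistentLookup(key, load, cb) {
  if (cache.has(key)) {
    cb(cache.get(key)); // synchronous path
  } else {
    load(key, (value) => { // asynchronous path
      cache.set(key, value);
      cb(value);
    });
  }
}

// Safer shape: the cache hit is deferred with process.nextTick, so the
// callback never runs before the caller's current code has finished.
function consistentLookup(key, load, cb) {
  if (cache.has(key)) {
    process.nextTick(() => cb(cache.get(key)));
  } else {
    load(key, (value) => {
      cache.set(key, value);
      cb(value);
    });
  }
}
```

With the first version, code that registers listeners or sets up state after the call may or may not run before the callback, depending on whether the key was cached. That dependence on timing is why such bugs tend to show up only under concurrent load.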

grigutis commented 2 years ago

Just to give some more details about this …

A colleague and I took a deeper look at this and when the issue appears, the host_groups_details variable is not being fully populated before the psconfig JSON object is returned.

We're not sure exactly where the error is happening, because the nested async functions and anonymous callbacks make the code very confusing to follow, but in general, the flow goes like this (all in meshconfig.js):

exports.generate -> exports._process_published_config -> async.eachSeries -> async.parallel -> generate_group_members -> resolve_hostgroup

We made several attempts to fix the problem, but none were successful, and we came to the conclusion that rewriting the "/config/:url" route from scratch was probably the best way forward.
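To illustrate the general class of bug, where a result is handed back before nested callbacks have finished filling it in, here is a simplified sketch. It is not the meshconfig.js code, and the actual failure path there is still unknown. resolveGroup, buildConfigTooEarly, and buildConfigWhenReady are hypothetical names; only host_groups_details comes from the description above.

```javascript
// Hypothetical stand-in for a per-group lookup such as a database query.
function resolveGroup(id, cb) {
  setTimeout(() => cb(null, { id, members: [`host-${id}`] }), 5);
}

// Buggy shape: the loop only starts the lookups, then the result is
// handed back before any lookup callback has had a chance to run.
function buildConfigTooEarly(ids, done) {
  const details = {};
  ids.forEach((id) => resolveGroup(id, (err, group) => {
    if (!err) details[id] = group;
  }));
  done(null, { host_groups_details: details }); // still empty at this point
}

// Fixed shape: count completions and call back only after the last one.
function buildConfigWhenReady(ids, done) {
  const details = {};
  let pending = ids.length;
  let failed = false;
  if (pending === 0) return done(null, { host_groups_details: details });
  ids.forEach((id) => resolveGroup(id, (err, group) => {
    if (failed) return; // an error was already reported
    if (err) { failed = true; return done(err); }
    details[id] = group;
    pending -= 1;
    if (pending === 0) done(null, { host_groups_details: details });
  }));
}
```

In a real server the lookups complete at varying speeds, so an early return can capture anything from an empty to a nearly complete object, which would be consistent with the output changing from request to request under load.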

ShawnMcKee commented 2 years ago

Just wondering about the status on this. For OSG/WLCG we are worried that "variable" configs coming from PWA could be part of the problems we are seeing. We can track how often this is occurring using our CheckMK monitoring. For psconfig-itb see https://psetf.aglt2.org/etf/check_mk/index.py?start_url=%2Fetf%2Fpnp4nagios%2Findex.php%2Fgraph%3Fhost%3Dpsconfig-itb%26srv%3Dpsconfig-itb_stats%26theme%3Dmultisite%26baseurl%3D%2Fetf%2Fcheck_mk%2F%26view%3D4 and for psconfig see https://psetf.aglt2.org/etf/check_mk/index.py?start_url=%2Fetf%2Fpnp4nagios%2Findex.php%2Fgraph%3Fhost%3Dpsconfig%26srv%3Dpsconfig_stats%26source%3D0%26theme%3Dmultisite%26baseurl%3D%2Fetf%2Fcheck_mk%2F%26view%3D4

grigutis commented 2 years ago

I'm still working on it, but I would appreciate any help.

I'm working in the issue-214 branch, and the problem seems to be in meshconfig.js. I suspect it is in either the exports._process_published_config or the generate_group_members function. The problem might be caused by how generate_group_members is being called asynchronously.

I'm trying to rewrite the callbacks into promises (async/await) to make the code flow clearer, but this is proving to be a real pain.
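As a rough sketch of what such a conversion can look like, Node's util.promisify wraps an error-first callback function so it can be awaited, and Promise.all waits for every lookup before the result is assembled. The function bodies below are invented for illustration: resolveHostgroupCb only stands in for a callback-style function like resolve_hostgroup, and collectGroupDetails is a hypothetical name.

```javascript
const { promisify } = require('util');

// Hypothetical callback-style lookup with the usual (err, result) signature.
function resolveHostgroupCb(id, cb) {
  setTimeout(() => cb(null, [`host-${id}-a`, `host-${id}-b`]), 5);
}

// promisify turns the error-first callback API into one that returns a promise.
const resolveHostgroup = promisify(resolveHostgroupCb);

// Resolve all groups concurrently; the function's promise settles only
// after every lookup has finished (or rejects on the first error).
async function collectGroupDetails(groupIds) {
  const entries = await Promise.all(
    groupIds.map(async (id) => [id, await resolveHostgroup(id)])
  );
  return Object.fromEntries(entries);
}
```

A route handler could then await collectGroupDetails(...) before serializing the response, which makes the "wait for everything first" step explicit instead of relying on callback ordering.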