peering-manager / peering-manager

BGP sessions management tool
https://peering-manager.net
Apache License 2.0

Generating configuration against large as-set causes proxy error #122

Closed jlixfeld closed 4 years ago

jlixfeld commented 5 years ago

Environment

Steps to Reproduce

Click Configuration on an IX that has a peer with a large AS-SET (e.g. AS-HURRICANE), using the peering-manager defaults and the simple template example:

policy-options {
    {%- for group in peering_groups %}
    {%- for asn in group.peers %}
    prefix-list ipv{{ group.ip_version }}-as{{ asn }} {
        {%- for p in prefix_list(asn, group.ip_version) %}
        {{ p.prefix }};
        {%- endfor %}
    }
    {%- endfor %}
    {%- endfor %}
}

Expected Behavior

Observed Behavior

After 30 seconds, 502 Proxy Error is thrown:

Proxy Error

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /internet-exchanges/ixp-torix-test/configuration/.

Reason: Error reading from remote server

Running bgpq3 from command line takes 17 seconds and returns ~95,000 lines.

root@peeringmanager:/opt/peering-manager/logs# time bgpq3 -h rr.ntt.net -S RIPE,APNIC,AFRINIC,ARIN,NTTCOM,ALTDB,BBOI,BELL,JPIRR,LEVEL3,RADB,RGNET,SAVVIS,TC -4 -A -j -r 8 -R 24 -l prefix_list AS-HURRICANE | wc -l
94725

real    0m17.202s
user    0m1.320s
sys 0m0.248s
root@peeringmanager:/opt/peering-manager/logs#

adamgent commented 5 years ago

There will be another problem with this configuration too: it won't commit on Junos because the maximum number of prefix-list entries (85,325) is exceeded. Probably one for another issue, but it will crop up as soon as the timeout is fixed!

ggiesen commented 5 years ago

I'm also seeing this problem with a large number (78) of smaller AS-SETs (I removed the very large ones, those with more than 2,000 prefixes).

gmazoyer commented 5 years ago

This issue is probably caused by the timeout of the WSGI process and not the code itself.

The code is still responsible in the sense that it takes longer to execute than the WSGI process allows. Fixing that means speeding up the code, but since it just spawns a bgpq3 process and waits for its answer, I'm not sure how to proceed without some massive caching mechanism.

To summarize, whatever the length of the AS-SET, it's the bgpq3 execution time that will cause the issue.

A "fix" could be to increase the timeout of the WSGI process. In this way it will wait a little bit longer for the feedback from the python code before getting killed.

Regarding the number of entries in a prefix-list for JUNOS I guess this can be addressed in the template itself by controlling the for loop.
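One way to read "controlling the for loop" is to split a long prefix list into several lists that each stay under the platform limit. The sketch below shows that idea in plain Python, not in the template language itself. The function name `chunk_prefixes` is hypothetical, and the 85,325 limit is simply the figure quoted earlier in this thread, so check it against your own platform.

```python
from typing import Iterator, List

# Figure quoted earlier in this thread for Junos prefix-list entries.
# Treat it as an assumption, not a verified platform constant.
MAX_ENTRIES = 85325


def chunk_prefixes(prefixes: List[str], size: int = MAX_ENTRIES) -> Iterator[List[str]]:
    """Yield consecutive slices of `prefixes`, each at most `size` entries long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(prefixes), size):
        yield prefixes[start:start + size]


if __name__ == "__main__":
    # Documentation-range addresses and a private-use ASN, for illustration only.
    sample = [f"192.0.2.{i}/32" for i in range(5)]
    for index, chunk in enumerate(chunk_prefixes(sample, size=2), start=1):
        print(f"prefix-list ipv4-as64500-part{index} has {len(chunk)} entries")
```

Inside a Jinja2 template, the built-in `batch` filter can probably serve the same purpose, but any policy that referenced the original single list would then have to reference every part.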

ggiesen commented 5 years ago

I can confirm that setting:

timeout = 300

in gunicorn_config.py resolves the issue for me.
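For anyone looking for the shape of that file, here is a minimal sketch of a `gunicorn_config.py`. Only `timeout = 300` comes from this thread. The `bind` and `workers` values are placeholders I am assuming for illustration, so keep whatever your installation already uses.

```python
# gunicorn_config.py (sketch; a Gunicorn config file is plain Python)

# Assumed placeholder: address and port the proxy forwards to.
bind = "127.0.0.1:8001"

# Assumed placeholder: number of worker processes.
workers = 4

# From this thread: a worker may spend up to 300 seconds on a request before
# Gunicorn kills and restarts it. Gunicorn's default is 30 seconds, which
# matches the 30-second failure reported above.
timeout = 300
```

The proxy in front of Gunicorn has its own timeout as well, so if the error comes back at a different interval after this change, that setting is the next place to look.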

netaviator commented 4 years ago

We are having the same issue too. That's why we currently update the prefix-lists referenced in our policies via an Ansible module.

gmazoyer commented 4 years ago

Combining the WSGI timeout increase, caching prefixes in the database to minimize network I/O, and using Redis as a caching mechanism should make this issue bearable.

To summarize: to minimize the time it takes to generate a config and to avoid errors, you can:

  1. Set the timeout of the WSGI process to something bigger, such as 5 minutes or even more depending on your needs
  2. Use the built-in command to store prefixes in the database and avoid connections to whois servers
  3. Use Redis as a cache (which will be available in 1.2.0) to speed up data retrieval by avoiding replaying recent SQL queries
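To illustrate why caching helps here, the sketch below wraps a slow prefix lookup in a small in-memory cache with a time-to-live. This is not how peering-manager implements it (the project stores prefixes in the database and uses Redis). The names `PrefixCache` and `run_bgpq3` are hypothetical, the bgpq3 flags are a subset of the command shown earlier in this thread, and the JSON shape assumed for `-j` output should be checked against your bgpq3 version.

```python
import json
import subprocess
import time
from typing import Callable, Dict, List, Tuple


def run_bgpq3(as_set: str, ip_version: int = 4) -> List[dict]:
    """Run bgpq3 (must be on PATH) and return the entries of its JSON output.

    Assumes `-j -l prefix_list` prints an object keyed by "prefix_list".
    """
    command = ["bgpq3", "-h", "rr.ntt.net", f"-{ip_version}", "-A", "-j",
               "-l", "prefix_list", as_set]
    output = subprocess.check_output(command, text=True)
    return json.loads(output)["prefix_list"]


class PrefixCache:
    """Remember lookup results per (AS-SET, IP version) for a limited time."""

    def __init__(self, fetch: Callable[[str, int], List[dict]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, int], Tuple[float, List[dict]]] = {}

    def get(self, as_set: str, ip_version: int = 4) -> List[dict]:
        key = (as_set, ip_version)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough: skip the slow lookup
        prefixes = self._fetch(as_set, ip_version)
        self._entries[key] = (now, prefixes)
        return prefixes


if __name__ == "__main__":
    cache = PrefixCache(run_bgpq3, ttl_seconds=3600)
    first = cache.get("AS-HURRICANE")   # slow: spawns bgpq3
    second = cache.get("AS-HURRICANE")  # fast: served from memory
    print(len(first), len(second))
```

An in-memory cache like this lives inside a single worker process, so each WSGI worker would pay for the first lookup on its own. A shared store such as the database or Redis avoids that.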