Closed: kostjerry closed this issue 2 years ago
Hi @kostjerry, I could not identify where your configuration to fetch the backtrack messages is.
It seems that you are missing the configuration to actually retrieve the saved messages, and the delay you are observing is in fact the time to receive the first message published after the subscriber connects.
Can you try two tests?
First, as soon as you connect a client, publish a message with known content and check whether this shortens the delay, and whether you receive the message you just published or the old one.
Second, check the configuration in this example and see if it solves your issue.
If you are still not able to get your saved messages, please send me a curl example of the message you are publishing; then I can reproduce it here with your Docker setup and help you with the configuration.
Hi @wandenberg! Thanks for the quick reply! I have set the backtrack in push_stream_channels_path as a .b1 suffix. Isn't that sufficient to deliver historical messages to the client on connection?
Sure, I'll try some configurations with push_stream_last_received_message_time and push_stream_last_received_message_tag and get back here with the results.
Thanks!
I did the first test and it seems that there is some queuing. Are there internal queues in the module? Or any related nginx queues?
https://user-images.githubusercontent.com/3878623/147820458-9254eda1-5d5a-4ec7-bdbe-072a32b37fe1.mov
@kostjerry There must be something else in your setup causing this behavior.
I used the data you shared to build the application and it is working as expected.
Here are the files and the commands I used. Please, try to use them and see if it also works for you.
If so, slowly migrate the rest of your configuration to this setup to see where the bottleneck is.
Dockerfile
FROM nginx:1.19.6-alpine
RUN apk update \
&& apk upgrade \
&& apk add --update alpine-sdk \
&& apk add --update logrotate \
&& apk add --no-cache openssl \
&& apk add --no-cache bash \
&& apk add --no-cache gcc \
&& apk add --no-cache libc-dev \
&& apk add --no-cache pcre-dev \
&& apk add --no-cache openssl-dev \
&& apk add --no-cache zlib-dev
RUN apk add --no-cache curl
WORKDIR /tmp
RUN wget https://nginx.org/download/nginx-1.19.6.tar.gz \
&& wget https://github.com/wandenberg/nginx-push-stream-module/archive/0.5.4.tar.gz \
&& tar -xf nginx-1.19.6.tar.gz \
&& tar -xf 0.5.4.tar.gz
RUN cd nginx-1.19.6 \
&& ./configure --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --with-perl_modules_path=/usr/lib/perl5/vendor_perl --user=nginx --group=nginx --with-compat --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-Os -fomit-frame-pointer' --with-ld-opt=-Wl,--as-needed --add-dynamic-module=../nginx-push-stream-module-0.5.4 \
&& make modules \
&& cp objs/ngx_http_push_stream_module.so /usr/lib/nginx/modules/
index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title></title>
</head>
<body>
<script>
var sse = new EventSource("http://localhost:8081/sub/gps_53bc47ac0498da59ffac91f4bc9ddc26____", {
withCredentials: true
});
var first = true;
sse.onopen = function() {
console.log('channel opened');
console.time('delay');
};
sse.onerror = function(e) {
console.log('error', e, sse);
};
sse.onmessage = function(e) {
console.log('message received');
if (first) {
console.timeEnd('delay');
first = false;
console.time('messageGap');
}
else {
console.timeEnd('messageGap');
console.time('messageGap');
}
};
</script>
</body>
</html>
nginx.conf
worker_processes 4;
worker_rlimit_nofile 200000;
pid /run/nginx.pid;
load_module modules/ngx_http_push_stream_module.so;
events {
worker_connections 200000;
multi_accept on;
use epoll;
}
http {
include mime.types;
default_type application/octet-stream;
access_log off;
server_tokens off;
client_body_in_file_only off;
push_stream_shared_memory_size 4096M;
push_stream_max_messages_stored_per_channel 2;
push_stream_message_ttl 5m;
keepalive_requests 10000;
keepalive_timeout 20;
ssl_session_cache shared:SSL:10m;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
reset_timedout_connection on;
client_body_in_single_buffer on;
types_hash_max_size 2048;
server_names_hash_bucket_size 256;
client_max_body_size 5m;
client_body_buffer_size 256k;
client_header_buffer_size 32k;
large_client_header_buffers 4 32k;
client_body_timeout 180s;
client_header_timeout 180s;
resolver 127.0.0.1;
resolver_timeout 10s;
gzip on;
gzip_comp_level 5;
gzip_min_length 1000;
gzip_http_version 1.1;
gzip_proxied expired no-cache no-store private auth;
gzip_types application/x-javascript text/css image/png;
open_file_cache off; # Disabled for issue 619
charset UTF-8;
server {
listen 81;
charset utf-8;
location ~ /sub/([^/]+)/?([^/]+)?/?([^/]+)?/?([^/]+)?/?([^/]+)?/?([^/]+)?$ {
push_stream_subscriber eventsource;
push_stream_channels_path $1_$2_$3_$4_$5_$6.b1;
push_stream_authorized_channels_only on;
add_header Access-Control-Allow-Origin $http_origin;
add_header Access-Control-Allow-Credentials true;
}
}
server {
listen 80;
location ~ /pub/([^/]+)/?([^/]+)?/?([^/]+)?/?([^/]+)?/?([^/]+)?/?([^/]+)?$ {
push_stream_publisher admin;
push_stream_channels_path $1_$2_$3_$4_$5_$6;
push_stream_store_messages on;
}
location /status {
push_stream_channels_statistics;
push_stream_channels_path $arg_id;
}
location / {
root /usr/share/nginx/html;
}
}
}
command to build and run the container
docker build . -t push_stream_issue_297
docker run --rm --name nginx297 -p 8080:80 -p 8081:81 -v $PWD/index.html:/usr/share/nginx/html/index.html -v $PWD/nginx.conf:/etc/nginx/nginx.conf:ro -it push_stream_issue_297
Once the container is running:
I published a message with curl localhost:8080/pub/gps_53bc47ac0498da59ffac91f4bc9ddc26____ -d 'test'
Then I opened a Chrome browser at http://localhost:8080/
And published another message with curl localhost:8080/pub/gps_53bc47ac0498da59ffac91f4bc9ddc26____ -d 'test1'
Both messages arrived as expected, as you can see in the console.
Please let me know the result of reproducing my steps, so we have a common place to start from.
@wandenberg, I think I'm beginning to understand the problem.
Your example works well because it does not generate high load. I created a repository that makes it possible to generate high load: https://github.com/kostjerry/nginx-push-stream-perf-test
In my case, 500 subscribers per channel with a message size of 300k published every 5 seconds seems to fill the entire capacity of the module.
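For a rough sense of scale, this load implies substantial egress per channel. A back-of-the-envelope sketch (it assumes "300k" means 300,000 bytes):

```python
# Rough egress estimate for the reported load.
subscribers = 500        # subscribers per channel
message_bytes = 300_000  # assumed message size ("300k")
interval_s = 5           # publish interval

bytes_per_s = subscribers * message_bytes / interval_s  # fan-out per channel
mbit_per_s = bytes_per_s * 8 / 1_000_000
print(f"{mbit_per_s:.0f} Mbit/s per channel")  # 240 Mbit/s per channel
```

A few such channels would already saturate a gigabit link, independent of anything the module does.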
@kostjerry for sure the number of subscribers will change the time required to spread the message to all of them, but this is especially true when you are working in a virtualized environment like Docker.
You will need to tune the number of real CPUs available to Docker, the number of nginx workers, and the relation between them.
Think of the module as if each channel were a queue: if all your subscribers are on the same queue, the first one will receive the message quickly and the last one will take longer.
When you have more than one real worker, the queue is split across them, and so are the subscribers; the time required to deliver the message is then reduced.
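The queue-splitting effect described above can be sketched with a toy model (the per-subscriber write cost is a made-up number, purely illustrative):

```python
import math

# Toy model of fan-out: each worker writes to its share of a channel's
# subscribers sequentially; workers run in parallel, so the slowest
# (largest-share) worker bounds the delivery time.
def delivery_time_ms(subscribers: int, workers: int, per_write_ms: float = 0.5) -> float:
    per_worker = math.ceil(subscribers / workers)  # largest share of subscribers
    return per_worker * per_write_ms

print(delivery_time_ms(500, 1))  # all on one worker: 250.0 ms
print(delivery_time_ms(500, 4))  # spread over four workers: 62.5 ms
```

The point of the sketch: the last subscriber's latency scales with the largest per-worker share, so an even spread of subscribers across workers matters as much as the worker count.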
I had good benchmarks with the module working on non-virtualized hardware.
So please, if you really need to run it inside Docker, consider reviewing the CPU and memory settings to avoid swapping and CPU context switches. https://docs.docker.com/config/containers/resource_constraints/
Hi @wandenberg! Thanks for the suggestion. I've double-checked the resources available in the Docker container running nginx, and I'm sure all CPUs are available there and swap is not used. Also, no other process that could use the CPUs is running on the test machine.
Could you give more information about how the module works internally? Imagine a situation with two channels, each with 500 subscribers, and 2 CPUs with 2 nginx workers available. When I publish a message on both channels at the same time, what exactly happens inside the module?
Scenario 1: each worker utilizes a dedicated CPU. The 1st worker takes the 1st channel and loops synchronously over its 500 subscribers; the 2nd worker takes the 2nd channel and loops synchronously over its 500 subscribers.
Scenario 2: each worker utilizes a dedicated CPU. A combined queue is created with 1000 jobs ("deliver the message to the 1st subscriber of the 1st channel", "deliver the message to the 1st subscriber of the 2nd channel", "deliver the message to the 2nd subscriber of the 1st channel", ...). Each nginx worker then independently takes the next job from the queue and executes it.
Scenario 1 does not use the full capacity of multi-CPU systems if there is an asymmetric distribution of subscribers across channels. In my case, the distribution is as follows:
Sorry, I have tested it and it seems that scenario 2 is implemented. The subscribers seem to be distributed across CPUs by some weighting; once the distribution is done, each subscriber stays pinned to one CPU. I'll test again on the machine with 64 CPUs and collect metrics on the subscriber distribution across cores.
Hi @wandenberg !
I did some tests on a machine with 64 CPU cores. Here are the module statistics:
{
"hostname": "fadef73312f2",
"time": "2022-01-03T10:00:47",
"channels": 108,
"wildcard_channels": 0,
"published_messages": 1814,
"stored_messages": 216,
"messages_in_trash": 102,
"channels_in_delete": 0,
"channels_in_trash": 0,
"subscribers": 1683,
"uptime": 1063,
"by_worker": [
{
"pid": "22",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "23",
"subscribers": 1329,
"uptime": 1063
},
{
"pid": "24",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "25",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "26",
"subscribers": 15,
"uptime": 1063
},
{
"pid": "27",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "28",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "29",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "30",
"subscribers": 287,
"uptime": 1063
},
{
"pid": "31",
"subscribers": 1,
"uptime": 1063
},
{
"pid": "32",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "33",
"subscribers": 41,
"uptime": 1063
},
{
"pid": "34",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "35",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "36",
"subscribers": 1,
"uptime": 1063
},
{
"pid": "37",
"subscribers": 2,
"uptime": 1063
},
{
"pid": "38",
"subscribers": 1,
"uptime": 1063
},
{
"pid": "39",
"subscribers": 5,
"uptime": 1063
},
{
"pid": "40",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "41",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "42",
"subscribers": 1,
"uptime": 1063
},
{
"pid": "43",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "44",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "45",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "46",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "47",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "48",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "49",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "50",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "51",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "52",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "53",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "54",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "55",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "56",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "57",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "58",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "59",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "60",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "61",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "62",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "63",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "64",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "65",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "66",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "67",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "68",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "69",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "70",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "71",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "72",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "73",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "74",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "75",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "76",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "77",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "78",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "79",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "80",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "81",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "82",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "83",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "84",
"subscribers": 0,
"uptime": 1063
},
{
"pid": "85",
"subscribers": 0,
"uptime": 1063
}
]
}
I was also considering core unavailability in the Docker container, but there are a few things that make me reject this explanation:
As you can see, the vast majority of subscribers are located on one worker, while the others remain almost empty. Do you have any suggestions as to why this behavior may occur, or what other tests I can do to clarify the situation?
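To quantify the skew, one can compute the busiest worker's share from the /status output (figures below are abridged to the non-zero workers from the JSON above):

```python
# Per-worker subscriber counts taken from the pasted /status output,
# abridged to the workers with non-zero subscribers.
stats = {
    "subscribers": 1683,
    "by_worker": [
        {"pid": "23", "subscribers": 1329},
        {"pid": "26", "subscribers": 15},
        {"pid": "30", "subscribers": 287},
        {"pid": "33", "subscribers": 41},
    ],
}

busiest = max(stats["by_worker"], key=lambda w: w["subscribers"])
share = busiest["subscribers"] / stats["subscribers"]
print(f"worker {busiest['pid']} holds {share:.0%} of all subscribers")
```

So a single worker out of 64 holds roughly four out of five subscribers, which matches the toy-model intuition that delivery time is bounded by the largest per-worker share.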
Many thanks for your support!
Hi @wandenberg,
we have found the cause of the issue: it was network bandwidth. After we increased it, the issue disappeared.
I also want to draw your attention to something I discovered during testing: the strange distribution of subscribers among workers.
However, I'm closing the ticket because the issue turned out to be unrelated to the module.
Thank you for a wonderful module!
P.S. I just found https://github.com/wandenberg/nginx-push-stream-module/issues/288. The reuseport directive corrected the subscriber distribution!
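For anyone hitting the same distribution issue: based on that linked issue, the change is adding reuseport to the subscriber listen directive (untested sketch against the config above):

```nginx
server {
    listen 81 reuseport;  # kernel balances new connections across workers
    # ... rest of the subscriber server block unchanged ...
}
```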
Hi!
We're experiencing significant delays in the delivery of the first backtrack message after the channel is opened. The number of subscribers seems to matter.
Please find the details below.
The machine configuration is 4 CPUs and 16 GB of RAM. nginx runs in a Docker container. The nginx version is 1.19.6 and the module version is 0.5.4. Here is the nginx installation part of the Dockerfile:
Here are the overall statistics at the moment the delay is happening:
Statistics of the problematic channel at the moment the delay is happening:
nginx configuration:
nginx metrics seem to be fine: https://user-images.githubusercontent.com/3878623/147809686-649cf4f3-18bb-4d78-9710-1b37f849217e.png
The client side uses the EventSource class in JavaScript. Here is the sample script:
The delay varies from 10 to 30 seconds. When the total number of subscribers is about 1000, the delay doesn't exceed 1 second. Messages are published to the channel at intervals of 1 to 10 seconds (on average every 6 seconds). The message size is about 350k. Sample script output:
It seems that not all messages are even delivered, because sometimes the messageGap can be about 100 seconds, while I'm absolutely sure that the backend sends messages without interruption.
Thanks for any help.