pinpoint-apm / pinpoint-c-agent

Pinpoint C Agent helps your monitor your PHP/PYTHON applications into [pinpoint-apm](https://github.com/pinpoint-apm/pinpoint).
http://pinpoint-apm.github.io/pinpoint/
Apache License 2.0
262 stars 77 forks source link

[StreamServerLayer.py:68] peer error:[Errno 104] Connection reset by peer #296

Closed marty-macfly closed 3 years ago

marty-macfly commented 3 years ago

I'm running a PHP cron job on Kubernetes which ending with the following trace:

[pinpoint] [1] [1]#55 pinpoint_end_trace Done!
--
  | [pinpoint] [1] [1]#128 add clues:330:20765.640625 KB
  | [pinpoint] [1] [1]this span:({"E":12325,"FT":1500,"S":1610711645350,"appid":"sid.ping-lost","appname":"sid","calls":[{"E":34,"S":142,"calls":[{"E":5,"S":165,"clues":["-1:SHOW FULL COLUMNS FROM `audit_entry`","21:SHOW FULL COLUMNS FROM `audit_entry`"],"dst":"mysql:host=uat.fr;port=3306;dbname=sid-uat","name":"yii\\db\\Command::internalExecute","stp":"2101"}],"clues":["-1:fetchAll","21:SHOW FULL COLUMNS FROM `audit_entry`"],"dst":"mysql:host=uat.fr;port=3306;dbname=sid-uat","name":"yii\\db\\Command::queryInternal","stp":"2101"},{"E":9,"S":178,"calls":[{"E":8,"S":179,"clues":["-1:SELECT\n    `kcu`.`CONSTRAINT_NAME` AS `constraint_name`,\n    `kcu`.`COLUMN_NAME` AS `column_name`,\n    `kcu`.`REFERENCED_TABLE_NAME` AS `referenced_table_name`,\n    `kcu`.`REFERENCED_COLUMN_NAME` AS `referenced_column_name`\nFROM `information_schema`.`REFERENTIAL_CONSTRAINTS` AS `rc`\nJOIN `information_schema`.`KEY_COLUMN_USAGE` AS `kcu` ON\n    (\n        `kcu`.`CONSTRAINT_CATALOG` = `rc`.`CONSTRAINT_CATALOG` OR\n        (`kcu`.`CONSTRAINT_CATALOG` IS NULL AND `rc`.`CONSTRAINT_CATALOG` IS NULL)\n    ) AND\n    `kcu`.`CONSTRAINT_SCHEMA` = `rc`.`CONSTRAINT_SCHEMA` AND\n    `kcu`.`CONSTRAINT_NAME` = `rc`.`CONSTRAINT_NAME`\nWHERE `rc`.`CONSTRAINT_SCHEMA` = database() AND `kcu`.`TABLE_SCHEMA` = database()\nAND `rc`.`TABLE_NAME` = 'audit_entry' AND `kcu`.`TABLE_NAME` = 'audit_entry'","21:SELECT\n    `kcu`.`CONSTRAINT_NAME` AS `constraint_name`,\n    `kcu`.`COLUMN_NAME` AS `column_name`,\n    `kcu`.`REFERENCED_TABLE_NAME` AS `referenced_table_name`,\n    `kcu`.`REFERENCED_COLUMN_NAME` AS `referenced_column_name`\nFROM `information_schema`.`REFERENTIAL_CONSTRAINTS` AS `rc`\nJOIN `information_schema`.`KEY_COLUMN_USAGE` AS `kcu` ON\n    (\n        `kcu`.`CONSTRAINT_CATALOG` = `rc`.`CONSTRAINT_CATALOG` OR\n        (`kcu`.`CONSTRAINT_CATALOG` IS NULL AND `rc`.`CONSTRAINT_CATALOG` IS NULL)\n    ) AND\n    `kcu`.`CONSTRAINT_SCHEMA` = `rc`.`CONSTRAINT_SCHEMA` AND\n    `kcu`.`CONSTRAINT_NAME` = `rc`.`CONSTRAINT_NAME`\nWHERE `rc`.`CONSTRAINT_SCHEMA` = database() AND `kcu`.`TABLE_SCHEMA` = database()\nAND `rc`.`TABLE_NAME` = 'audit_entry' AND `kcu`.`TABLE_NAME` = 'audit_entry'"],"dst":"mysql:host=uat..fr;port=3306;dbname=sid-uat","name":"yii\\db\\Command::internalExecute","stp":"2101"}],"clues":["-1:fetchAll","21:SELECT\n    `kcu`.`CONSTRAINT_NAME` AS `constraint_name`,\n    `kcu`.`COLUMN_NAME` AS `column_name`,\n    `kcu`.`REFERENCED_TABLE_NAME` AS `referenced_table_name`,\n    `kcu`.`REFERENCED_COLUMN_NAME` AS `referenced_column_name`\nFROM `information_schema`.`REFERENTIAL_CONSTRAINTS` AS `rc`\nJOIN `information_schema`.`KEY_COLUMN_USAGE` AS `kcu` ON\n    (\n        `kcu`.`CONSTRAINT_CATALOG` = `rc`.`CONSTRAINT_CATALOG` OR\n        (`kcu`.`CONSTRAINT_CATALOG` IS NULL AND `rc`.`CONSTRAINT_CATALOG` IS NULL)\n    ) AND\n    `kcu`.`CONSTRAINT_SCHEMA` = `rc`.`CONSTRAINT_SCHEMA` AND\n    `kcu`.`CONSTRAINT_NAME` = `rc`.`CONSTRAINT_NAME`\nWHERE `rc`.`CONSTRAINT_SCHEMA` = database() AND `kcu`.`TABLE_SCHEMA` = database()\nAND `rc`.`TABLE_NAME` = 'audit_entry' AND `kcu`.`TABLE_NAME` = 'audit_entry'"],"dst":"mysql:host=uat..fr;port=3306;dbname=sid-uat","name":"yii\\db\\Command::queryInternal","stp":"2101"},{"E":12031,"S":206,"calls":[{"E":185,"S":210,"calls":[{"E":18,"EXP":"Redis error: ERR This instance has cluster support disabled\nRedis command was: CLUSTER INFO","S":221,"calls":[{"E":15,"S":221,"calls":[{"E":2,"S":234,"calls":[{"E":0,"S":234,"name":"yii\\redis\\Connection::open"},{"E":2,"S":234,"name":"yii\\redis\\Connection::sendCommandInternal","stp":"1501"}],"clues":["-1:SELECT 3"],"name":"yii\\redis\\Connection::executeCommand","stp":"1501"}],"dst":"tcp://uat.fr:6379, database=3","name":"yii\\redis\\Connection::open","stp":"8200"},{"E":3,"EXP":"Redis error: ERR This instance has cluster support disabled\nRedis command was: CLUSTER INFO","S":236,"name":"yii\\redis\\Connection::sendCommandInternal","stp":"1501"}],"clues":["-1:CLUSTER INFO "],"name":"yii\\redis\\Connect
  | [pinpoint] [1] [1]agent try to connect:(tcp:pinpoint-collector-agent:8181)
  | [pinpoint] [1] [1]#128 pinpoint_end_trace Done!

On collector agent which is pinpoint-collector-agent:8181 v0.4.1, I've got the following:

[2021-01-15 11:54:17,683] [30] [INFO] [StreamServerLayer.py:141] new connect fd:8
[2021-01-15 11:54:17,683] [30] [DEBUG] [FrontAgent.py:90] send hello:8 len:59
[2021-01-15 11:54:17,684] [30] [DEBUG] [StreamServerLayer.py:97] send 59
[2021-01-15 11:54:17,704] [30] [WARNING] [StreamServerLayer.py:68] peer error:[Errno 104] Connection reset by peer
[2021-01-15 11:54:17,705] [30] [WARNING] [StreamServerLayer.py:168]  client:8 disconnected
[2021-01-15 11:54:17,705] [30] [DEBUG] [StreamServerLayer.py:74] close client:8(('10.131.0.1', 51576))

And not trace is sent to the UI.

marty-macfly commented 3 years ago

When I'm adding a sleep after the run of the php command in the cronjob, the message become the following:

[2021-01-15 12:28:24,806] [30] [INFO] [StreamServerLayer.py:141] new connect fd:8
[2021-01-15 12:28:24,814] [30] [DEBUG] [FrontAgent.py:90] send hello:8 len:59
[2021-01-15 12:28:24,814] [30] [WARNING] [StreamServerLayer.py:63] peer close socket
[2021-01-15 12:28:24,814] [30] [WARNING] [StreamServerLayer.py:168]  client:8 disconnected
[2021-01-15 12:28:24,814] [30] [DEBUG] [StreamServerLayer.py:74] close client:8(('10.131.0.1', 45766))

No more reset but still no trace in UI.

eeliu commented 3 years ago

@marty-macfly It looks like the php process was stopped without send all span.

Please share more detail about how does your "cron jop" works.

marty-macfly commented 3 years ago

it's a kubernetes cron job at the given time a container will be startup running the given command line here it's php yii .... the job run and at the end of the run of the php cli run, the container is removed. Collector agent is in another container and always running and listening on port 8080 for trace.

eeliu commented 3 years ago

It works in our env without changing everything.

But, you can

Try this

pinpoint_php.SendSpanTimeOutMs= 100 # send span with timeout 100ms

image

marty-macfly commented 3 years ago

Does it mean it will send a span every 100ms ? So if I enable report I should see something like that:

agent try to connect:(tcp:pinpoint-collector-agent:8181)
#128 pinpoint_end_trace Done!

Every 100ms ?

eeliu commented 3 years ago

Does it mean it will send a span every 100ms ? So if I enable report I should see something like that:

agent try to connect:(tcp:pinpoint-collector-agent:8181)
#128 pinpoint_end_trace Done!

Every 100ms ?

First, does it meet your requirement? Then, NO. 100ms is time for the waiting_event timeout.

marty-macfly commented 3 years ago

Doesn't change anything still same output of the report and the reset message on collector agent log.

eeliu commented 3 years ago

I found your container will be removed every time. Sorry to tell you, current agent can't support.

Before supporting this feature, there are some issues need your help

  1. does the script running in a new container every time? if not agent could go to normal at next time.

@marty-macfly

marty-macfly commented 3 years ago

Agent is in a dedicated container which is always running that why I'm not using socket in that case but tcp:port. So agent always running and listening on port 8181.

The cli with the php pinpoint extension is running in a container which is created and destroy every-time and is sending data to agent through tcp on port 8181.

eeliu commented 3 years ago

Agent is in a dedicated container which is always running that why I'm not using socket in that case but tcp:port. So agent always running and listening on port 8181.

The cli with the php pinpoint extension is running in a container which is created and destroy every-time and is sending data to agent through tcp on port 8181.

For your description, the collector-agent is OK. The plugins and pinpoint extension not support this feature.

Thank your feedback.

marty-macfly commented 3 years ago

Just a recap of our current setup:

php.ini configuration for pinpoint-php-module

extension=pinpoint_php.so
pinpoint_php.CollectorHost=tcp:pinpoint-collector-agent:8181
pinpoint_php.SendSpanTimeOutMs=100
pinpoint_php.TraceLimit=-1 

Configuration of collector agent is the following:

[Collector]
ServerType=1500
collector.grpc.agent.ip=pinpoint-collector.pinpoint-uat.svc
collector.grpc.agent.port=9991
collector.grpc.stat.ip=pinpoint-collector.pinpoint-uat.svc
collector.grpc.stat.port=9992
collector.grpc.span.ip=pinpoint-collector.pinpoint-uat.svc
collector.grpc.span.port=9993
collector.grpc.discardpolicy.maxpendingthreshold=10000

[Common]
# Apache webserver port
Web_Port=8080
Log_Level=INFO
LOG_DIR=/var/log/pinpoint-collector-agent

[Agent]
Address=0.0.0.0@8181

So if I understand properly in that setup it's pinpoint-php-module that doesn't support it ? Did I properly understood ? Thanks for your help and support.

marty-macfly commented 3 years ago

I'm updating that issue because I've found the following. We've seen that some run were not working as explain in previous comment here. But some use case were working for example in case of CLI running on the same container as the php-fpm process. Or when I've got long running CLI (workers). In both case I was using the agent version 0.4.2 in python with php extension sending data through unix socket.

We've update the agent to the Go version of the agent and some use case that was working before are not working anymore. For example long running CLI and CLI run inside the php-fpm container. Difference is we are now using tcp connection to send data from the PHP extension to the GO agent because Go agent is not supporting socket.

After some investigation I've seen the following when testing on my local laptop CLI tracing are still working but when testing on test environment where agent and PHP extension are on different host it's not working anymore but php-fpm tracing work. when I'm running CLI in debug I've seen the following difference from local and test env on loca:

CLI debug report is:

[pinpoint] [72] [72]this span:(....)
[pinpoint] [72] [72]agent try to connect:(tcp:192.168.254.1:8088)
[pinpoint] [72] [72]fd 5 send size 579
[pinpoint] [72] [72]#128 pinpoint_end_trace Done!

On test env were agent and CLI is not on the same host trace is the following:

[pinpoint] [72] [72]this span:(....)
[pinpoint] [72] [72]agent try to connect:(tcp:192.168.254.1:8088)
[pinpoint] [72] [72]#128 pinpoint_end_trace Done!

We can see that we don't have the fd... debug message, I've put some more debug in the php extension and I've found that the TransLayer::trans_layer_pool() is going into else statement which is comment with timeout do nothing

int retval = select(fd+1,&rfds,&wfds,&efds,&tv); so the return value of the select is 0, I've check the documentation of select

So select behavior is based on timeout the doc is saying the following

The timeout argument is a timeval structure (shown below) that specifies the interval that select() should block waiting for a file descriptor to become ready. ... If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.)

In our case timeout is define as follow:

struct timeval tv = {0,(int)timeout *1000};

So I add more debug and in fact timeout is NULL so both fields are 0 so select is not waiting and returns immediately. If I hardcode the timeval with struct timeval tv = {5,(int)timeout *1000}; select will wait for maximum 5 seconds and after timeout.

In that case my trace is properly send and my debug report for CLI on local and on test env is the same. In your first comment you said to set pinpoint_php.SendSpanTimeOutMs = 100 which seems to be the proper solution to make things work but when I've search into the code I don't see any declaration of that configuration element in the php ini entry list STD_PHP_INI_ENTRY and anywhere in the code so I'm wondering if that directive still usable ?

eeliu commented 3 years ago

@marty-macfly
please check #350

eeliu commented 3 years ago

pinpoint_php use asynchronous io, so the tv is {0,0}. It only send when available. I rollback pinpoint_php.SendSpanTimeOutMs feature: pinpoint_php.SendSpanTimeOutMs > 0 , waiting until ms expired or event callback.

marty-macfly commented 3 years ago

I'll test that, and revert to you

marty-macfly commented 3 years ago

pinpoint_php use asynchronous io, so the tv is {0,0}. It only send when available. I rollback pinpoint_php.SendSpanTimeOutMs feature: pinpoint_php.SendSpanTimeOutMs > 0 , waiting until ms expired or event callback.

May be in case of CLI when doing remote tcp connection the async io is not done before the process is ended ??

Anyway with the change all my use case are working, so for me that change is fine with new GO version of agent and php-fpm, long running process and cli.

So it would be great to have it in master :)

Thanks a lot

marty-macfly commented 3 years ago

I've seen the merge on master so I'm closing that issue, thanks.