perfsonar / owamp

A tool for performing one-way or two-way active measurements
Apache License 2.0
70 stars 30 forks source link

TWAMP - Question Invalid reflected sender sequence number #66

Open pierl2000 opened 4 years ago

pierl2000 commented 4 years ago

Hi, See below an example where the packet count is set to: -c 10 and the number of packets sent is not the same number. Could you please let me what's your opinion about it? I already looked at the code and still and I do not know the root cause.. It could be software either network udp.. Thank you

/usr/local/bin/twping -S aaa.bbb.ccc.ddd eee.fff.ggg.hhh:862 -c 10 Approximately 3.1 seconds until results available twping: FILE=endpoint.c, LINE=3633, Invalid reflected sender sequence number!

Directional delays may be inaccurate due to out of sync clocks!

--- twping statistics from [aaa.bbb.ccc.ddd]:9828 to [eee.fff.ggg.hhh]:9828 --- SID: 86828b48371a71e2fd4ae7db76a58962 first: 2020-05-21T15:18:16.942 last: 2020-05-21T15:18:18.062 6 sent, 0 lost (0.000%), 0 send duplicates, 6 reflect duplicates round-trip time min/median/max = 19/19/407 ms, (err=8.73 ms) send time min/median/max = 40.7/40.8/40.8 ms, (err=4.36 ms) reflect time min/median/max = -21.8/-21.8/366 ms, (err=4.36 ms) reflector processing time min/max = 0.000954/0.000954 ms two-way jitter = 0.1 ms (P95-P50) send jitter = 0.1 ms (P95-P50) reflect jitter = 0 ms (P95-P50) send hops = 4 (consistently) reflect hops = 4 (consistently)

mfeit-internet2 commented 4 years ago

The error message indicates that the responder sent back a sequence number greater than the highest one sent. Twping considers that an error and stops, which is why you didn't get the full set of sent packets.

If there are no problems, the full count of packets is sent:

me@psdev1$ twping xyzzy -c 10
Approximately 4.2 seconds until results available

--- twping statistics from [psdev1]:8963 to [xyzzy]:19405 ---
SID:    c6809719e27137475efefe6fb4f91704
first:  2020-05-21T17:22:16.687
last:   2020-05-21T17:22:17.331
10 sent, 10 lost (100.000%), 0 send duplicates, 0 reflect duplicates
round-trip time min/median/max = nan/nan/nan ms, (err=34.2 ms)
send time min/median/max = nan/nan/nan ms, (err=34.2 ms)
reflect time min/median/max = nan/nan/nan ms, (err=34.2 ms)
reflector processing time min/max = nan/nan ms
two-way jitter = nan ms (P95-P50)
send jitter = nan ms (P95-P50)
reflect jitter = nan ms (P95-P50)
send TTL not reported
reflect TTL not reported

What implementation of TWAMP is running on the far end?

pierl2000 commented 4 years ago

Hi, Thank you for the quick answer. Twampd from JunOs. twping Linux. I attached a screenshot from a trace that was configured with a -c 10. You will see in the trace that a stop session packet is sent by the client before the next seq answer packet from the server instance that we can see in the trace. I don't know why because time to time it pass 10/10 packets and some time it fails somewhere between the packet 1 and the packet 10. both source use same single ntp server. I thought it could be a udp packet being dropped but we see the next seq packet in the trace. So the packet are not dropped. Could it because a maximum possible estimated latency delay reached that triggers a stop control packet from the client to the server? A kind of timeout even if the latency seems low.. Thank you for your time. twamp_to_junos_session_stopped_truncate_202005211716

pierl2000 commented 4 years ago

Hi, I found something interesting, The tests are good when we slow down the test with the "-i 1e" or "-i 1f". I still do not understand why the stop control packet is sent out by the client prior to receive a reflected packet at anytime in a testing cycle. Thank you /usr/local/bin/twping -S a.b.c.d e.f.g.h:862 -i 1e -c10 Approximately 12.1 seconds until results available

Directional delays may be inaccurate due to out of sync clocks!

--- twping statistics from [a.b.c.d]:9692 to [e.f.g.h]:9692 --- SID: 86828b4806b971e280ef363767047d50 first: 2020-05-22T02:35:52.392 last: 2020-05-22T02:36:02.230 10 sent, 0 lost (0.000%), 0 send duplicates, 7 reflect duplicates round-trip time min/median/max = 19/19.1/1.56e+03 ms, (err=8.67 ms) send time min/median/max = -557/-557/-556 ms, (err=4.33 ms) reflect time min/median/max = 575/576/2.12e+03 ms, (err=4.33 ms) reflector processing time min/max = 0.000954/0.000954 ms two-way jitter = 0 ms (P95-P50) send jitter = 0.4 ms (P95-P50) reflect jitter = 0.2 ms (P95-P50) send hops = 4 (consistently) reflect hops = 4 (consistently)

/usr/local/bin/twping -S a.b.c.d e.f.g.h:862 -i 1f -c10 Approximately 12.1 seconds until results available

Directional delays may be inaccurate due to out of sync clocks!

--- twping statistics from [a.b.c.d]:8760 to [e.f.g.h]:8760 --- SID: 86828b4828b971e27555a076ff28e039 first: 2020-05-22T02:36:26.587 last: 2020-05-22T02:36:35.587 10 sent, 0 lost (0.000%), 0 send duplicates, 9 reflect duplicates round-trip time min/median/max = 18.9/19/1e+03 ms, (err=8.67 ms) send time min/median/max = -559/-558/-558 ms, (err=4.33 ms) reflect time min/median/max = 577/577/1.56e+03 ms, (err=4.33 ms) reflector processing time min/max = 0.000954/0.000954 ms two-way jitter = 0.1 ms (P95-P50) send jitter = 0.2 ms (P95-P50) reflect jitter = 0.3 ms (P95-P50) send hops = 4 (consistently) reflect hops = 4 (consistently)

/usr/local/bin/twping -S a.b.c.d e.f.g.h:862 -i 1.0e,0f -c20 Approximately 12.1 seconds until results available twping: FILE=stats.c, LINE=290, Invalid type 1 seq number request twping: FILE=stats.c, LINE=1384, IterateSummarizeTWSession: Unable to fetch reflect packet #16 twping: FILE=stats.c, LINE=1785, OWPStatsParse: iteration of twoway data records failed twping: FILE=owping.c, LINE=437, OWPStatsParse: failed twping: FILE=owping.c, LINE=2039, do_stats("from" session)

mfeit-internet2 commented 4 years ago

You will see in the trace that a stop session packet is sent by the client before the next seq answer packet from the server instance that we can see in the trace. I don't know why because time to time it pass 10/10 packets and some time it fails somewhere between the packet 1 and the packet 10. both source use same single ntp server.

This isn't a time sync, delay, drop or ordering issue. I've checked the twping source code, and the cause of the error message and subsequent termination of the test is arrival of a packet reflected by the server (your JunOS device) containing a sequence number that the client (twping) hasn't sent yet.

I still do not understand why the stop control packet is sent out by the client prior to receive a reflected packet at anytime in a testing cycle.

In the traffic dump, there are six packets sent forward and six sent back. Twping appears to have a problem with the sequence number in the fifth one returned, at which point it ends the test and sends a stop session in the process. The sixth sent packet was already in flight when this happened, but the server had probably turned it around and put the return volley back on the wire before it saw the stop session. Barring drops, that packet was going to show up on the client's network interface even if twping had already thrown in the towel.

It would be instructive to dump the contents of the measurement packets and examine the sequence numbers in both directions. My money is on six packets going out with sequential numbers, four returned with sequential numbers and a fifth that breaks the sequence. If that is the case, the fact that it happens intermittently at the default rate and not at all when you slow the rate says that you may be exercising a bug in the JunOS implementation.

pierl2000 commented 4 years ago

Hi Mark, Thank you again for your answer. I don't understand yet the issue because the sequence number order seems to be good however when we look at the second field after the sequence number field which is "Timestamp" we can see that the reflected packet timestamp is earlier by few milliseconds and maybe it could explain the issue. Both nodes are supposed to be sync on the exact same uniq NTP server.. I just anonymized the trace with TraceWrangler. Now I try to upload the pcap file on github. Thank you

pierl2000 commented 4 years ago

Hi Mark, Fyi I sent you in pv the trace I talked about. I continued to investigate the issue and I've another clue about the sync time issue I talked about.

See below a first example of the issue we talk about. Using the -v we can see the big negative fwd_delay. The second example that is stable and always 10/10 packets use another Linux box toward another JunOS. You will see the positive fw_delay close to 0. I go continue to investigate. I'm now wondering if I have an NTP client issue. I know that the two nodes are supposed to be configured to the same NTP server so I may need to review the ntp config to try to force it somehow to use a more common accurate time(ms) Thank you

/usr/local/bin/twping -S a.b.c.d e.f.g.h:862 -c10 -v Approximately 3.0 seconds until results available

--- twping statistics from [a.b.c.d]:9363 to [e.f.g.h]:9363 --- SID: xxxxxxxxx seq_no=0 fwd_delay=-267 ms bck_delay=275 ms delay=8.11 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=0 fwd_delay=-267 ms bck_delay=427 ms delay=160 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=1 fwd_delay=-267 ms bck_delay=275 ms delay=8.1 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=1 fwd_delay=-267 ms bck_delay=281 ms delay=14 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=2 fwd_delay=-267 ms bck_delay=275 ms delay=8.09 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=2 fwd_delay=-267 ms bck_delay=433 ms delay=166 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=3 fwd_delay=-267 ms bck_delay=275 ms delay=8.08 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=3 fwd_delay=-267 ms bck_delay=444 ms delay=177 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=4 fwd_delay=-267 ms bck_delay=275 ms delay=8.09 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=4 fwd_delay=-267 ms bck_delay=306 ms delay=39.3 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=5 fwd_delay=-267 ms bck_delay=275 ms delay=8.1 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=5 fwd_delay=-267 ms bck_delay=326 ms delay=59.6 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=6 fwd_delay=-267 ms bck_delay=275 ms delay=8.08 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=6 fwd_delay=-267 ms bck_delay=281 ms delay=13.8 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=7 fwd_delay=-267 ms bck_delay=275 ms delay=8.1 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=7 fwd_delay=-267 ms bck_delay=473 ms delay=206 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=8 fwd_delay=-267 ms bck_delay=275 ms delay=8.1 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=8 fwd_delay=-267 ms bck_delay=313 ms delay=46.4 ms proc_delay=0.000954 ms (err=1.05 ms) seq_no=9 fwd_delay=-267 ms bck_delay=275 ms delay=8.08 ms proc_delay=0.000954 ms (err=1.05 ms)

Another node that tests torward another JunOS. You will see the positive fwd_delay and it's close to 0 /usr/local/bin/twping -S i.j.k.l m.n.o.p:862 -c10 -v Approximately 3.0 seconds until results available

--- twping statistics from [i.j.k.l]:9147 to [m.n.o.p]:9147 --- SID: yyyyyyyyyyyyy seq_no=0 fwd_delay=0.832 ms bck_delay=-0.673 ms delay=0.158 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=0 fwd_delay=0.832 ms bck_delay=393 ms delay=393 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=1 fwd_delay=0.824 ms bck_delay=-0.675 ms delay=0.149 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=1 fwd_delay=0.824 ms bck_delay=16.9 ms delay=17.8 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=2 fwd_delay=0.848 ms bck_delay=-0.677 ms delay=0.172 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=2 fwd_delay=0.848 ms bck_delay=13.2 ms delay=14 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=3 fwd_delay=0.828 ms bck_delay=-0.676 ms delay=0.152 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=3 fwd_delay=0.828 ms bck_delay=216 ms delay=217 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=4 fwd_delay=0.84 ms bck_delay=-0.674 ms delay=0.166 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=4 fwd_delay=0.84 ms bck_delay=54.6 ms delay=55.4 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=5 fwd_delay=0.851 ms bck_delay=-0.675 ms delay=0.176 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=5 fwd_delay=0.851 ms bck_delay=164 ms delay=165 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=6 fwd_delay=0.848 ms bck_delay=-0.672 ms delay=0.176 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=6 fwd_delay=0.848 ms bck_delay=76.3 ms delay=77.1 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=7 fwd_delay=0.85 ms bck_delay=-0.673 ms delay=0.177 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=7 fwd_delay=0.85 ms bck_delay=136 ms delay=137 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=8 fwd_delay=0.844 ms bck_delay=-0.673 ms delay=0.171 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=8 fwd_delay=0.844 ms bck_delay=124 ms delay=125 ms proc_delay=0.000954 ms (err=0.587 ms) seq_no=9 fwd_delay=0.893 ms bck_delay=-0.745 ms delay=0.148 ms proc_delay=0.000954 ms (err=0.587 ms)

When we will fix the current issue the next issue will be to explain the delay spike. Maybe a kind of QOS on the JunOS could explain it..

pierl2000 commented 4 years ago

Hi, I now understand a part of the issue. This is a software issue. I don't know yet if this came from old Linux kernel of twping or both. The issue happens when twping is executed from an old Linux kernel (example 2.6.32 Redhat6) toward any twampd with a real network latency distanced by example 20ms of delays then twping start to be unstable and it randomly stop between any pair of packet(even at packet 0 exchange). The issue also creates duplicate packets and it also create a wrong max latency value. See below an example of the issue from RHEL6 or probably any other kernel around 2.6.32.

./twping -S a.b.c.d e.f.g.h -c10 -v Approximately 4.2 seconds until results available twping: FILE=endpoint.c, LINE=3633, Invalid reflected sender sequence number!

--- twping statistics from [a.b.c.d]:9443 to [e.f.g.h]:9443 --- SID: c0a8814be27a7a68744d59b1f46a5665 seq_no=0 fwd_delay=-321 ms bck_delay=352 ms delay=30.2 ms proc_delay=0.0901 ms (err=31.1 ms) seq_no=0 fwd_delay=-321 ms bck_delay=404 ms delay=82.5 ms proc_delay=0.0901 ms (err=31.1 ms) seq_no=1 fwd_delay=-322 ms bck_delay=355 ms delay=33.4 ms proc_delay=0.0167 ms (err=31.1 ms) seq_no=1 fwd_delay=-322 ms bck_delay=465 ms delay=143 ms proc_delay=0.0167 ms (err=31.1 ms) seq_no=2 fwd_delay=-321 ms bck_delay=361 ms delay=40.1 ms proc_delay=0.0157 ms (err=31.1 ms) seq_no=2 fwd_delay=-321 ms bck_delay=616 ms delay=295 ms proc_delay=0.0157 ms (err=31.1 ms) seq_no=3 fwd_delay=-322 ms bck_delay=350 ms delay=28 ms proc_delay=0.021 ms (err=31.1 ms) seq_no=3 fwd_delay=-322 ms bck_delay=431 ms delay=108 ms proc_delay=0.021 ms (err=31.1 ms) seq_no=4 fwd_delay=-322 ms bck_delay=368 ms delay=45.7 ms proc_delay=0.0162 ms (err=31.1 ms) seq_no=4 fwd_delay=-322 ms bck_delay=391 ms delay=68.9 ms proc_delay=0.0162 ms (err=31.1 ms) seq_no=5 fwd_delay=-322 ms bck_delay=349 ms delay=26.4 ms proc_delay=0.0162 ms (err=31.1 ms) seq_no=5 fwd_delay=-322 ms bck_delay=413 ms delay=91 ms proc_delay=0.0162 ms (err=31.1 ms)

Directional delays may be inaccurate due to out of sync clocks!

--- twping statistics from [a.b.c.d]:9443 to [e.f.g.h]:9443 --- SID: c0a8814be27a7a68744d59b1f46a5665 first: 2020-05-28T17:59:06.034 last: 2020-05-28T17:59:07.073 6 sent, 0 lost (0.000%), 0 send duplicates, 6 reflect duplicates round-trip time min/median/max = 26.4/30.2/295 ms, (err=31.1 ms) send time min/median/max = -322/-322/-321 ms, (err=15.5 ms) reflect time min/median/max = 349/352/616 ms, (err=15.5 ms) reflector processing time min/max = 0.0157/0.0901 ms two-way jitter = 15.6 ms (P95-P50) send jitter = 1.7 ms (P95-P50) reflect jitter = 16.6 ms (P95-P50) send hops = 1 (consistently) reflect hops = 1 (consistently)

I confirmed the issue don't happens when executed from kernel 4.18. I used the same physical box I used with my previous rhel6 test but that time with RHEL8. The results were perfect with kernel 4.14.

About the current issue with old kernel. I found a workaround that fix the 3 issues(the invalid sequence that initiate a stop control packet at anytime, the wrong maximum latency and the wrong duplicates value. The solution is to add the following: "-i 0f" take note that if you use something > 0 with the "-i" then the duplicate issue and the wrong max latency will happens again but the invalid sequence issue will stop. So -i 0f is the workaround.

See below an example using the work around from a rhel6 box using old kernel 2.6.32

./twping -S a.b.c.d e.f.g.h -c10 -v -i 0f Approximately 3.0 seconds until results available

--- twping statistics from [a.b.c.d]:9853 to [e.f.g.h]:9853 --- SID: c0a8814be27a7c2774dad06cbeb76928 seq_no=0 fwd_delay=-325 ms bck_delay=370 ms delay=45.5 ms proc_delay=0.16 ms (err=30.9 ms) seq_no=1 fwd_delay=-325 ms bck_delay=370 ms delay=45.6 ms proc_delay=0.0143 ms (err=30.9 ms) seq_no=2 fwd_delay=-325 ms bck_delay=370 ms delay=45.6 ms proc_delay=0.0129 ms (err=30.9 ms) seq_no=3 fwd_delay=-325 ms bck_delay=371 ms delay=45.9 ms proc_delay=0.0129 ms (err=30.9 ms) seq_no=4 fwd_delay=-325 ms bck_delay=371 ms delay=45.9 ms proc_delay=0.0129 ms (err=30.9 ms) seq_no=5 fwd_delay=-324 ms bck_delay=370 ms delay=45.9 ms proc_delay=0.0138 ms (err=30.9 ms) seq_no=6 fwd_delay=-324 ms bck_delay=370 ms delay=45.8 ms proc_delay=0.0129 ms (err=30.9 ms) seq_no=7 fwd_delay=-324 ms bck_delay=370 ms delay=45.8 ms proc_delay=0.0119 ms (err=30.9 ms) seq_no=8 fwd_delay=-324 ms bck_delay=370 ms delay=45.7 ms proc_delay=0.0129 ms (err=30.9 ms) seq_no=9 fwd_delay=-324 ms bck_delay=370 ms delay=45.7 ms proc_delay=0.0119 ms (err=30.9 ms)

Directional delays may be inaccurate due to out of sync clocks!

--- twping statistics from [a.b.c.d]:9853 to [e.f.g.h]:9853 --- SID: c0a8814be27a7c2774dad06cbeb76928 first: 2020-05-28T18:06:32.833 last: 2020-05-28T18:06:32.833 10 sent, 0 lost (0.000%), 0 send duplicates, 0 reflect duplicates round-trip time min/median/max = 45.5/45.8/45.9 ms, (err=30.9 ms) send time min/median/max = -325/-325/-324 ms, (err=15.4 ms) reflect time min/median/max = 370/370/371 ms, (err=15.4 ms) reflector processing time min/max = 0.0119/0.16 ms two-way jitter = 0.2 ms (P95-P50) send jitter = 0.8 ms (P95-P50) reflect jitter = 0.9 ms (P95-P50) send hops = 1 (consistently) reflect hops = 1 (consistently)

Thank you

mfeit-internet2 commented 4 years ago

Good sleuthing.

I had a look at the trace you sent, and there doesn't seem to be a path through the code that would cause the complaint about the sequence number. The code maintains a counter of the highest number it's sent out that's incremented immediately after the packets are sent. The code doing the complaining would have to be seeing a number greater than that, which would mean the value is getting clobbered somewhere or packets are being reflected from the future. :-)

It would be instructive to know what number twping is seeing beyond what the network saw. If you'd be amenable to compiling and trying a modified version of the code, adding the following at line 3633 (see link above) would do the trick:

printf("Unexpected sequence number %lu; highest sent was %lu\n", twdatarec.sent.seq_no, i);

We dropped support for EL6 in 2018, but if this is something unrelated to that, I'd like to know about it.

larsch commented 3 years ago

Confirmed on owamp 4.3.3:

twping: FILE=endpoint.c, LINE=3633, Invalid reflected sender sequence number!

Client running twping is armv7l (raspberry pi) while server running twampd is x86_64. Both running Linux 5.10. Capture: twping-failed.cap.gz

Reverse twping (from x86_64 to armv7l) works fine. Capture: twping-success.cap.gz

tritamk14 commented 3 years ago

Hi all, I have two Juniper devices to measure with TWAMP. How can I run this tool in a Linux server to sent start/stop test requests to Client-Sender? This means the server does not run a test session with the Juniper device, only one Juniper device receives that request and runs a test session with another Juniper device.

mfeit-internet2 commented 3 years ago

@tritamk14 Please don't add unrelated questions to issues. This would be a good topic for the perfsonar-user mailing list, which is the most-relevant thing we have for discussing twamp.