ghaerr / elks

Embeddable Linux Kernel Subset - Linux for 8086

ktcp messages #877

Closed toncho11 closed 3 years ago

toncho11 commented 3 years ago

ktcp prints error messages all the time. This actually prevents me from working with ELKS, because I am interrupted in the middle of a command.

example: tcp: Refusing packet ... :9564 -> 1025

toncho11 commented 3 years ago

Or maybe it is useful because it tells me "BAD CHECKSUM", but then I do not know if it is a real problem or not. I am doing FTP file transfers of ELKS images. Even with these messages the image still boots OK.

ghaerr commented 3 years ago

For various reasons, but mostly because ELKS TCP/IP has only just become usable, a number of important errors are always displayed, such as refused packets or bad checksums. It was deemed important to know when these happen, so the messages can't currently be turned off. They can alert you that something may be wrong with the network card, or that TCP/IP has stopped functioning properly. Each error may have several possible causes.

I presume the errors all happen when ELKS is actually performing a network task, rather than when sitting idle. If so, that is by design; if not, then perhaps something strange is occurring that needs to be looked into further.

After v0.4.0, we need to consider which errors ktcp should report when running in the background, which should fail silently, and which should be folded in with the many other debug messages. ktcp doesn't yet support error levels.

> Even with these messages the image still boots OK.

A better way would be to compare multiple transfers with cmp, as a block could be dropped and still have the kernel boot (in theory). @Mellvik is still chasing down some hard-to-duplicate transfer errors, although I think most occur only when using telnetd (telnet inbound to ELKS).

toncho11 commented 3 years ago

Messages occur even without transfer. These are: tcp: Refusing packet ...

I made 15 transfers of fd360.bin in a row. There were bad checksum messages. One file was corrupted on the filesystem level on fat32. Windows chkdsk created 123kb in "FOUND.000". The 14 other images were identical with the original. Tested with windows fc /b command.

toncho11 commented 3 years ago

I ran a new test and got a kernel error (screenshot: IMG_20201121_182031).

It also said FAT: delete past EOF# when I deleted the downloaded files.

ghaerr commented 3 years ago

> Messages occur even without transfer. These are: tcp: Refusing packet ...

These shouldn't happen, unless a previous TCP connection took place, then terminated early for some reason, and then an extra "stray" packet came in after ktcp closed the TCP connection. That's essentially what the message means: a packet was received for a TCP endpoint that is no longer open.
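To make the meaning of that message concrete, here is a minimal Python model of the idea. It is only a sketch: ktcp is written in C and its real data structures are not shown in this thread, so `open_endpoints`, `open_connection`, `close_connection` and `deliver` are hypothetical names, the address `10.0.0.2` is made up, and the message format is only approximated from the example at the top of the issue.

```python
# Hypothetical model of the "Refusing packet" case, not ktcp's actual code.
# An endpoint is identified here by (remote address, remote port, local port).
open_endpoints = set()

def open_connection(remote_addr, remote_port, local_port):
    open_endpoints.add((remote_addr, remote_port, local_port))

def close_connection(remote_addr, remote_port, local_port):
    open_endpoints.discard((remote_addr, remote_port, local_port))

def deliver(remote_addr, remote_port, local_port):
    """Return a string describing what happens to an incoming segment."""
    if (remote_addr, remote_port, local_port) in open_endpoints:
        return "accepted"
    # No matching endpoint: the segment is a stray from an earlier connection.
    return f"tcp: Refusing packet {remote_addr}:{remote_port} -> {local_port}"

open_connection("10.0.0.2", 9564, 1025)
print(deliver("10.0.0.2", 9564, 1025))    # endpoint is still open
close_connection("10.0.0.2", 9564, 1025)
print(deliver("10.0.0.2", 9564, 1025))    # stray segment after the close
```

The second call models the situation described above: the peer still believes the connection exists, so it keeps sending, and the receiver has nothing to hand the segment to.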

> I made 15 transfers of fd360.bin in a row. There were bad checksum messages.

One on each, or 15 all on one transfer? All that will matter when this finally gets analyzed so that it can be duplicated and then debugged. We need to be very precise in our descriptions of errors, or it becomes too hard for me to guess or see what might be wrong.

> One file was corrupted on the filesystem level on fat32. Windows chkdsk created 123kb in "FOUND.000".

Hard to say whether this additional problem is because of ktcp perhaps not closing the file properly, a sync not being performed, or perhaps a FAT32 problem.

All of these will have to be debugged carefully, starting by eliminating extra variables: using Minix instead of FAT32, trying to reproduce the problem consistently, looking at the bad checksum within the larger context of the TCP debug output, and so on.

We'll keep this open to document these new issues, while hopefully you can eliminate variables and we can get to the bottom of it. This won't be fixed in v0.4.0.

Thank you!

Mellvik commented 3 years ago

Hi @toncho11, it does indeed look like you've triggered a bug somewhere in the network stack. It would be very interesting to see a tcpdump -v of that transfer. Can you arrange that?

Beating the system up with ping -s would also be great, in particular with sizes around the number on your screen, 240-290. Ping tests the integrity of the transfer by comparing what comes back with what was sent. If there is a low-level problem (IP or MAC level), ping will detect it.

--Mellvik


toncho11 commented 3 years ago

This is the test script, run on ELKS, that fetches an image from my local FTP server running Ubuntu Linux:

urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_1.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_2.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_3.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_4.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_5.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_6.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_7.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_8.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_9.bin
urlget ftp://user:pass@192.168.1.34:21/image/fd360.bin > /root/fd360_10.bin

You can try it.

Mellvik commented 3 years ago

> Messages occur even without transfer. These are: tcp: Refusing packet ...

> These shouldn't happen, unless a previous TCP connection took place, then terminated early for some reason, then an extra "stray" packet came in after ktcp closed the TCP connection. That's essentially what the message means: a packet was received for a TCP endpoint that is no longer open.

Like @ghaerr says, this may be OK. I get this all the time when testing, usually after a reboot, because the other side is still trying to revive a broken connection. In any case, tcpdump will explain what's going on.

> I made 15 transfers of fd360.bin in a row. There were bad checksum messages.

> One on each, or 15 all on one transfer? All that will matter when this finally gets analyzed so that it can be duplicated and then debugged. We need to be very precise in our descriptions of errors, or it becomes too hard for me to guess or see what might be wrong.

I've been using urlget to transfer kernels and utilities into ELKS since late summer and have never seen a checksum error or a bad or missed packet. I suspect this is a driver-level issue, as explained in the previous post.

-Mellvik

toncho11 commented 3 years ago

So the connection is not properly closed. The other side still does not know it has already been closed. Checksum errors are probably specific to the SMC driver.

Mellvik commented 3 years ago

> So the connection is not properly closed. The other side still does not know it has already been closed.

That's it, and now that this part of the stack is stable, this message is most likely on @ghaerr's removal list.

> Checksum errors are probably specific to the SMC driver.

I think so too, and with you being one of only two users with that particular card, your help in debugging would be very valuable: the tcpdumps I mentioned, and also, when you compare an error-free image with one that has errors, please use the cmp command to find the byte number where the first error occurs. Do that on several images to see whether the number is consistent. Further, as @ghaerr pointed out, some specifics about your test results would be valuable, such as whether the number of checksum errors is always the same. I understand that out of 15 transfers, one generates checksum errors and the others don't. Is that correct?
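For anyone without cmp at hand, the same check can be sketched in a few lines of Python. This is an illustration only; `first_difference` and `compare_files` are hypothetical helper names, and the whole file is read into memory on the assumption that a floppy image is small.

```python
def first_difference(a: bytes, b: bytes):
    """Return the 0-based offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))    # one input is a prefix of the other
    return None

def compare_files(path_a, path_b):
    """Compare two files on disk, e.g. the original image and a downloaded copy."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return first_difference(fa.read(), fb.read())

# Small in-memory demonstration with one corrupted byte at offset 700.
good = bytes(range(256)) * 4
bad = good[:700] + b"\xff" + good[701:]
print(first_difference(good, bad))
```

Note that cmp numbers bytes starting from 1, while this sketch returns a 0-based offset, so the two differ by one. Running it on several bad images shows whether the first error always lands at the same offset.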

--Mellvik

toncho11 commented 3 years ago

The images seem binary identical, so there is no need to find the first error. But it still prints this message on several transfers.

How do you start tcpdump running in the background and logging to a file?

tcpdump -v 2>&1 >./tcpdump.log &

Is this correct?

Mellvik commented 3 years ago

The thing is, if there is a checksum error, the packet is dropped, and thus the content ignored.

So even the size of the resulting file should be different if the transfer reports CHECKSUM errors. If this does not happen, what you're seeing is bogus packets, which is of course possible but would sound like a weird condition at the driver level.
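As a rough illustration of the drop-on-bad-checksum behaviour, here is a Python sketch of the 16-bit ones' complement checksum described in RFC 1071. It is deliberately simplified: a real TCP checksum also covers a pseudo-header, and `receive` is a hypothetical stand-in for the receive path, not ktcp's code.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def receive(segment: bytes, stored_checksum: int):
    """Return the segment if its checksum matches, else None (segment dropped)."""
    if internet_checksum(segment) != stored_checksum:
        return None
    return segment

payload = bytes(range(256))
checksum = internet_checksum(payload)
corrupted = payload[:210] + b"\xff" * 46     # tail overwritten with 0xff bytes

print(receive(payload, checksum) is not None)    # intact segment is kept
print(receive(corrupted, checksum) is None)      # damaged segment is discarded
```

A discarded segment never reaches the file, which is why a transfer that reports checksum errors yet still produces a correct file points at extra bogus packets rather than at damaged data.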

So if you could look further into this, it would be appreciated.

—Mellvik


toncho11 commented 3 years ago

tcpdump -v 2>&1 >./tcpdump.log &

Is this correct?

File size is correct in general.

Mellvik commented 3 years ago

Yes, this is correct, but 'tcpdump -w file' is even better. It creates a binary file containing raw packets, which we can feed into tcpdump later and analyse completely.
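To give an idea of what that binary file looks like, here is a small Python sketch that counts the packet records in a capture. It assumes the classic pcap layout (a 24-byte global header followed by records with a 16-byte header each) with microsecond timestamps; pcapng and nanosecond captures are not handled, and `count_packets` is a hypothetical helper. In practice, reading the file back with `tcpdump -r` is the normal route.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4    # classic pcap with microsecond timestamps

def count_packets(blob: bytes) -> int:
    """Count the packet records in a classic pcap capture held in memory."""
    if struct.unpack_from("<I", blob, 0)[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack_from(">I", blob, 0)[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")
    offset = 24                        # skip the global header
    count = 0
    while offset + 16 <= len(blob):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", blob, offset)
        offset += 16 + incl_len        # record header plus captured bytes
        count += 1
    return count

# Build a tiny synthetic capture: global header plus three 4-byte "packets".
header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
record = struct.pack("<IIII", 0, 0, 4, 4) + b"abcd"
capture = header + record * 3
print(count_packets(capture))
```

Because every record stores the raw captured bytes, nothing is lost compared with the text output of `tcpdump -v`, and the level of detail can be chosen later.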

By the way, test tcpdump -v 'manually' first, to make sure the command picks up the correct default interface. Otherwise add '-i <interface>', like '-i eth0'.

Thank you.

-M


toncho11 commented 3 years ago

OK... so this is tcpdump on the Ubuntu computer, not in ELKS. I thought there was already a tcpdump (or a version of it) in ELKS.

toncho11 commented 3 years ago

Here it is:

tcpdump2.log

192.168.1.38 is ELKS receiving a file over FTP. 192.168.1.34 is Ubuntu sending the file.

There were about 40 "bad checksum" messages (LEN 275) printed by ktcp.

Mellvik commented 3 years ago

Yes, that's right. ELKS does not have a tcpdump (yet), and although the TCP and IP debugging mechanisms would take us a long way, the output is hard to capture (unless you have a serial console) and the process would be slow. Also, tcpdump on Linux provides a lot of detail…

I’m guessing there are no other hosts on your network segment, otherwise some filtering might be necessary.

—Mellvik


toncho11 commented 3 years ago

If I press Ctrl+C during urlget ftp://, I get the prompt back, but the transfer keeps going. It does not stop: tcpdump keeps printing.

So I now used a filtered tcpdump command:

$ sudo tcpdump -i enp0s25 -w ./tcpdump3.log host 192.168.1.38

(192.168.1.38 being the ELKS machine). Attached: tcpdump3.log (https://github.com/jbruchon/elks/files/5579524/tcpdump3.log)

This time there were 4 bad checksum messages, len 275, with addresses: (0xfe65) (0xff00) (0xff00) (0x62bd)

Usually the first (after boot) download has no errors. It is always the second one that starts to give errors.

Mellvik commented 3 years ago

Yes, these are good dumps! Thank you.

Are any of them from a transfer that gave checksum errors?

—Mellvik


toncho11 commented 3 years ago

Yes. There were these messages. Also note my previous post: it is always the second download. The file also seems OK; I just compared it with the original one. It is not the same download as in the dumps, but I still have the impression that the problem is the message itself rather than a real checksum error.

Mellvik commented 3 years ago

Great @toncho11 - and just to make sure I have understood you correctly:

Are we on? I’ll take a close look at this as soon as I can (hopefully later today).

—Mellvik


toncho11 commented 3 years ago

The files contain only one download each, with 40 and 4 messages respectively. After a reboot, the first download usually has no error messages. Most later downloads usually do. Ctrl-C does not seem to stop the download, even though it returns the prompt. The message count varies, but it is often 4 or around 40. I do the comparison on Windows with the fc /b command. I have never seen a difference (not entirely true, but for now it is).

Mellvik commented 3 years ago

OK, thanks. It would be very helpful to have a dump of the first transfer after boot, the one with no errors.

Also, if you start tcpdump (dumping to your screen), then start a transfer, abort with ^C, and get the prompt, for how long does the transfer seem to continue, and when it stops, what is the size of the resulting file?

-Mellvik


Mellvik commented 3 years ago

@toncho11, just to make sure we're heading in the right direction - could you add the following command to your script and post the result:

sum *.bin

If you get checksum errors and all 10 outputs from sum are the same, we know we have a case of phantom packets. If you also manage to keep track of the number of checksum errors per download, it would be even more helpful.

Thanks.

—Mellvik


toncho11 commented 3 years ago

OK. This time there were bad checksum messages even on the first download.

I repeated this 5 times: downloading, erasing, summing. There were always many messages, around 40-50 per download. But the sum for the file was always the same: 28879 720. And the file size is always reported correctly: 368 640.

Phantom packets or bad message (checksum is not calculated correctly ...)
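The two figures reported here agree with each other: 720 blocks of 512 bytes is 368,640 bytes, which is also 360 KiB, the size of a 360K floppy image. The Python sketch below shows that arithmetic together with the classic BSD-style rotating checksum. Whether the ELKS `sum` uses exactly this algorithm is an assumption on my part, as is the 512-byte block size; `bsd_sum` is a hypothetical name.

```python
def bsd_sum(data: bytes, block_size: int = 512):
    """Return (checksum, blocks) in the style of the classic BSD `sum` utility."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit checksum right by one bit, then add the next byte.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    blocks = -(-len(data) // block_size)     # ceiling division
    return checksum, blocks

image_size = 368640                          # size reported in the thread
print(image_size == 360 * 1024 == 720 * 512)

# A zero-filled stand-in for the image, only to show the block count.
print(bsd_sum(bytes(image_size)))

# Any single changed byte gives a different checksum for these small inputs.
print(bsd_sum(b"a")[0], bsd_sum(b"b")[0])
```

Identical output for every download therefore means the stored files are almost certainly the same, so the checksum messages cannot correspond to data that ended up in the file.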

Mellvik commented 3 years ago

Thanks @toncho11,

I'm assuming the sequence was download-sum-erase - right? :-)

If so, your transfer is indeed fine, and what we're looking for is a bug that causes bogus packets to be passed from (most likely) the driver to IP and onward. The checksum calculation in ELKS (ktcp) is well proven by now, and as I said before, a TCP checksum error causes the packet to be discarded, so the situation seems pretty clear. It would be desirable to put printk calls at various points in the code to get more details, but since you don't have a serial console, the output would just scroll off the screen. I'm going to check the logs for oddities that may trigger this, and possibly take a hard look at the driver, unless @pawosm-arm comes to the rescue.

What may be useful from your end is running ping from the Linux machine to see if you can provoke errors. I may have mentioned it before, something like ping -i 0.4 -s 247

… or some other packet size in that range.

That will allow you to keep the output and scan for errors.

—mellvik


toncho11 commented 3 years ago

OK, this is the output of ping:

283 bytes from 192.168.1.38: icmp_seq=31 ttl=64 time=9.87 ms
283 bytes from 192.168.1.38: icmp_seq=32 ttl=64 time=9.87 ms
283 bytes from 192.168.1.38: icmp_seq=33 ttl=64 time=9.86 ms
283 bytes from 192.168.1.38: icmp_seq=34 ttl=64 time=9.85 ms
283 bytes from 192.168.1.38: icmp_seq=35 ttl=64 time=9.94 ms
283 bytes from 192.168.1.38: icmp_seq=36 ttl=64 time=9.93 ms
283 bytes from 192.168.1.38: icmp_seq=37 ttl=64 time=9.96 ms
283 bytes from 192.168.1.38: icmp_seq=38 ttl=64 time=9.87 ms
wrong data byte #210 should be 0xd2 but was 0xff
#8      8 9 a b c d e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27
#40     28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47
#72     48 49 4a 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64 65 66 67
#104    68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f 80 81 82 83 84 85 86 87
#136    88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 a7
#168    a8 a9 aa ab ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2 c3 c4 c5 c6 c7
#200    c8 c9 ca cb cc cd ce cf d0 d1 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#232    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#264    ff ff ff ff ff ff ff ff ff ff ff
283 bytes from 192.168.1.38: icmp_seq=39 ttl=64 time=9.86 ms
283 bytes from 192.168.1.38: icmp_seq=40 ttl=64 time=9.88 ms
283 bytes from 192.168.1.38: icmp_seq=41 ttl=64 time=9.65 ms
283 bytes from 192.168.1.38: icmp_seq=42 ttl=64 time=9.84 ms
283 bytes from 192.168.1.38: icmp_seq=43 ttl=64 time=9.95 ms
283 bytes from 192.168.1.38: icmp_seq=44 ttl=64 time=9.93 ms
283 bytes from 192.168.1.38: icmp_seq=45 ttl=64 time=9.92 ms
283 bytes from 192.168.1.38: icmp_seq=46 ttl=64 time=9.87 ms
283 bytes from 192.168.1.38: icmp_seq=47 ttl=64 time=9.87 ms
283 bytes from 192.168.1.38: icmp_seq=48 ttl=64 time=9.90 ms
283 bytes from 192.168.1.38: icmp_seq=49 ttl=64 time=9.62 ms
283 bytes from 192.168.1.38: icmp_seq=50 ttl=64 time=9.84 ms

And from another one:

255 bytes from 192.168.1.38: icmp_seq=47 ttl=64 time=9.19 ms
255 bytes from 192.168.1.38: icmp_seq=48 ttl=64 time=9.41 ms
255 bytes from 192.168.1.38: icmp_seq=49 ttl=64 time=9.42 ms
255 bytes from 192.168.1.38: icmp_seq=50 ttl=64 time=9.51 ms
255 bytes from 192.168.1.38: icmp_seq=51 ttl=64 time=9.48 ms
wrong data byte #210 should be 0xd2 but was 0xff
#8      8 9 a b c d e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27
#40     28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47
#72     48 49 4a 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64 65 66 67
#104    68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f 80 81 82 83 84 85 86 87
#136    88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 a7
#168    a8 a9 aa ab ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2 c3 c4 c5 c6 c7
#200    c8 c9 ca cb cc cd ce cf d0 d1 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#232    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
255 bytes from 192.168.1.38: icmp_seq=52 ttl=64 time=9.44 ms
255 bytes from 192.168.1.38: icmp_seq=53 ttl=64 time=9.47 ms
255 bytes from 192.168.1.38: icmp_seq=54 ttl=64 time=9.43 ms
255 bytes from 192.168.1.38: icmp_seq=55 ttl=64 time=9.41 ms
255 bytes from 192.168.1.38: icmp_seq=56 ttl=64 time=9.39 ms
255 bytes from 192.168.1.38: icmp_seq=57 ttl=64 time=9.45 ms
255 bytes from 192.168.1.38: icmp_seq=58 ttl=64 time=9.39 ms
255 bytes from 192.168.1.38: icmp_seq=59 ttl=64 time=9.19 ms
255 bytes from 192.168.1.38: icmp_seq=60 ttl=64 time=9.40 ms

No messages in ELKS during the ping.
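For readers unfamiliar with this ping output: the payload is an incrementing byte pattern, so data byte number n should hold n modulo 256, and ping reports the first byte that does not. The Python sketch below reproduces that check. It assumes the first 8 bytes of the payload hold ping's timestamp (the dumps above start at byte #8), and `make_payload` and `first_wrong_byte` are hypothetical names.

```python
def make_payload(size: int) -> bytes:
    """Build the incrementing test pattern: byte n holds n modulo 256."""
    return bytes(i & 0xFF for i in range(size))

def first_wrong_byte(payload: bytes, start: int = 8):
    """Return (index, expected, actual) for the first bad pattern byte, or None."""
    for i in range(start, len(payload)):
        expected = i & 0xFF
        if payload[i] != expected:
            return i, expected, payload[i]
    return None

size = 275                                   # "283 bytes" minus the 8-byte ICMP header
good = make_payload(size)
bad = good[:210] + b"\xff" * (size - 210)    # tail replaced by 0xff, as in the dump

print(first_wrong_byte(good))
index, expected, actual = first_wrong_byte(bad)
print(f"wrong data byte #{index} should be {expected:#x} but was {actual:#x}")
```

The useful detail is that the corruption starts at the same byte, #210, in both runs even though the packet sizes differ, which suggests a fixed boundary rather than random damage.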

Mellvik commented 3 years ago

Beautiful, @toncho11 -

exactly what we needed - this is definitely happening in the driver, and now we know what to look for.

If you'd like, there are a couple of packet sizes that might help narrow it down. Test with size 200; that should give you no errors at all.

Then test with, say, 1200 or even 1400; that would be interesting.

Thank you.

-M


Mellvik commented 3 years ago

Hi again @toncho11,

there is indeed a bug in the driver, in the wd_pack_get routine.

It does not handle the ring buffer wrap-around condition when a packet spans more than one page (256 bytes). In those cases, where the packet straddles the wrap-around point, the driver returns garbage as the trailing part of the packet.

The header will always be OK, as it is at the beginning of the first page of the packet, and the first 210 bytes will always be OK.
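The wrap-around handling can be sketched in Python as follows. This is a model, not the driver: the real routine is C code reading NIC memory, the ring here starts at offset 0 for simplicity, and the 0xff fill merely imitates the garbage seen in the ping dumps. The header sizes are my assumptions (a 4-byte receive header as stored by 8390-style NICs, then 14 bytes Ethernet, 20 bytes IP, 8 bytes ICMP); with them, the second page begins exactly at ping data byte #210, which matches the dumps but is offered only as a plausibility check.

```python
PAGE = 256                       # ring page size stated above

# Assumed sizes of everything stored in front of the ping data:
NIC_HEADER = 4                   # per-packet receive header (assumption)
ETHERNET, IP, ICMP = 14, 20, 8
first_data_byte_on_second_page = PAGE - (NIC_HEADER + ETHERNET + IP + ICMP)

def read_wrapping(ring: bytes, start: int, length: int) -> bytes:
    """Copy `length` bytes from `start`, continuing at offset 0 when the ring ends."""
    first = min(length, len(ring) - start)
    return ring[start:start + first] + ring[:length - first]

def read_without_wrap(ring: bytes, start: int, length: int, junk: int = 0xFF) -> bytes:
    """Model of the bug: bytes past the end of the ring come back as `junk`."""
    chunk = ring[start:start + length]
    return chunk + bytes([junk]) * (length - len(chunk))

# Four-page ring holding a 300-byte packet that starts in the last page.
packet = bytes(i & 0xFF for i in range(300))
buf = bytearray(4 * PAGE)
start = 3 * PAGE
buf[start:] = packet[:PAGE]              # first page of the packet
buf[:300 - PAGE] = packet[PAGE:]         # remainder wraps to the ring start
ring = bytes(buf)

print(first_data_byte_on_second_page)
print(read_wrapping(ring, start, 300) == packet)
print(read_without_wrap(ring, start, 300)[PAGE:] == b"\xff" * 44)
```

The fix amounts to doing the copy in two parts, as `read_wrapping` does, whenever the packet's end lies beyond the last page of the ring.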

Anyway this is an easy fix, @pawosm-arm - would you take a stab at it?

—Mellvik


toncho11 commented 3 years ago

Yes, no problem with size 200.

I have attached a file with the output with size 270. There is some pattern. The error is every 13th ping. pinglog1.txt

With 1200 I get:

wrong data byte #210 should be 0xd2 but was 0xff
#8      8 9 a b c d e f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27
#40     28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47
#72     48 49 4a 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64 65 66 67
#104    68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c 7d 7e 7f 80 81 82 83 84 85 86 87
#136    88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 a7
#168    a8 a9 aa ab ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2 c3 c4 c5 c6 c7
#200    c8 c9 ca cb cc cd ce cf d0 d1 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#232    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#264    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#296    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#328    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#360    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#392    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#424    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#456    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#488    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#520    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#552    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#584    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#616    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#648    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#680    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#712    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#744    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#776    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#808    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#840    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#872    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#904    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#936    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#968    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#1000   ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#1032   ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#1064   ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#1096   ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#1128   ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#1160   ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
#1192   ff ff ff ff ff ff ff ff

So such a big size is unsupported.
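For readers following the dumps, here is a minimal sketch of the kind of check that produces a "wrong data byte" line. It assumes, based only on the output above, that the byte at offset i is expected to equal the low 8 bits of i and that checking starts at offset 8. The names `first_bad_byte`, `report_payload` and `PATTERN_START` are hypothetical and not taken from ping or ELKS.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed offset where the pattern starts; the dumps above begin at #8. */
#define PATTERN_START 8

/*
 * Return the offset of the first byte that deviates from the assumed
 * pattern (byte at offset i == i & 0xff), or -1 if the payload is intact.
 */
long first_bad_byte(const unsigned char *payload, size_t len)
{
    size_t i;

    for (i = PATTERN_START; i < len; i++) {
        if (payload[i] != (unsigned char)(i & 0xff))
            return (long)i;
    }
    return -1;
}

/* Print a diagnostic in the same style as the ping output above. */
void report_payload(const unsigned char *payload, size_t len)
{
    long bad = first_bad_byte(payload, len);

    if (bad < 0)
        printf("payload ok (%lu bytes)\n", (unsigned long)len);
    else
        printf("wrong data byte #%ld should be 0x%x but was 0x%x\n",
               bad, (unsigned)(bad & 0xff), (unsigned)payload[bad]);
}
```

Under these assumptions, a payload that is correct up to offset 209 and filled with 0xff from offset 210 onward is reported at offset 210, which matches both dumps in this thread.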

pawosm-arm commented 3 years ago

Anyway this is an easy fix, @pawosm-arm - would you take a stab at it?

@ghaerr I can look at this later this week.

ghaerr commented 3 years ago

@Mellvik, Your ability to debug this without looking at a line of code is impressive! Nice work :)

And @toncho11, I think we're seeing utility in leaving the ktcp debug messages in for a while... they are proving their worth!

Thank you!

toncho11 commented 3 years ago

The tcp: Refusing packet ... is particularly annoying because it interrupts commands or whatever you are doing ... Anyway I can disable it in my build.

ghaerr commented 3 years ago

The tcp: Refusing packet ... is particularly annoying because it interrupts commands or whatever you are doing ...

Actually, another option could be to write these messages to another virtual terminal, such as /dev/tty2 (accessed via Alt-F2), though I worry they may never be seen there. We could also perhaps put a "marker" in the top right corner of the screen, much like the current "disk access wheel", to indicate that an error message was generated.
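A rough sketch of the first option, using only standard POSIX calls. This is not how ktcp currently reports errors; the function name `emit_error` and the fall-back-to-stderr behaviour are assumptions made for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write msg to the device at path (e.g. "/dev/tty2"). If the device cannot
 * be opened, fall back to stderr so the message is not silently lost.
 * Returns 0 if written to the device, 1 if the fallback was used,
 * -1 if the write itself failed or was short.
 */
int emit_error(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    int fd = open(path, O_WRONLY | O_NOCTTY);
    int used_fallback = 0;
    ssize_t n;

    if (fd < 0) {
        fd = STDERR_FILENO;
        used_fallback = 1;
    }
    n = write(fd, msg, len);
    if (!used_fallback)
        close(fd);              /* never close stderr */
    if (n != (ssize_t)len)
        return -1;
    return used_fallback;
}
```

A call such as `emit_error("/dev/tty2", "tcp: Refusing packet\n")` would keep the message off the active console; whether anyone then looks at tty2 is the open question raised above.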

pawosm-arm commented 3 years ago

ktcp prints error messages all the time. This actually prevents me from working with ELKS, because I am interrupted in the middle of a command.

The "all the time" part is particularly suspicious to me, considering we both have the same network card. Did you configure the I/O memory base (in the driver and on the card) correctly? Did you make sure it does not conflict with any other device? Keep in mind that the address space for hardware device memory is particularly tight on those XT machines. I had to spend some time making sure there were no conflicts; in the end I had to reconfigure this network card using the DOS tool EZSETUP.EXE, as none of the jumper-selectable setups gave a collision-free configuration.

Mellvik commented 3 years ago

@toncho11, First - you can switch to tty2 (f2) to get rid of the messages. I think that's more effective than a more general rewrite of printk.

Now, maybe we can figure out the real reason behind these messages. If we leave the checksum problems out (consider it fixed), could you keep a tcpdump active while rebooting elks and see if you can match the messages with something on the wire? If the messages are coming randomly, and not related to actual (previous) connections, we need to find the source and possibly fix a problem in elks.

--M


toncho11 commented 3 years ago

@pawosm-arm What was the behavior of the card when it did not work? What were the symptoms?

The IRQ and I/O base address (PORT) for an SMC card are set in /elks/elks/include/arch/ports.h:

/* wd, wd.c*/
#define WD_IRQ            3
#define WD_PORT         0x280

The RAM base address is configured in /elks/arch/i86/drivers/net/wd.c:

#define WD_SHMEMSEG 0xd000U

which matches the jumpers on the card. But on the card it is D0000 with 4 zeros, while in the code above we set 0xD000 with 3 zeros.

ghaerr commented 3 years ago

If the messages are coming randomly, and not related to actual (previous) connections, we need to find the source and possibly fix a problem in elks.

Agreed! There should never be a "refused packet" unless a prior connection has taken place. Now, there could be issues with prior connections not being shut down properly (full duplex TCP). So we need to know the exact commands that were run after boot, before these messages start occurring. If it's just the ftpget program, then it would seem that the Linux FTP server may still be running after ftpget exits.

ghaerr commented 3 years ago

The "all the time" part is particularly suspicious for me here considering we both have the same network card.

I don't think this is a configuration issue at all. I think perhaps @toncho11 exaggerated a bit and is just seeing "refuse packet" and checksum errors. @Mellvik has figured out a very likely cause for the checksum errors, and we know why refuse packet occurs, we just need to find out which program (inside or outside of ELKS) still has an open connection.

pawosm-arm commented 3 years ago

@pawosm-arm What was the behavior of the card when it did not work? What were the symptoms?

I didn't experience those on ELKS, as I first made sure the card worked stably with the TCP programs shipped with FreeDOS. Knowing the card was fine, I started work on the ELKS driver.

pawosm-arm commented 3 years ago

@Mellvik though this is not something I should be preoccupied with today, I'd just like you to confirm that this is the problem that you've observed:

                        res = rxhdr->count - sizeof(e8390_pkt_hdr);
                        if (res > len) res = len;
                        fmemcpyb(data, current->t_regs.ds,
                                (char *)hdr_start + sizeof(e8390_pkt_hdr),
                                WD_SHMEMSEG, res);

I assume that the problem occurs when hdr_start points at the last page in the ring and the res is greater than 256 (a page size, sadly not defined as a constant within this driver)?

Anywayz, hopefully, I should be able to do something about it in the second half of this week. I guess the above should be turned into a loop iterating over pages until len data is read (with OUTB(current_rx_page - 1U, WD_8390_PORT + EN0_BOUNDARY) after each read).

toncho11 commented 3 years ago

@pawosm-arm Please tell me about WD_SHMEMSEG. It can not be D0000 with 4 zeros because it is too big, so it is 0xD000 with 3 zeros, right?

pawosm-arm commented 3 years ago

@toncho11

Yes, the addresses printed on this card are all one zero too long
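One way to read the "extra zero", sketched under the assumption that WD_SHMEMSEG is used as a real-mode segment value (it sits in a segment argument position of the fmemcpyb call quoted in this thread): on the 8086 a physical address is segment * 16 + offset, so segment 0xD000 corresponds to the 20-bit physical address D0000 printed on the card. The helper names below are hypothetical.

```c
#include <stdint.h>

/* Real-mode 8086: physical address = segment * 16 + offset, 20 bits wide. */
uint32_t seg_to_phys(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFUL;
}

/* Inverse for paragraph-aligned physical addresses such as D0000. */
uint16_t phys_to_seg(uint32_t phys)
{
    return (uint16_t)(phys >> 4);
}
```

With these helpers, `seg_to_phys(0xD000, 0)` gives 0xD0000 and `phys_to_seg(0xD0000)` gives 0xD000, so the jumper label and the driver constant describe the same memory window.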

Mellvik commented 3 years ago

@Mellvik though this is not something I should be preoccupied with today, I'd just like you to confirm that this is the problem that you've observed:

                    res = rxhdr->count - sizeof(e8390_pkt_hdr);
                    if (res > len) res = len;
                    fmemcpyb(data, current->t_regs.ds,
                            (char *)hdr_start + sizeof(e8390_pkt_hdr),
                            WD_SHMEMSEG, res);

I assume that the problem occurs when hdr_start points at the last page in the ring and the len is greater than 256 (a page size, sadly not defined as a constant within this driver)?

Well, in effect, that's the problem. It doesn't really matter where hdr_start points; what matters is that the packet spans the end of the buffer (i.e. the wrap-around point), so there have to be two read operations instead of one.

Anywayz, hopefully, I should be able to do something about it in the second half of this week. I guess the above should be turned into a loop iterating over pages until len data is read (with OUTB(current_rx_page - 1U, WD_8390_PORT + EN0_BOUNDARY) after each read).

Thank you. I created a patch last night for this, will have @toncho11 test it today and you can have a look (and test/fix) when you have time.

—Mellvik
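The "two read operations" idea can be sketched in plain C as follows. This is not the patch mentioned above and not the driver code: ordinary memory and memcpy stand in for the card's shared RAM and fmemcpyb, and `ring_copy` with its parameters is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy len bytes out of a circular buffer, starting at offset start.
 * If the data runs past the end of the ring, the copy is split in two:
 * first up to the end of the ring, then the rest from its beginning.
 * Returns the number of bytes copied (0 on invalid arguments).
 */
size_t ring_copy(unsigned char *dst, const unsigned char *ring,
                 size_t ring_size, size_t start, size_t len)
{
    size_t first;

    if (ring_size == 0 || start >= ring_size)
        return 0;
    if (len > ring_size)
        len = ring_size;            /* cannot read more than the ring holds */

    first = ring_size - start;      /* bytes available before the wrap */
    if (first > len)
        first = len;

    memcpy(dst, ring + start, first);           /* read 1: up to the end   */
    memcpy(dst + first, ring, len - first);     /* read 2: from the start  */
    return len;
}
```

When the packet does not cross the wrap-around point, the second copy has length zero and the function behaves like a single read, so the common case costs nothing extra.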

Mellvik commented 3 years ago

@Mellvik, Your ability to debug this without looking at a line of code is impressive! Nice work :)

Thank you, @ghaerr. It occurs to me that this approach - diagnose - understand - fix - is what you've been championing all along :-)

BTW - I suggest an update to the text in menuconfig to indicate the type of WD/SMC card supported.

—Mellvik

[apologies for the occasional effects of an overly zealous spelling helper]

Mellvik commented 3 years ago

Hi @toncho11,

could you apply this patch and see if it fixes the problem? Test with ping and let me know what you get.

Make sure you keep your SHMEMSEG.

I have no means of testing this, so it's kind of a stab in the dark.

—Mellvik

diff --git a/elks/arch/i86/drivers/net/wd.c b/elks/arch/i86/drivers/net/wd.c
index 6bc72b6b..88930a2c 100644
--- a/elks/arch/i86/drivers/net/wd.c
+++ b/elks/arch/i86/drivers/net/wd.c
@@ -5,6 +5,7 @@
 /

+#include
 #include <arch/io.h>
 #include <arch/ports.h>
 #include <arch/segment.h>
@@ -269,15 +270,24 @@ static int wd_pack_get(char *data, size_t len)
 	debug_eth("eth: bogus packet: "
 		"status = %#2x nxpg = %#2x size = %d\n",
 		rxhdr->status, rxhdr->next, rxhdr->count);


toncho11 commented 3 years ago

@Mellvik Can you just put the entire file pls? And then I will update it with my settings. Or point me to the file in your fork?

Are the ping and checksum problem the same?

Mellvik commented 3 years ago

No problem.

—M


Mellvik commented 3 years ago

Sorry - I think the attachment was filtered. Here we go again. wd.zip

pawosm-arm commented 3 years ago

@Mellvik I've just tried the wd.c above. It does not break anything and it does not change anything; but in my case the driver always worked, so I can only confirm that at least there's no regression.