Errors when sending large amounts of data

sai2791 commented 3 years ago

From SteveFosdick. ref: Issue #7

I have found another client problem, though, in that the program I was testing with was failing to set bytes &10 and &11 of the OSFILE control block, i.e. the upper limit of the memory block to be saved. It looks like in this case NFS is sending the length, as calculated by a straight subtraction, but possibly not sending all the data. In one case it sent a length of 8523680 (about 8.1 MB) but then sent only 83 1K blocks. I can't tell at the moment if that is because of some internal thing within NFS or if the network comms (or the AUN <> Econet state machine) is not so robust and this much data shows it up.
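As a rough illustration of the "straight subtraction" described above, here is a minimal C sketch. It assumes the usual OSFILE save layout (32-bit little-endian start address at &0A-&0D and end address at &0E-&11); the function names and the example values are made up for illustration and are not taken from NFS or aund.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value, as stored in the control block. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: length as a straight subtraction of the
 * end address (&0E-&11) minus the start address (&0A-&0D).
 * Assumed layout; check it against the OSFILE documentation. */
static uint32_t save_length(const uint8_t *cb)
{
    return read_le32(cb + 0x0E) - read_le32(cb + 0x0A);
}
```

With a start of &00001900 and an end of &00003000 this gives &1700 (5888) bytes. If byte &10 is left holding a stale &82, the end address becomes &00823000 and the same subtraction yields 8525568, a length of the same order as the bogus one reported above.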

sai2791 commented 3 years ago

I have tried saving the full memory from a second processor using the *SAVE command:

*SAVE TEST FFFF0000 FFFFFFFF 0000 0000

This fails after &FA00 bytes have been written.

SteveFosdick commented 3 years ago

What are you testing with as the client? AUN mode or BeebEm mode? I found this issue with B-Em in AUN mode. Also I would be careful about deliberately asking a client to read over its I/O addresses as this could interrupt things independent of the network. In the case of the 2nd proc. reading the tube registers outside of the tube code could well do that.

SteveFosdick commented 3 years ago

Also, while I am actively debugging B-Em as an emulator client, I would not assume the AUN mode in BeebEm is bug free. I recently found an issue where the protocol was stalling because aund is so fast compared to a normal 8-bit fileserver that it was sending a reply before NFS had time to switch its packet receive blocks to deal with that reply.

sai2791 commented 3 years ago

Maybe I am seeing the same thing. I am seeing the client (BeebEm) not completing the save request and only sending part of the file.

Debug output from aund:

fs_acornify_name: [./TEMP]->[//TEMP]
received scout from 0.110 but payload packet never arrived
aund: receive data: Operation timed out
fs_error: 0xff/Operation timed out
scout ack never arrived from 0.110
aund: Tx reply: Operation timed out

At this stage BeebEm hangs and never recovers, but the fileserver is still running and OK.

The portion of the file that was saved seems to be large but not complete, i.e. between F000 and FFE0 over the last few days of testing. A packet capture only shows that once aund times out waiting, it sends lots of scouts back.

SteveFosdick commented 3 years ago

Is this in BeebEm mode? Although I started with the code from BeebEm, all the testing I have done so far is in AUN mode.

sai2791 commented 3 years ago

Yes, I believe this is in BeebEm mode. I am not sure if the aund server has an AUN mode, or how to configure BeebEm or the aund server to talk AUN.

SteveFosdick commented 3 years ago

Aund definitely has an AUN mode - it is the default if the aund.conf file does not load a BeebEm-style econet.cfg file.

BeebEm also supports both modes. In its version of the econet.cfg file a line 'AUNMODE 1' will turn it on. That does mean the BeebEm econet.cfg file and the one used by aund are not 100% compatible, i.e. one can't specify options to BeebEm and then load the same file into aund.

sai2791 commented 3 years ago

The Mac version of BeebEm is not as up to date with these changes and does not include AUN mode in econet.cc/.cpp. A quick diff shows that there are over 100 changes, with some large blocks of new code.

SteveFosdick commented 3 years ago

This is part of the reason, as a Linux user, I choose B-Em over BeebEm. BeebEm has been ported to lots of platforms but then new effort has been concentrated on the Windows version and the others are left to wither or be maintained separately with a different set of features.

I was also thinking last night that maybe the BeebEm encapsulation, or maybe a variant of it, is more suitable for connecting devices within the same host and AUN better at connecting between hosts. Maybe the ideal architecture is to have a separate bridging process between the two so that, for example, if one wants to run more than one emulator on one host and maybe a fileserver on a completely different host, the machine with the two emulators would run one copy of the bridging process.

Anyway, that's getting off topic. I'll try some testing in BeebEm mode.

SteveFosdick commented 3 years ago

I just got this, in BeebEm mode:

fs_acornify_name: [.]->[]
payload ack never arrived from 0.4
aund: Tx reply: Connection timed out

I think this may be the same issue as I had in AUN mode. The difference is that in AUN mode I added a timer in the client so as not to accept a packet from the network before NFS was ready for it, but that doesn't happen in BeebEm mode because BeebEm mode doesn't use the same state machine.

In the client code there is this test:

        // still nothing in buffers (and thus nothing in Econetrx buffer)
        ADLC.control1 &= ~ADLC_CTL1_RX_DISC;   // reset discontinue flag

        // wait for cpu to clear FV flag from last frame received
        if (!(ADLC.status2 & ADLC_STA2_FRAME_VAL)) {

            if (!confAUNmode || fourwaystage == FWS_IDLE || fourwaystage == FWS_IMMSENT || fourwaystage == FWS_DATASENT) {
                // Try and get another packet from network

So the test for frame valid being cleared should wait until the client has completely read the last packet, and that would also take account of the small delay for the CRC to arrive after the packet data. I still think the client is relying on the server not being able to respond absolutely immediately, though. I'll try adding a similar timeout to BeebEm mode.
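To make the idea of such a timeout concrete, here is a minimal sketch of a receive guard that holds back inbound packets until a minimum turnaround time has passed since the last transmit, in addition to the frame-valid test. The type, the function names and the notion of a cycle counter are hypothetical and are not taken from B-Em or BeebEm; the right delay value would have to be found by experiment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guard state: not part of the real emulator code. */
typedef struct {
    uint64_t last_tx_done;   /* emulated cycle count when the last transmit ended */
    uint64_t min_turnaround; /* cycles to give NFS to re-arm its receive blocks */
} rx_guard;

/* Record the moment the emulated machine finished transmitting. */
static void rx_guard_note_tx(rx_guard *g, uint64_t now)
{
    g->last_tx_done = now;
}

/* A packet may be taken from the network only when the CPU has cleared
 * the frame-valid flag AND the turnaround delay has elapsed. */
static bool rx_guard_may_receive(const rx_guard *g, uint64_t now,
                                 bool frame_valid_set)
{
    if (frame_valid_set)
        return false;
    return now - g->last_tx_done >= g->min_turnaround;
}
```

The point of keeping both conditions is that frame valid being clear only says the previous packet has been read; it says nothing about whether the client has set up the receive block for the reply yet.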

sai2791 commented 3 years ago

I will try to update the Mac BeebEm to the latest code in econet.cpp so that it includes the AUN stuff. I have fixed the first two issues that were preventing Econet from actually starting, but BeebEm is missing key presses at the moment, which is driving me nuts.

SteveFosdick commented 3 years ago

I have been doing some testing with both b-em and aund in BeebEm mode, with the client code inherited from BeebEm (windows version). It does seem to be mostly working but my filing system tester for testing OSGBPB seems to be a bit of a torture test. During this test the client and aund frequently get out of sync so the client is discarding whatever aund sends and vice versa. I suspect it may be timing related again but maybe we should better understand how timeout and retry is supposed to work.

P.S. the source for this tester is at: https://github.com/SteveFosdick/FsTest but if you want an SSD to easily run the tests I could provide it. It should work in BeebEm as well as in B-Em.

sai2791 commented 3 years ago

If you could make a copy of the SSD, that would be great. The command line tool I used mangled the names and dropped the last letter.

I've gotten to the point where I have the AUN stuff in the Mac version of BeebEm, but it's not working and is failing on something that should work right out of the box, but then the last three issues were like that too. I did get Windows BeebEm and aund running on the Mac to speak AUN to each other, so there is progress.

SteveFosdick commented 3 years ago

OK, here's an interesting problem. This is running the OSGBPB test in BeebEm mode (both aund and b-em). I have hacked on the beebem.c module in aund to give some extra messages, including turning the debug printfs into a function that logs an exact timestamp so I can match this to b-em's debug log. Here's a snippet:

01/03/2021 23:12:13.833870396: receive 6 bytes from 127.0.0.1:10101 0.4
01/03/2021 23:12:13.833885037: sending seq=161 4 bytes to to 4 127.0.0.1:10101
01/03/2021 23:12:13.833930592: sending seq=162 6 bytes to to 4 127.0.0.1:10101
01/03/2021 23:12:13.835424051: receive 30 bytes from 127.0.0.1:10101 0.4
01/03/2021 23:12:13.835432755: received wrong-size ack packet (30) from 0.4
aund: send data: No such file or directory

So this looks odd. At time 833870396, it appears aund received a scout from the client. It then responds with an ack at 833885037. But what is happening next at 833930592? Is this a scout in the other direction? If so, that isn't a valid handshake: after you have acknowledged a scout from a peer, you wait for the peer to send data.

Then it looks like the client, upon receiving the scout ack, has switched to transmit mode and is not looking for the scout from aund but instead sends the data, timestamp 835424051.

This all seems bizarre. I can't see anything obviously wrong in beebem.c in aund. Unless, of course, the client and server have already got out of step and the messages are not actually responses to each other.

I am also assuming there are no valid data packets that are only six bytes long.
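The rule stated above (after acknowledging a scout, wait for the peer's data) can be sketched as a small receive-side state check. This is an illustration only: the state and event names are invented here, they are not taken from aund or B-Em, and the outbound side of the exchange is deliberately not modelled.

```c
/* Hypothetical states and events for the receiving side of a
 * four-way handshake: scout, scout ack, data, data ack. */
typedef enum {
    HS_IDLE, HS_SCOUT_RECEIVED, HS_AWAIT_DATA, HS_DATA_RECEIVED,
    HS_SENDING, HS_INVALID
} hs_state;

typedef enum {
    EV_RX_SCOUT, EV_TX_SCOUT_ACK, EV_RX_DATA, EV_TX_DATA_ACK, EV_TX_SCOUT
} hs_event;

/* Advance the handshake by one event; anything out of order is invalid. */
static hs_state hs_step(hs_state s, hs_event e)
{
    switch (s) {
    case HS_IDLE:
        if (e == EV_RX_SCOUT) return HS_SCOUT_RECEIVED;
        if (e == EV_TX_SCOUT) return HS_SENDING;
        return HS_INVALID;
    case HS_SCOUT_RECEIVED:
        return e == EV_TX_SCOUT_ACK ? HS_AWAIT_DATA : HS_INVALID;
    case HS_AWAIT_DATA:
        /* Only the peer's data is acceptable here; sending a scout
         * of our own at this point breaks the handshake. */
        return e == EV_RX_DATA ? HS_DATA_RECEIVED : HS_INVALID;
    case HS_DATA_RECEIVED:
        return e == EV_TX_DATA_ACK ? HS_IDLE : HS_INVALID;
    case HS_SENDING:
        return HS_SENDING; /* outbound exchange not modelled here */
    default:
        return HS_INVALID;
    }
}

/* Run a sequence of events from idle and report the final state. */
static hs_state hs_run(const hs_event *ev, int n)
{
    hs_state s = HS_IDLE;
    for (int i = 0; i < n && s != HS_INVALID; i++)
        s = hs_step(s, ev[i]);
    return s;
}
```

Fed with "receive scout, send scout ack, send scout", which is one reading of the log snippet above, this ends in HS_INVALID, whereas the full scout, ack, data, ack sequence returns to HS_IDLE.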

SteveFosdick commented 3 years ago

Ok, here is a disc with the executables. To run, load BCPLROM into a sideways RAM slot, *BCPL then at the '!' prompt one of OSFILE, OSGBPB or BYTEIO. fstest.zip

SteveFosdick commented 3 years ago

Ignore the last SSD, it has a file missing. Try this one: fstest.zip

sai2791 commented 3 years ago

Thanks for the disc. I have run the OSGBPB test several times and get the error, but I cannot see why. I did make a change that made it slightly better and it even finished twice, but it found differences even then. I think I will have to run a packet trace to see if I can spot what is going on from the outside.

SteveFosdick commented 3 years ago

A couple of things that may be interesting. I asked Andrew about retries at https://stardot.org.uk/forums/viewtopic.php?p=311048#p311048 and you can read his reply there. I think that means the retries in beebem.c:beebem_recv are not how the protocol is supposed to work, i.e. any failure after the initial scout packet means the four-way handshake is aborted and starts again with a scout.

I think there are client bugs here, certainly at least as far as what I am doing in B-Em (with the code derived from BeebEm) but possibly this is being obscured by aund retrying things. My current thinking is that sometimes the BBC micro is missing incoming packets because it is not ready for them but I think this was being covered up, but only some of the time, by aund re-transmitting.
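A minimal sketch of that retry rule, assuming the reading above is right: any failure part-way through abandons the whole exchange, and the next attempt starts again from the scout rather than re-sending the step that failed. All names here are hypothetical, and the fake link exists only to make the sketch self-contained; none of it is aund code.

```c
#include <stdbool.h>

/* One step of the exchange: send something and wait for its ack. */
typedef bool (*step_fn)(void *ctx);

typedef struct {
    step_fn send_scout_wait_ack;
    step_fn send_data_wait_ack;
} fourway_ops;

/* Returns the attempt number that succeeded, or 0 if all attempts failed.
 * A failure at either step restarts the handshake from the scout. */
static int fourway_send(const fourway_ops *ops, void *ctx, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (!ops->send_scout_wait_ack(ctx))
            continue;           /* no scout ack: start over */
        if (!ops->send_data_wait_ack(ctx))
            continue;           /* no data ack: start over with a new scout */
        return attempt;
    }
    return 0;
}

/* Fake link for illustration: the data step fails a set number of times. */
typedef struct {
    int scouts_sent;
    int data_sent;
    int fail_data_times;
} fake_link;

static bool fake_scout(void *ctx)
{
    fake_link *l = (fake_link *)ctx;
    l->scouts_sent++;
    return true;
}

static bool fake_data(void *ctx)
{
    fake_link *l = (fake_link *)ctx;
    l->data_sent++;
    if (l->fail_data_times > 0) {
        l->fail_data_times--;
        return false;
    }
    return true;
}
```

With the data step failing twice, the fake link sees three scouts and three data packets: every retry pays for a fresh scout, which is the behaviour that differs from re-transmitting only the lost packet.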

sai2791 commented 3 years ago

Since we don't know if the error is in the client or in aund, I plan to write a standalone program that logs on to the aund server and then sends lots of data to see if I can replicate the issue. Basically, I am going to throw lots of data at the server and see if I can break it.

SteveFosdick commented 3 years ago

I did fix an issue with a flag-fill timeout in the client and was able to get it working reliably.

On the other hand I still suspect that the way aund handles failure may be a factor that contributes to a single missed packet turning into a complete stall.

sai2791 / aund

Errors when sending large amounts of data #8