Async I/O possibilities

ejona86 commented 3 years ago

An async I/O discussion started at https://github.com/keirf/FlashFloppy/pull/412#issuecomment-753576777 . It was an idea to free up a lot of memory from write_bc, allowing it to be reused for reading. Ideally it would allow us to buffer both sides of an entire track and absorb all write delays, until the track changes.

I've created a basic cooperative scheduler and an fs async API on top. I ported HFE (reading and writing) to async to see how it may look: https://github.com/keirf/FlashFloppy/compare/master...ejona86:io-thread . I implemented HFE in a KISS way. While the HFE code seems to work after some very basic testing, the point is the approach, and not the concrete code organization; lots of the code needs cleaning up.

Based on my flash drives' performance and full speed USB speed, I think ED and TD should be considered out reach for HFE independent of implementation. USB limits read speed of an ED track to at least 66 ms, assuming no overhead and latency. In my testing I think we could read an ED track in ~150 ms which leaves no room for writes. Even 100 ms would push it. We'd need more RAM to support them with HFE. ED should be easily supported with IMG instead. And TD would need us to not load both sides of the track.

I was able to steal enough memory to support 1.44 MB HD floppies with HFE, fully buffering both sides of the track. It is tight as expected, although there's still some options to shift memory around a little bit, like swapping to 8 bit dma to save 2 kB and using that memory elsewhere. I don't know whether we should assume we always have enough memory though. I have an HFE buffering approach in mind that would support HD with HFE without fully-buffering the track, but it would need to continually read the track from flash. Is it worth investing much time to support such a thing? The approach would lend itself to also support changing tracks without waiting for writes to finish committing to flash, although write latencies would still impact it.

Main questions:

Does async seem worth the effort?
Does the implementation approach look half-way sound? Or will we want to do it a substantially different way? I'm more concerned about the threading/io than HFE.
Is it fair to limit ED/TD to IMG? How serious would we be about supporting TD?
How concerned are we about the high memory use of HD floppies with HFE? It is worth letting them work similar to today where they constantly read from flash? (e.g., if we see high value in not constantly reading from flash, then we may keep memory available. Or we may encourage IMG for HD even if the HFE support code was there.)

keirf commented 3 years ago

What can I say, that looks pretty neat for a lashup implementation.

The whole point is to be able to buffer at least a whole side in RAM, right, and preferably both sides of a DS image? If we are having to endlessly stream in from USB, there is no safe time to drop in writes without 'suspending' reads and thus delaying index pulses, I believe? In short, I don't understand your proposed not-fully-buffering scheme: Could you expand on it a bit? My current belief is that this should concentrate on what we can fully buffer, while supporting everything else no worse than we do right now.

What I really care about for full buffering is IMG up to ED, and HFE up to DD. I don't think there's much (any?) HD HFE out there that is too esoteric to support easily with IMG? That said your rejigging of memory isn't that bad, although it screws the alt/logfile build a bit -- I think we can make async-vs-not probably dynamically determined based on available memory, and maybe do less async on the alt/logfile build (albeit it makes the build less representative and thus less useful in that way). [EDIT: Or make it all async but with similar behaviour to what we have now when low on memory?]

Anyway it is basically a good idea. A bunch of code probably needs generalising out of HFE if possible, the buffering would be nice to do in the buffercache layer, integrate that with the async layer a bit more... Well I'm sure you know all this! It would be splendid to lean on the read/write_data special-case buffers much less and the buffercache much more, in general in the existing code :)

There are also probably some corner cases to care about, like not killing the io thread while it's yielding during a USB transaction. Again, I don't doubt you know this.

Yeah... This is awesome overall, thanks! This is the first time I've had someone since the Xen project who can contribute useful heavy lifting, and those guys were payrolled by their companies. It is also very healthy to get other eyes on a sizeable project like this because bad design and implementation decisions tend to fester away, unacknowledged or ignored by the original author.

keirf commented 3 years ago

By the way, is the naked attribute sugar for the kind of bare asm function I construct for call_cancellable_fn?

keirf commented 3 years ago

Oh I like the fact that write-ahead-of-read should be rare (line 477, hfe.c). I was worried about that case but you're right: If readahead is continuing during write, it makes sense that readahead easily keeps ahead of the write cursor, just as it does the read cursor, since they are the same cursor. D'oh. :)

keirf commented 3 years ago

One caveat for your particular area of interest: Of course the writes still race seek operations, the writes are occasionally very slow, and you are going to possibly still have trouble issuing timely index pulses while draining writes (since you have no read data to serve up, and maybe potentially no buffer space to stash writes). I think your hard sector scenario is still hard when your bulk data is parked on crappy Flash behind a slow half-duplex link. But I'm a pessimist. ;)

ejona86 commented 3 years ago

In short, I don't understand your proposed not-fully-buffering scheme: Could you expand on it a bit?

EDIT: Or make it all async but with similar behaviour to what we have now when low on memory

That edit is along the lines I was thinking would be possible, but aggressive reads.

Basically, the approach of aggressively pre-filling reads still works even if we can't buffer an entire track. It just is less-good. Let's say read_data is 36 KB. That's still 150 ms worth of buffer. We could buffer 150 ms of reads which consumes ~60 ms of time to pull from flash. If there were any writes, we have at least 90 ms of time to write back some of it to open up buffer for reading. I view it as two super-imposed circular buffers, a read buffer and a write buffer. As the read advances, that opens up write buffer, and as the write advances it opens up read buffer.

Even with the fully-buffer-track approach, the mental-model of the circular buffer is helpful. It allows us to notice that we could actually seek as soon as there's few enough writes remaining to leave a substantial read buffer. And now hard sectors don't look as glum.

My current belief is that this should concentrate on what we can fully buffer, while supporting everything else no worse than we do right now.

Sounds good. I'm trying to make this as complicated as necessary, but no more.

I don't think there's much (any?) HD HFE out there that is too esoteric to support easily with IMG?

My thoughts as well.

That said your rejigging of memory isn't that bad, although it screws the alt/logfile build a bit

Oh, the log file. Right. That's why it is so large.

There's actually some wiggle room at the moment. write_data is 51076 bytes and HD HFE v2 needs 50176 so there's 900 of change. I had cut into the console before I moved the async cb handling back onto the main thread. Moving the cb handling made me feel comfortable reducing thread1_stack to 512 bytes. I'm still also not quite sure how big write_bc will need to be in the end.

the buffering would be nice to do in the buffercache layer, integrate that with the async layer a bit more... Well I'm sure you know all this!

Actually, I hadn't considered using the buffercache layer. I had been pretty happy to avoid it as it gave me "this is all your memory" "simplicity" and avoided duplicating memory. And that was probably the right thing for the PoC. But now that you mention it I could see how that could work. Devil's in the details, but definitely worth looking in to.

By the way, is the naked attribute sugar for the kind of bare asm function I construct for call_cancellable_fn?

Yeah. It is "let C handle the declaration and let asm do the rest." More precisely, it means "don't emit the function preamble and postamble."

I think your hard sector scenario is still hard when your bulk data is parked on crappy Flash behind a slow half-duplex link. But I'm a pessimist. ;)

Yeah, this is "phase 1" :). Best not to throw hard sectors into the mix yet. Eric looks at his imaginary chart. Yeah, those are phase 3 or later.

The slow, half-duplex, non-pipelined link is excruciating combined with flash latencies. The part that had me more strongly consider this approach is the high read latencies after a write that can occur; that made me reevaluate. That said, my hard sector PR is pretty fair already; it's just so much effort to fight the blocking to track down what is causing latency blips. Especially since printk consumes a lot of CPU and SWD is disabled.

keirf commented 3 years ago

Okay, let me do an answer to minor points, and I'll follow up with what I think are the biggest up-front decisions to be made.

Aggressive writes: You may have 90ms of slack in which to write, but you cannot guarantee every stick will ack an arbitrary write in 90ms. Latency spikes are common. It does depend on stick. It may depend on write size (larger writes may behave better). But I would probably want to make the aggressiveness configurable, and by default I would writeback only when the whole track is buffered, or we're seeking, as these allow us a default behaviour of don't start a new track until writes are done. Which is still a big improvement on FlashFloppy today.
alt/logfile: I wouldn't let this tail wag the dog. It's not used that often. That said, I'm not sure if we really care about 300RPM 500kbps HFE. See next message, as this is probably a big up-front decision.
naked: That's cool, it lets the C compiler see the decl which is an ugly aspect of direct asm coding, particularly for static functions. I will use this. :)
SWD: Well tbh I've never used SWD except for device programming. Is it useful for extracting logging too? Obviously I can hook up gdb, but I don't really use debuggers much. But if there's a way to extract logging on Linux that would be cool! Yeah SWD gets disabled, but not really for any good reason. One pin is used in the corner case chgrst=pa14 in FF.CFG. Certainly we could do a SWD-enabled build if that's handy? And I'd consider code to log through that channel too.

keirf commented 3 years ago

The big decisions to be made.

Buffer cache or ring: I think this the Big One. It informs the whole shape of development really. So my view was that the buffercache would sit on top of the FS stack (it's underneath at the moment, and might still appear there for fs metadata caching) and would be the asynchronous interface. Ie the IO thread would be a buffercache server.

Upsides: Everything is pretty and unified. One copy of data (maybe with very minimal stream buffering on top for some image handlers). Maybe simplifies image handlers by pulling code into a shared subsystem.

Downsides: The buffercache grows many new appendages and gets much more complicated: dirty buffers, locked buffers (probably), ... And we'd still need a conceptual ring in the floppy layer to track how much of a track is fetched (and possibly, where the buffers are). Maybe something unified in floppy subsystem makes more sense than buffercache subsystem. Metadata overhead per block probably higher than a simple contiguous ring. Probably need either de-fragging or we need iovecs down the IO stack to do multi-block IO (the latter wouldn't actually be that bad to do imo).

Do we care about 300RPM 500kbps HFE: This is the largest/tightest case we could care about, and we'd need to implement with the RAM limitation carefully in mind. Compared with not bothering and falling back to something like what we do now (large write ring, so that format-track operations remain reliable). So, who might care? I think most HD HFE use cases are either scenarios where the writeback time doesn't matter (host leaves it to its FDC to time out, which is based on index pulses, which we suppress), or the disk format is standard (can use IMG, see #362).

What about hard sectored images? I suppose the only case that might be 500kbps would be eight-inch MFM. But that would spin at 360RPM and so require only ~42kB of RAM. Much more manageable!

If we don't need to achieve 50kB available, we have a lot more latitude in design and implementation.

ejona86 commented 3 years ago

Aggressive writes. I have no problem with the conservative writes being a default. If we did aggressive writes we'd probably want some display feedback in the event of a buffer underrun.

SWD. No, it doesn't do logging, which means it's only for dire situations. I saw the usage of PA14 and hacked around it but still no dice. It didn't boot with 2<<24 in afio->mapr. There's still plenty of ways I could be messing it up as I've never truly used the thing and the datasheet also mentions a handshake starting with JTAG and transitioning to SWD.

Buffer cache or ring. The current buffer cache under the FS could work for buffering writes, but looks ill suited for reads. It prevents reading from cache while a readahead is ongoing and would require larger buffers in the image code just to satisfy fatfs. This approach means copies of the data into/out of the cache. We could hack around fatfs by putting the volume code into a "fake read" mode where it queues the read and returns fake read data. But there'd be no transparency into the cache and calculating amount of memory necessary for the cache is harder since metadata is mixed in. We'd probably want iovecs.

I think we really want something above the fs. I think a "ring cache" utility is the most direct approach. Most directly what we need, most memory efficient, and most versatile, with the downside of more complicated API interactions. I do think a sector-based approach could work, but I think it'd require copies to/from cache and it might use a ring internally to avoid fragmentation.

Do we care about 300RPM 500kbps HFE. It sounds like you are leaning toward supporting HFE in this case, but not bother with the fully-buffered track approach. That sounds fair; I also though the tight memory was quite restricting without much practical benefit. Since I don't think we want two copies of the HFE code (fully buffered track + current approach), that implies the caching approach will gain some complexity to handle partial tracks. That's fine with me. We can size things such that write_bc + write_data/2 >= track_side_size which allows buffering a track-side's worth of writes while also extending the amount of read buffer.

keirf commented 3 years ago

SWD: Ah that's a shame. I haven't played with SWD except for programming, so I can't authoritatively say what could be going wrong in setting up SWD. I'm pretty sure that AFIO_MAPR field can only be set once until reset though. So if you touched it twice that could be a problem. Anyhow, if we don't know how to do logging via SWD, it's not of much interest to me personally so I guess it's moot.

Cache vs Ring: This has been a useful discussion as I would have probably gone with the superficial attractiveness of a 'unified buffer cache' and hacked it to work, which of course could be done at the cost of time and complexity! I suppose the advantages are that the cache can remember previous tracks, share with metadata, and thus overall be a little more efficient on average. But the average case is after all not what we're really interested in, but rather hitting real-time goals. And a cache is probably not the right tool for that job. So I have to agree with you, and having thought a while about it now I'm increasingly enthusiastic about that choice. It's going to be a lot simpler code too.

When A Cylinder Doesn't Fit: Ok, so specifically we don't care about fitting HD HFE. I think we agree on that. So we just need to take care with fallback behaviour. Sizing the rings as you suggest makes sense. We'll also have to probably synchronously flush writes, at least in conservative-write mode. On all image types (except HFE) we can buffer both sides of a DS disk where it fits, degrade to buffer one side where only that fits (and treat side change like a seek), and something single-sided and HFE-fallback-like (as you suggested) that is really no worse than we have now for giant tracks (some people are configuring giant IMG images for example, for better or worse).

Who Codes It Up: Well, I'm figuring you want the okay to pursue this? It's fine with me if so, we can stick it in a branch and it can aim for a new v4.x release series. It'll need carving up into a bite-sized pieces for review, to some extent. It will also need to go through FF_TestBed automated tests, which I can push it through since you won't have the setup for that. Anyway, nothing surprising here I don't think... If it's going well perhaps I should give you direct access rights to the repo.

ejona86 commented 3 years ago

Cache vs Ring: I've split out the ring logic from HFE into a separate API. It cleans up HFE. I've gotten far enough with the limited memory support that it's clear HFE will be minimally impacted by it.

When A Cylinder Doesn't Fit: My current plan for formats with separate track sides is for the ring reading code to update a "shadow" ring in parallel to the main ring. If the side changes, it can be "swapped in" (by changing pointers or some such; details TBD). If there's not enough memory for both sides, then it just goes back to "single ring only" mode. Having both rings is probably annoying in the ring code, but probably not hard. I'll reevaluate once I see how the code shapes up. (The main alternative to the shadow ring is to interleave the sides.)

Who Codes It Up: I've been cleaning up what I had and have been pushing it to my https://github.com/keirf/FlashFloppy/compare/master...ejona86:io-thread branch. It's honestly looking like it'll be an easy review, other than the massive number of places I could introduce bugs. Be aware that the constrained memory changes are causing me to change essentially every line of the ring cache as I have it use the ring cursors instead of absolute offsets, and I'm swapping things to tracking bytes itself of sectors. So that change will be quite noisy. I'm very glad you have automated testing.

ejona86 commented 3 years ago

FF_TestBed looks very straight-forward. If you have an extra PCB that's the cleanest, but this would be trivial with jumper wires. I have some 1x10 and 1x7 housings, so it wouldn't even be that janky. I should be able to set that up this week.

keirf commented 3 years ago

Okay that all sounds good. To be honest you're obviously fine to crack on and ask questions if you need to. If there are any areas you'd specifically like me to comment on, or image handlers you'd like me to port across to the new environment, let me know.

FF_TestBed is very handy, I designed it to be super simple. No IRQs, no heap, very easy to hack in new tests or diagnostics. So it is out of the box a simple set of regression tests, but you can see on branch speed_test it was easy to hack in logging on read-after-write latencies for comparison with "competing Gotek products". That sort of thing could make it into master branch, as could some tougher regression tests -- the current set are pretty vanilla, they don't bring much Pain, but were purely to pick up on the really dumb bugs I was regularly introducing a year or so back during more active development.

I do have... one or two PCBs... Ok maybe 100 actually. I panellised them and got 10x10cm PCB set done, and these are hardly popular with general FF users so they are my go-to for shimming flat-pack furniture and the like. ;) I'm happy to post you a couple if you email me a shipping address. They shouldn't take too long to get to you and it's a super-neat setup then.

ejona86 commented 3 years ago

I've implemented track-larger-than-ring support and cleaned up the branch. The ring_io code is more complex than I'd like, but about as complex as I expected.

I put an 'async' flag on image_handler so now the arena allocation is auto-tuned and non-async images are supported. But Direct Access is broken if disk_handler is async because write_bc is too small, which blocks FFTestBed testing. So I think adding async versions of `disk*` I/O and DA async support are the main blockers at this point.

keirf commented 3 years ago

That's great! Maybe you can do a broader reset of the rings when switching to DA, to get into non async mode? On exit from DA we always fully remount so that direction is already covered. It's possibly an alternative to implementing more async stuff. Whatever you think is better, neater, etc.

ejona86 commented 3 years ago

Maybe you can do a broader reset of the rings when switching to DA, to get into non async mode?

I'll give that a try. That'd be a more expedient. I've been a little concerned about going into/out-of DA mode and losing any async writes. I feel like it is probably more likely exiting DA mode than entering, so sync I/O is probably a plus. (I've thought I might need a "flush" function or something on image_handler, but haven't really wanted to go down that road unless I have to.)

keirf commented 3 years ago

The entry to DA mode is pretty shoddy really, tucked away in image.c. The aim originally (like, day one of FlashFloppy) iirc was that it would be possible to seamlessly move between DA and non-DA modes. That was killed off because DA mode changes the FAT filesystem and/or selected image file and so we really need a "full reset" on exit. But the entry to DA has forever remained a bit buried. It could at least perhaps be pulled out to floppy.c and may be easier to hack on there?

Also, more broadly, it would be nice to support DA mode even when no stick is inserted, or no image mounted (eg ejected). It feels a bit icky that an image must be mounted to get into DA mode. But that would need a method to force main.c out of navigation mode, and is certainly out of scope of your current work. Still it indicates the direction DA support may go and that what is there currently in terms of integration with the rest of FlashFloppy is not to be greatly respected or tiptoed round.

ejona86 commented 3 years ago

I received the FF_TestBed adapter board and attached male+female connectors to it; pretty easy and very clean. It took me a bit to figure out the test bed wants the DUT to be in Shugart mode and to get the S0/1 jumper right. Testing against master, hfe and adf_test(11) work fine, but dsk hangs and adf_test(22) spams WARN at amiga.c:61: "index.count >= 2" and img_test() spams Long flux @ dma=454 bc=7749: 46658-46370=288 / 36 WARN at floppy_generic.c:53: "TRUE". So I still need to play with it a bit before I can test out my changes.

keirf commented 3 years ago

It can be a bit fussy. I will add the jumper settings to the "Running The Testbed" wiki page, where they properly belong. I will also give it a whirl myself and let you know how I get on.

keirf commented 3 years ago

Well, I remember why I don't run it as often as I should, I always find it a pain to set up. The TestBed and DUT can backpower each other via the ribbon which can be confusing, plus for some reason (perhaps my new PC) I find it really hard to flash via USB-TTL, it takes me loads of attempts. Anyway, I did get it running. Here's a full round:

*** ROUND 11 ***

HFE TEST:
Motor-to-RDY: OFF=6us, ON=200ms
We have 4654 4489s

DSK TEST:
Period: 272 ms ;; GAP3: 84
Index delayed by 22 ms
MFM 8192 r/w sector - OK
Period: 342 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK
Period: 342 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK

ADF TEST:
Amiga DD - OK

ADF TEST:
Amiga HD - OK

IMG TEST:
Period: 199 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK
Period: 199 ms ;; GAP3: 84
Index delayed by 3 ms
MFM 512 r/w sector - OK
Period: 199 ms ;; GAP3: 21
Index delayed by 2 ms
FM 256 r/w sector - OK
Period: 199 ms ;; GAP3: 84
Index delayed by 20 ms
MFM 8192 r/w sector - OK

I think I'm going to solder in a power feed from the TestBed for the DUT. That would be saner.

EDIT: Also the MOTOR-to-RDY line needs a modded Gotek DUT, to wire through the MOTOR signal. A normal Gotek will say MOTOR ignored.

keirf commented 3 years ago

I've revamped the wiki page https://github.com/keirf/FF_TestBed/wiki/Running-The-TestBed

Please let me know if you think anything is still unclear.

ejona86 commented 3 years ago

I got FF_TestBed up and running. I'm seeing one DSK error about expecting 16 sectors, but parts of DSK seem to still be okay and everything else looks fair. This is "good enough" to notice if I break the world. My earlier problem was messing up when preparing the USB drive.

I have no problem programming with my USB-TLL nor ST-Link. The only problem I had was re-learning how to unlock the flash. I've seen the backpowering happen with the Greaseweazle and FluxEngine, so I was prepared for that, but it didn't cause any trouble.

Fixing up image_setup_track looks nontrivial so I'll still need to figure out an approach there. With a huge hack, I am able to test my code and HFE passes. IMG is hanging for MFM 8192, which is surprising; something to look in to.

Huge hack:

diff --git a/src/image/image.c b/src/image/image.c
index 4145296..bcb892d 100644
--- a/src/image/image.c
+++ b/src/image/image.c
@@ -101,6 +101,8 @@ bool_t image_valid(FILINFO *fp)
     return FALSE;
 }

+static bool_t da_mode;
+
 static bool_t try_handler(struct image *im, struct slot *slot,
                           DWORD *cltbl,
                           const struct image_handler *handler)
@@ -120,6 +122,10 @@ static bool_t try_handler(struct image *im, struct slot *slot,
     im->stk_per_rev = stk_ms(200);

     im->disk_handler = im->track_handler = handler;
+    if (da_mode) {
+        im->track_handler = &da_image_handler;
+        im->nr_sides = 1;
+    }

     mode = FA_READ | FA_OPEN_EXISTING;
     if (handler->write_track != NULL)
@@ -259,12 +265,16 @@ bool_t image_setup_track(

 #if !defined(QUICKDISK)
     if (!in_da_mode(im, track>>1)) {
+        da_mode = FALSE;
         /* If we are exiting D-A mode then need to re-read the config file. */
         if (h == &da_image_handler)
             return TRUE;
         h = ((track>>1) >= im->nr_cyls) ? &dummy_image_handler
             : im->disk_handler;
     } else {
+        da_mode = TRUE;
+        if (h != &da_image_handler)
+            return TRUE;
         h = &da_image_handler;
         im->nr_sides = 1;
     }

keirf commented 3 years ago

Good news on the TestBed. Weird you still have some errors, but it's good enough for now. You can see if you break the world and I can also run tests in due course too. And maybe make some diabolical new ones.

Yes it might be better to teach floppy.c and/or floppy_generic.c a bit more about DA and mode switching. Or use a new return code from image_setup_track to remount things within floppy.c without disturbing main.c? I don't really know what the best answer is, but it's not like the current DA hooks in the rest of the code are anything great.

ejona86 commented 3 years ago

I've tested HFE a bit and have resolved the issues I've noticed. But I haven't figured out a good way to give it a good stress test; my cheap USB floppy controller doesn't like FlashFloppy+HFE. I think I'm to a stopping point for this phase. Would you like to review the commits one-by-one, with each having its own PR? Or one large PR? Or just dump it all in a branch? I'll revise my current branch before doing so to make it cleaner.

I reworked the DA handling to have it controlled more by floppy_generic.c. I'm disappointed it wasn't easy to put in floppy.c but it is a bit nicer than it was in image.c. The handling is awkward because DA detection is dependent on how many tracks are in the current image. So I have to load the normal image handler and then replace it with DA when DA is needed. It looks like it'd be just a few lines to enhance it so DA mode can be entered even if no image is mounted.

I noticed that my async vs blocking allocation was very, very wrong and was dereferencing a NULL pointer. This was breaking the IMG test. The allocator now assumes a blocking handler and then reallocates if it turns out the handler is async. Not happy about the circular dependency existing, but the retry is simple.

I noticed switching sides with the ring buffer was really fast when fully buffering and slow otherwise. I tracked it down to read_bc causing the reader to be in the future; when the side changes we need to rewind and re-read what's already been read. read_bc is at least 16 ms (HD) which means 10ms grace period isn't enough cover. I've solved this by limiting the amount buffered in read_bc (to 2kb, which doesn't actually reduce the size) and keeping some already-consumed data around (4kb for HFE).

ejona86 commented 3 years ago

SWD: I learned it has a companion SWO (TRACESWO) which does allow printf-style debugging (not a console, because it is one-way). Unfortunately enabling SWO requires pa15 and pb3 (it uses pb3, but the register configuration requires enabling JTDI as well). The cheap st-link v2 clones also don't support SWO unless you do a hardware mod. So overall no real benefit over serial for us, although for a custom board it might make more sense.

keirf commented 3 years ago

I'm inclined to give you access rights and dump it in a branch? The aim here when ready is to branch master into stable-v3 and then merge your branch into master as new v4.x development series. The sooner we get to that point the better imo. (EDIT: Even if we're not ready at that point to immediately call v4.0a)

Yeah the side switch thing is fun eh? Glad you worked out a scheme for the non-fully-buffered case :)

ejona86 commented 3 years ago

I'm game for the branch approach. I've cleaned up my branch to be ready to be rebased on top of the branch point. I split out https://github.com/keirf/FlashFloppy/pull/439 which is trivial and can happen before the branch point. I took a stab at the copyright notice for the new files; I hope it looks fine to you. Mention it if you have any a concern.

How much do we want to do in v4.x before releasing? Obviously stabilize. But would you be hoping for async IMG or similar to be done before a 4.0 release? I sort of imagine you do since it seems we want an async IMG and it'll need stabilizing, but I don't know the full scope that you might hope for.

keirf commented 3 years ago

Okay I sent you an invite for direct access. I think once you accept that you're set, though I'm not sure what granularity of access control is provided and whether I'll need to tick some boxes. I should have already asked: I assume you're set up with 2FA on your account?

The copyright/written-by notices are perfectly fine. You can feel free to add your name to any files you substantially modify. We could have an AUTHORS or CREDITS file too, though it feels a bit highfalutin for such a small number of contributors.

I'm not precisely sure where we should place the goalposts for v4.0a. I'd really like to get some test from users who have had problems with the existing "dumbio", so that's tickets:

277 : ED IMG
362 : HD IMG, can also test HFE but that's not going to be fully buffered

I can ask some regular users if they have any HFE scenarios with unreliable writes. All the same, even without IMG, drawing a line for 4.0a and getting wider test on it is not a bad idea. It is also then a baseline for our own testing, both correctness and perf.

ejona86 commented 3 years ago

Okay, so async IMG is on the docket. The read/write paths look pretty stand-alone, but we'll need to figure out how we want to handle the sub-512 byte sector alignment. I should have an easier time stress testing IMG as well.

keirf commented 3 years ago

I had a quick look down the list of collaborator privileges, and it looks like there's everything you could plausibly need, so that's good: https://docs.github.com/en/github/setting-up-and-managing-your-github-user-account/permission-levels-for-a-user-account-repository

keirf commented 3 years ago

As far as I'm concerned, you are now welcome to create and manage any branches you need to get your work ready for master. When you're ready, we can branch stable-v3 and open v4 development in master.

ejona86 commented 3 years ago

I've cut the branch and pushed my changes. And then fixed the build I broke (quickdisk). Looks like "make dist" is the thing to run first in the future.

I assume you're set up with 2FA on your account?

Yes, my GitHub account has 2FA. I thought you'd be able to verify that once I accepted the invitation, but it seems that may be limited to github organizations.

keirf commented 3 years ago

Okay, all looks good to me. My strategy for bug fixes suitable for stable-v3 branch is to commit to master and then cherry pick. As I have just done for the fix for #430. That one had already been tested on a branch so I cherry-picked it straight through.

With stable on a separate branch it doesn't matter if master is crazy for a little while. Pretty much it's just you hacking on it right now!

ejona86 commented 3 years ago

I've made what seems like good progress with IMG. I've got the shadow ring implemented and all the async stuff hooked up to IMG. I've left the 1k temp buffer in-place to ease the process. Working on debugging, so I've not pushed it.

I noticed that IMG.CFG allows setting bps to 8192. Since IMG works on units of a sector, that would mean it'd need a write_bc buffer of 32 KB (16 KB + 2 byte and then rounded up to power-of-two) to support 8192. How important is that large of a sector size? It looks like the code could be changed to work on partial sectors, since if CRC fails it just logs (vs aborting the write). Thoughts?

I'm fighting what looks to be some IMG code failing to observe the circular aspect of write_bc, as I get CRC failures for the rest of the sectors after write_bc wraps. It appears it was pre-existing but hidden by the 32 KB write_bc.

I resurrected a (not-retro) PC of mine with an on-board FDC in hopes of using it to stress the HFE/IMG code. Unfortunately it doesn't like reading either format, even before my changes, just like my USB FDC. I'm giving up for the while. I know my USB FDC worked with vanilla Gotek, so I may play with it some more looking for appropriate configuration, but probably not soon.

keirf commented 3 years ago

Sorry I haven't been involved really, as I've been at work on Greaseweazle development and building Greaseweazle devices.

It sounds like progress is good. 8kB sectors are rare but HD formats looking to pack in the most data do use them. So for example, you will see that the XDF handler has 8k sectors in the geometry. I'm not sure if XDF is writable though! I would also not be surprised if some synth(s) out there with HD drives also make use of 8k sectors. In short it should work...

For testing, stressing with TestBed is not a bad idea? We (I?) can add more tests -- XDF, 8k sectors, write stress tests, ... It's also easy to collect timing stats.

keirf commented 3 years ago

Could you give a brief summary of where we are and what remains to be done? And perhaps your thoughts on some good tests? Don't feel obliged to do a big writeup: a quick braindump is fine!

I'm thinking about pulling FatFS into TestBed, and then I can do more real-world style testing of formatting, mixed-workload file io, etc. Say DD, HD, ED IMG, and DD, HD HFE.

ejona86 commented 3 years ago

I found and fixed the blocking bug that was ignoring the buffer mask. I was going to do a bit more testing today and then push the change to swap IMG over to ringio along with shadow ring support. Pushing that change will break sector sizes greater than 512 bytes, but fixing it would be follow-up work.

keirf commented 3 years ago

Oh that sounds great! Will sector sizes smaller than 512 bytes be immediately supported?

I have someone with write errors on an old TSC Flex system who it would be interesting to get to test the new IMG code. But it requires 256 byte sectors.

ejona86 commented 3 years ago

Just finished testing the IMG stuff and further-fixed the bug I thought I had fixed yesterday. It is all pushed now.

Yes, smaller than 512 is fine. The problem with greater than 512 is there will never be enough write_bc buffer available. I think just writing is a problem; reading looks fine because its 16 KB buffer (vs write's 32 KB) was already too limited to fully-buffer the sector.

keirf commented 3 years ago

Cheers. That doesn't look too bad to fix. More continuation-style goodness!

ejona86 commented 3 years ago

I just pushed write support in img for sectors > 512 bytes. It now handles the bytes as the arrive. The img read path should probably be improved at some point to do smaller-than-1024-byte chunks so that the read_bc isn't forced to completely drain before refilling. But I've also not been seeing buffer underruns so it can wait for another day.

ejona86 commented 3 years ago

Actually, FF_TestBed for the 8k.8k.img is showing things broken for reading for 8k sectors. So that'll need to be addressed. Doubling write_bc's size fixes the problem, ~so it probably is working for 4k sectors already~ so it's because the chunk size of 1024+crc can't fit in 2 KB.

a505765f either broke reads a bit or exposed they were already broken and is showing up in FF_TestBed. ~It seems the failure is being triggered by D-A exit seeking causing raw_setup_track() for every track. Putting a delay_ms(16) in raw_setup_track() makes it mostly start passing, which reduces the number of tracks raw_setup_track() is called for.~ (Edit: nope. It was "two wrongs make a right.") I'll continue looking into it.

keirf commented 3 years ago

I had a quick look at D-A exit. I see it's now tested in dma_rd_handle(). Is there a reason to have moved it out of image_setup_track()? It seems it could still be implemented there, and that would cover dma_wr_handle() too (which would be a rather unusual way to detect D-A mode change, but it's nice to have the coverage especially if it's for free).

ejona86 commented 3 years ago

Supporting dma_wr_handle() for D-A would only need a in_da_mode(im, track>>1) != image_in_da_mode(im) condition. Although the write will be thrown away in floppy_cancel() as part of reloading. I also see code in IRQ_STEP_changed() that ignores steps while writing. I guess you could maybe step and then quickly start writing. I didn't explicitly ignore writing; it is just that reading already had D-A-change-detection logic to manage the write protect.

The floppy code in floppy_mount() now determines whether D-A mode should be used. D-A is now mostly a normal handler and this exposes it up to the floppy code so the buffer allocation code knows whether it needs async allocation or not. The logic for enabling D-A in floppy_mount() needs to match the reload logic. Having that split between two modules (floppy and image) is difficult to follow and is quite strange looking to be in image now.

keirf commented 3 years ago

Okay that's probably fine really. Though are you saying that in general floppy_cancel() will discard dirty buffers? Is that good? It seems unlikely: users would probably like their writes to persist if at all possible, and writes should be drained before remount or entering navigation mode.

Does this check save much time during D-A mount by the way? https://github.com/keirf/FlashFloppy/blob/e60cfc84058e533af9a0171b44a3b71f5a295362/src/floppy_generic.c#L245

ejona86 commented 3 years ago

Though are you saying that in general floppy_cancel() will discard dirty buffers? Is that good?

We're talking about a write to a D-A track before we notice we are in D-A. And yes, things like write_bc are dropped. I wouldn't say it is good, but for normal usages of D-A it seems potentially fine.

users would probably like their writes to persist if at all possible, and writes should be drained before remount or entering navigation mode

I have this on a TODO for the entering D-A, although it seemed like it would generally be okay. To solve this I was expecting to put a "sync" function on the image handlers. There's a related issue if a not-present track is seeked to, where the dummy handler will be used and ring_io_progress will stop being called as long as the controller stays on that track.

But as far as navigation mode and such, there was no protection before the async work. If two write batches are enqueued, the first one takes a while, and then a button is pushed, then as soon as the first batch returns the button press will be detected and run_floppy() will return and floppy_cancel() is called. I'd like to see that improved, especially for async, but given it was pre-existing I hadn't worried about it much.

Does this check save much time during D-A mount by the way?

No. It was more about calling out that we only have to do the two-stage load because cylinder 254 is ambiguous.

keirf commented 3 years ago

Oh yes I suppose that's true. The exit on button press is not really worse than it was already, though it depends perhaps how long writes might be buffered for? What's the worst case there I wonder?

I was also worried about the entering and leaving D-A mode, since that also takes us through floppy_cancel().

D-A exit: All D-A writes are synchronous still I believe? So no problem here? We will have completed the writes before dma_rd_handle() is next called.

D-A entry: Does seem a problem. We do a write, then seek to 255. We can call dma_rd_handle() because the image handler's async logic manages the dirty buffers. But dma_rd_handle() detects D-A entry and we end up in floppy_cancel() with buffers yet unwritten. This seems like a case to be dealt with, like with your suggested sync handlers?

The lack of progress while on the dummy handler is a pain. I wonder how to fix that. Maybe empty tracks should be handled by the respective image handlers, but with helper functions to avoid code duplication? EDIT: Could this one be a case of care in using disk_handler vs track_handler?

ejona86 commented 3 years ago

The exit on button press is not really worse than it was already, though it depends perhaps how long writes might be buffered for? What's the worst case there I wonder?

When the track is fully in memory, mainly just the normal write lag time. When ringio has no batches to read it does a write. It tries to do a batched write when possible, but is happy to just write 512 bytes if that's all that's available. This means that it immediately writes 512 bytes at a time as long as the flash is keeping up, but if the flash is slow it will naturally increase the write size shifting toward optimizing for throughput.

HFE with a track that doesn't fit in memory is worse off, because it will be mixing reads with the writes. The reader will stall if the writer is too far behind, so after a rotation either everything is written or we're getting reader underruns.

The lack of progress while on the dummy handler is a pain.

I've not thought about it much, but I think the easiest thing to do is when image_setup_track changes the track_handler image_setup_track should just call the disk_handler's sync handler. We normally sync when changing tracks, after all.

keirf commented 3 years ago

Thanks for all the replies, and sorry for all the questions! :)

When the track is fully in memory, mainly just the normal write lag time. When ringio has no batches to read it does a write. It tries to do a batched write when possible, but is happy to just write 512 bytes if that's all that's available. This means that it immediately writes 512 bytes at a time as long as the flash is keeping up, but if the flash is slow it will naturally increase the write size shifting toward optimizing for throughput.

So we could summarise this mode as reads have strict priority over writes until whole track (cylinder?) is buffered. Plus, sync writes on track (cylinder?) change? No read-after-write delay, no side-switch delay, possibly cylinder-change delay? That sounds perfect.

EDIT: Only one other question on this mode, does it deal elegantly with one side fits in RAM but not both sides? Does it fall back to possible side-switch delay, or does it fall back to the HFE-doesn't-fit behaviour you describe below?

HFE with a track that doesn't fit in memory is worse off, because it will be mixing reads with the writes. The reader will stall if the writer is too far behind, so after a rotation either everything is written or we're getting reader underruns.

Of course we do want this mode, it's necessary for your other work and it's a good idea if you know writes can keep up. However I do not think it should be the default. If the track/cyl won't fit in RAM, I believe the mode should switch to writes have strict priority over reads, and sync at end of write(s). So basically the v3.x behaviour. It avoids underruns and actually works well on most systems which rely on FDC and observing index pulses to trigger timeout.

How hard is this to implement? I'm thinking if you can surface it via a simple static variable or something, I can work out how to plumb it into FF.CFG. I may requisition write-drain= which looks a lot less clever as-is given the new smarter io management.

I've not thought about it much, but I think the easiest thing to do is when image_setup_track changes the track_handler image_setup_track should just call the disk_handler's sync handler. We normally sync when changing tracks, after all.

I see, the sync is hidden within the specific image handlers' setup_track calls, right? So gets missed if we switch out track_handler. It seems clearly a good idea to pull out the sync as a new callback, then you could call it from image.c or floppy.c and get round this issue. There's also a micro-optimisation that you could trigger the sync even during STEP_active during which time setup_track is delayed including your current sync point.

ejona86 commented 3 years ago

Only one other question on this mode, does it deal elegantly with one side fits in RAM but not both sides? Does it fall back to possible side-switch delay, or does it fall back to the HFE-doesn't-fit behaviour you describe below?

Right now this is at the discretion of the image handler, not ringio itself. If you ask ringio to handle more than will fit into memory then it will stream. So I coded img to buffer only one side if the entire cylinder is unable to fit in RAM.

Of course we do want this mode, it's necessary for your other work and it's a good idea if you know writes can keep up.

FWIW, I think we'd expect most hard-sectored disks to fit fully in memory. So my other work isn't all that impacted.

If the track/cyl won't fit in RAM, I believe the mode should switch to writes have strict priority over reads, and sync at end of write(s). ... How hard is this to implement?

For HFE we still want reads to have priority over writes because writes require reads and we have a small write_bc buffer. But yeah, it would be appropriate to sync at the end of writes.

To do this as part of HFE, I think that's just a bit of code in hfe_setup_track to check whether everything fits in memory and if not call ring_io_sync() (independent of whether it is aware that writes are outstanding).To do this as part of ring_io, we'd have the same logic, but as part of ring_io_seek(). Unclear which location is best at the moment, although HFE is in a better position to read configuration options.

The only case that would fail is if another write starts while we are blocked on that sync. That is concerning as I hadn't considered it previously. Since rdata isn't flowing, that seems unlikely for soft-sectored disks. But the same logic implies that two write batches essentially never happen with the old sync code.

I may requisition write-drain= which looks a lot less clever as-is given the new smarter io management.

For the moment, it would seem fair to disable the sync if write-drain=realtime but to do it otherwise. (Edit: hard sectored images would use realtime, FWIW.)

I see, the sync is hidden within the specific image handlers' setup_track calls, right?

The blocking for sync, yes. Although I'll note that periodic calls to ring_io_progress() also works.

There's also a micro-optimisation that you could trigger the sync even during STEP_active during which time setup_track is delayed including your current sync point.

Looks like that is 1ms. Probably doesn't make much difference as there's already an I/O operation in progress. We could maybe add a thread_yield() to dma_rd_handle() or floppy_handle() to aid the operation's progress, but that's as far as seems useful.

ejona86 commented 3 years ago

I have plugged the write holes for entering D-A, swapping to dummy handler, and user navigation. HFE now eagerly writes when the track does not fit in memory, except for write-drain=realtime.

keirf / flashfloppy

Async I/O possibilities #426

277 : ED IMG

362 : HD IMG, can also test HFE but that's not going to be fully buffered