keirf / flashfloppy

Floppy drive emulator for Gotek hardware
Other
1.33k stars 193 forks source link

ADF write corruption #32

Closed jamarju closed 6 years ago

jamarju commented 6 years ago

I'm getting sector checksum errors when writing to an ADF image. Sorry for being so vague, I will try some way to reproduce it consistently. So far it's happened while saving a civilization game but also just copying files over from Workbench. If there is useful data from the firmware out to the serial port I can try and dump that too. Also THANKS for building this, this software is a godsend (therefore you must be a god).

bstrobel commented 6 years ago

I noticed something similar recently. I copied real floppies using the internal floppy disk drive to a Gotek connected as the second/external drive on an A500. I used the Workbench 1.3 diskcopy utility from the command line. Everything seemed to work fine until I tried to use the copies from the disks. I get disk corrupted errors (I do not remember the exact error message at the moment, need to check) for many of them but not all.

I have programming skills in C, an oscilloscope, a logic analyzer and a ST_LINK/V2-1 debugger/programmer from a Nucleo-64 board. The only thing I lack is time to really dive into it. But I hope this is better during the Christmas holidays. Happy to get your advice @keirf where to start and how to approach the issue.

Anyway thanks for the cool project and providing it for free!

keirf commented 6 years ago

Always worth trying another USB stick.

bstrobel: First port of call is to make a debug build and then you get serial logging. This will tell you, for example, whether any writes were missed (which, if writing to an ADF, will generally mean your flash drive is very slow on some writes).

bstrobel commented 6 years ago

Ok, will try this. Will take some time since I have to set up the tool chain first. I'll come back once I have it up and running.

bstrobel commented 6 years ago

Sorry, that I need to ask, but how to make a debug build? I have set up the dev env and have also executed a "make dist" successfully. But how do I tell make to build a debug version of FlashFloppy?

keirf commented 6 years ago

debug=y make -j8 gotek

You can adjust the baud rate at the top of src/console.c (default is 3Mbaud)

jamarju commented 6 years ago

Here is the log file of FlashFlopy 0.95a while copying a file that resulted in checksum errors. FlashFloppy compiled with debug=y make gotek and baudrate = 230400 in src/console.c:

ff.log

Aside the normal "Write x/y" lines, I noticed these:

New max: read_us=1809
New max: read_us=3532
New max: read_us=4607
New max: read_us=4906
New max: read_us=8486
Trk 86: late 534us
Trk 102: late 576us
Trk 104: late 302us
Trk 116: late 769us
Trk 128: late 945us
Trk 156: late 19us
Trk 6: late 112us
New max: read_us=30309
Trk 36: late 222us
Trk 42: late 524us
Trk 44: late 774us
Trk 38: late 1032us

An attempt to read the copied file (it's a lha file so I'm just doing a lha t file.lha to test for errors) results in:

"checksum error on disk block 980".

keirf commented 6 years ago

Can you attach the damaged ADF? There is nothing obvious in the log. All the Write x/y lines appear to come in groups of 11 (as expected), and there are no warnings about bad headers or missed writes. This might therefore indicate that writes are getting into Gotek RAM okay but writeback to USB is going wrong...

jamarju commented 6 years ago

Here it is:

damaged_adf.zip

jamarju commented 6 years ago

Okay, so I extracted the test file from the adf ignoring the checksum errors, hexdumped it and took a diff -C 4 GOOD.hex BAD.hex> good-bad.txt, the results of which are: good-bad.txt

Then I noticed that the errors consist of either:

captura de pantalla 2017-12-29 a las 22 49 39
keirf commented 6 years ago

If it's not a hardware problem (stick, Gotek) then another plausible explanation is a bug in the USB stack, perhaps sending the same packet twice (USB full-speed packets are 64 bytes). There's a data FIFO in the STM32 chip's USB hardware, and I wonder what happens if that doesn't get filled quickly enough as data is sent. You could try raising the priority of the USB IRQ -- it is USB_IRQ_PRI at the bottom of inc/util.h and you could try changing the value from 14 to 5.

jamarju commented 6 years ago

I'll try that and report back, thanks.

tomse commented 6 years ago

I was installing MSDOS622 using 0.9.5a today, and I had several read errors. it seemed that the error occured on tracks 37 and 57, could be coincidence I don't know. I wasn't able to boot on the newly installed DOS - that may or may not be related at all as I installed using a new IDE->SD adapter + SD as a harddrive.

jamarju commented 6 years ago

Raised the USB IRQ priority to 5, didn't seem to help. Also I compared two ADFs after doing the same operation on both (copy testfile df0:). Both ADFs were corrupted at random different places, so this is not deterministic at the ADF level.

However, I can see a pattern: duplications always happens at offsets ending in 0xC0.

*** 37625,37636 ****
  00092f80: c6fe 502b 7916 a8c7 32d2 6006 cc70 ac47  ..P+y...2.`..p.G
  00092f90: 88ba 0393 8995 95a1 21a4 a417 67c0 5560  ........!...g.U`
  00092fa0: ab4c 48a8 ff1f 308d 85c3 8016 004d 865d  .LH...0......M.]
  00092fb0: 7586 3a41 f023 0258 eb10 7490 2a80 d113  u.:A.#.X..t.*...
! 00092fc0: c6fe 502b 7916 a8c7 32d2 6006 cc70 ac47  ..P+y...2.`..p.G
! 00092fd0: 88ba 0393 8995 95a1 21a4 a417 67c0 5560  ........!...g.U`
! 00092fe0: ab4c 48a8 ff1f 308d 85c3 8016 004d 865d  .LH...0......M.]
! 00092ff0: 7586 3a41 f023 0258 eb10 7490 2a80 d113  u.:A.#.X..t.*...
  00093000: 0000 0008 0000 0372 0000 0122 0000 01e8  .......r..."....
  00093010: 0000 0499 6d5a 65ff 2db0 dc73 8019 8384  ....mZe.-..s....
  00093020: 22fc 1117 4c43 87d4 f895 c8af 23f8 30cd  "...LC......#.0.
  00093030: d793 e1c8 7495 c50c e16e 51cd ab44 a84d  ....t....nQ..D.M
--- 37625,37636 ----
  00092f80: c6fe 502b 7916 a8c7 32d2 6006 cc70 ac47  ..P+y...2.`..p.G
  00092f90: 88ba 0393 8995 95a1 21a4 a417 67c0 5560  ........!...g.U`
  00092fa0: ab4c 48a8 ff1f 308d 85c3 8016 004d 865d  .LH...0......M.]
  00092fb0: 7586 3a41 f023 0258 eb10 7490 2a80 d113  u.:A.#.X..t.*...
! 00092fc0: 5ccb 04f4 69e9 0aef 9829 fe68 5a08 3473  \...i....).hZ.4s
! 00092fd0: 986b 4017 c51e d401 1d6f 85ef 17ed bd85  .k@......o......
! 00092fe0: 4e3a db79 2bb4 f8a9 0210 e81b af8f f5d8  N:.y+...........
! 00092ff0: 7920 0819 5629 1e24 a8f1 23c6 3857 a5fc  y ..V).$..#.8W..
  00093000: 0000 0008 0000 0372 0000 0122 0000 01e8  .......r..."....
  00093010: 0000 0499 6d5a 65ff 2db0 dc73 8019 8384  ....mZe.-..s....
  00093020: 22fc 1117 4c43 87d4 f895 c8af 23f8 30cd  "...LC......#.0.
  00093030: d793 e1c8 7495 c50c e16e 51cd ab44 a84d  ....t....nQ..D.M
keirf commented 6 years ago

Wouldn't particularly expect the corruption to be deterministic, this sort of bug often isn't. You haven't said whether you've tried other USB sticks or Gotek hardware, yet?

jamarju commented 6 years ago

I might have solved this problem and it was NOT related to FF.

I reformatted the Amiga hard drive (actually a 2 GB CF) and installed WB 2.1 but this time I used scsi.device 43.45 + PFS3 filesystem instead of the default scsi.device + FFS. I'm using Kickstart 37.300 which is known to cause problems with HD sizes >40 MB so it might make sense that corruption was actually originated at the HD reads.

I'm cautious because bad HD reads don't explain ADF FFS metadata corruption very well (CRC errores reading back from the ADF). But so far I've tested writing to 8 ADF images on both a Sandisk Cruzer Edge 16 GB (4-5 MB/s write speed) and a Kingston Datatraveler G4 16 GB (~16 MB/s write speed) and they all worked fine whereas I would get 100% chance of corruption with my previous config.

keirf commented 6 years ago

I wonder if there are any viruses can cause that sort of behaviour?

The definitive test would be to revert your setup, reproduce the errors on FlashFloppy, then test same configuration with a real floppy drive. But that sounds like quite a pain!

Do you want to test some more before closing this ticket down?

tomse commented 6 years ago

You did of course set maxtransfer to 0x1fe00 otherwise this will lead to read corruption on flash cards

jamarju commented 6 years ago

@tomse Yes, I used maxtransfer=0x1fe00 in both scenarios. I think switching from kickstart's scsi.device to 43.45 was the key.

@keirf Well there were two more reports but you can count mine as a no-bug. The FFS metadata corruption is plausible considering that there is no memory protection and a misbehaved scsi.device could affect anywhere in RAM.

jamarju commented 6 years ago

Also, I'm attaching the FF binary up to commit https://github.com/keirf/FlashFloppy/commit/09e5b8573fd9ae22c49cb7041d2685b14c9b0d96 with debug enabled, serial bitrate = 230400 and USB IRQ priority = 5 to help people test without going through installing the ARM toolchain and all.

FF_Gotek-v0.9.5a.upd.zip

keirf commented 6 years ago

bstrobel seems to have gone quiet. I will probably close this ticket down next week if there are no further reports.

bstrobel commented 6 years ago

Quiet yes, but I have not lost interest in this topic. Anyway, I had some problems with my Amiga that I needed to solve first. That's why I couldn't investigate further. The connector of the PSU was corroded and causing unstable power supply. It took me some time to find this out. The Amiga seems to be stable now. I will test tomorrow. I think the PSU problem could also be a possible cause of the corrupted files. I'll let you know.

bstrobel commented 6 years ago

No good news. Each file is corrupt it seems.

Test procedure:

Result:

The 5 ADF files and screenshots showing the comparison results can be found here: https://www.dropbox.com/sh/bm18ky3s1q3x59z/AAAAmZNdZPAtr1OX0iadtDuia?dl=0

bstrobel commented 6 years ago

I just noticed that I can easily attach files and pictures to these messages. So for convenience here are the screenshots:

How I copied the files on the Amiga: amiga screen

WinMerge displaying the differences: winmerge_test1_test2 winmerge_test1_test3 winmerge_test2_test3

My hex editor showing one occurrence of a difference. Note the strange data pattern there... hex_test1_test2

Zip containing the 5 ADF files that should be identical: GotekProb.zip

bstrobel commented 6 years ago

One more thing I forgot to mention. I used AmigaDOS diskcopy in verify mode. So it verifies each track after it copied it to the destination disk. This verification always succeeds. It never shows an error. I wonder how this can be. Does your software cache the data somehow? I would expect that it always reads the data from the file from the stick if it is asked to... If this is the case it would more point to a problem on the Amiga, but how can this be?

bstrobel commented 6 years ago

And finally for today the debug log. I have built the version 0.9.5a with debug enabled and baudrate set to 115200. I have renamed the version to 0.9.5a-d and used the .upd file to update the firmware of the Gotek.

I think your right that the USB stick might be too slow, because I see this in the log: Write start 50663 us Write 118/0... 3782 us Write 118/1... 3994 us Write 118/2... 3507 us Write 118/3... 3526 us Write 118/4... 3497 us Write 118/5... 4008 us Write 118/6... 3555 us Write 118/7... 3973 us Write 118/8... 3537 us Write 118/9... 3998 us Write 118/10... *** Missed write 322261 us Write start 69602 us Write 120/0... 3927 us Write 120/1... 3603 us Write 120/2... 3631 us Write 120/3... 3614 us Write 120/4... 3566 us Write 120/5... 4028 us Write 120/6... 3619 us Write 120/7... 4048 us Write 120/8... 3568 us Write 120/9... 3453 us Write 120/10... 3498 us

Here is the full log. It covers a whole disk copy: debug_log.txt

I have only 3 USB sticks (don't need them so often any more these days). They are all of the same type: "Kingston Data Travel 2.0 16GByte" USB 2.0 Sticks. They work well in my oscillograph, for OS installs on PCs etc.

keirf commented 6 years ago

Stick was too slow, try another. I will improve performance with slower sticks in a future release. If you can find one these days, I like Lexar. The reason slow sticks are a problem for disk copiers (except ones that do write+verify track by track) is that there is no flow control: disk copier moves immediately to next track and starts writing to the target, while USB stick is still writing previous track, and while that's happening FlashFloppy can't buffer that next track meanwhile (I will fix that in future).

And your Diskcopy verify works because it only verifies that it sees valid amigados track data, it doesn't check that the track data is correct (ie the result of the diskcopy)!

bstrobel commented 6 years ago

Yes, I think your right. But... As I wrote above, the disk copy software almost always reads back immediately what it has written to verify the success (not only in the Amiga). I think it is a bug that the disc copy software seems to be happy but the data has actually not been written to the stick.

The only explanation for this to me is that FlashFloppy returns the just written data from its internal memory and does not actually read it from the stick.

To illustrate the severity just look at my use case. These sticks where not the cheapest and are not very old. They are not fast but sufficient for occasional data transfer. (And Kingston is even something like a "brand".) So I as a user would assume they work fine. How can I know that they are too slow for FlashFloppy? How can I be sure that the data has been successfully written even if I use different sticks? There needs to be consistent behavior in this case and definitely a notification to the user.

Not sure how to fix this. Didn't look at your code. But I think if it is not easily achievable to always read the data from the stick I would at least display an error message on the display or blink the LEDs or something. I'd suggest you reopen this case...

bstrobel commented 6 years ago

Oh sorry, I just noticed that you already answered that diskcopy thing. Yes, you're right they most probably dont compare the read back data to the data they wrote. Didnt think of this. Anyway, you need to notify the user.

keirf commented 6 years ago

For example try with XCopy with verify enabled. You will find your copies are reliable.

Regarding "notify the user" you mean indicate via display when writes are dropped?

bstrobel commented 6 years ago

Yes, exactly. For Amiga I would even "eject" the floppy image. That would notify the user of the Amiga that the copy has failed.

You know, I copied about 20 floppies to images in a row only to find out afterwards that I cannot trust a single one of the image copies.

keirf commented 6 years ago

Well AmigaDOS diskcopy is a bit crap to be fair.

Still... it should at least be a config option to error on missed write. I might add: missed-write = error | warn | ignore

And make either warn or error the default...

bstrobel commented 6 years ago

Yes, I agree about Amiga diskcopy. :) I guess I had to learn it the hard way since I have the Amiga for a few month only yet.

Aside from that I wonder if other applications are always better than the Amiga. For instance I just can't imagine that a keyboard/synthesizer from the 80's or 90's has a better copy program (my Gotek was originally shipped and packaged for fitting into a synthesizer).

I still think ejecting the image - which looks like ejecting the floppy to the device - is the best way to show something is wrong. The image is unusable anyway. On the OLED you could just display something like "Stick too slow, image ejected". On the 7 seg display "SL" for "too SLow" maybe?

jamarju commented 6 years ago

@keirf Are you thinking of double-buffering the track? I can try to implement that and send a PR over for you to review. No guarantees at all since I'm probably overoptimistic about my understanding of the code.

keirf commented 6 years ago

@jamarju if you can decipher my code and implement that then you are a coding ninja :)

jamarju commented 6 years ago

I'm thinking of setting dma_wr->state = DMA_inactive before calling F_Sync. By the time F_Sync is called the flux ring should have been consumed fully and be prepared to receive a new track while F_Sync happens in parallel. That's assuming F_Sync is the bottleneck and not the various F_writes inside the ADF writing handler.

bstrobel commented 6 years ago

Hi again! So I ordered new USB sticks from Amazon: SanDisk Ultra Flair 16GB USB-Flash-Laufwerk USB 3.0 mit bis zu 130 MB/Sek.

I also used X-Copy 6.0.1 for copying, but: x-copy-error

Here is the debug log: new_stick_debug_log.txt

Anyway, I also did some tests and this is what I noticed:

To sum up I doubt the "slow USB stick" theory seriously. I think there are some other things wrong. What do you think?

keirf commented 6 years ago

@bstrobel looking at the log you are certainly correct there is something else up. Actually X-Copy glitches the WGATE signal at the start of track writes for some reason. I added a workaround for that back in August but I think I broke that workaround not long after!

Please can you try the patch on branch wgate-glitch-filter (i.e., "git checkout wgate-glitch-filter")? Hopefully it will work better. Really you should not see missed writes in X-Copy Verify mode. Let's drill into this further if you still do. You may for example need to put a logic analyser on the floppy interface (especially the WGATE line).

bstrobel commented 6 years ago

Thanks @keirf! I tested the wgate-glitch-filter branch. Same result. At least as far as I can tell from the log. Here it is: wgate-glitch-filter-branch_debug_log.txt

I noticed one strange thing. Where the log stops there is no "missed write" message but actually X-Copy is reporting "Verify Fehler". At lot of "missed write" errors were actually logged before when X-Copy didn't report an error. So it seems "missed write" messages and actual verify errors in X-Copy are not (directly) connected.

I'm happy to invest some time into this. No problem! Starting today I'm 3 days at home with lots of time for this. I will connect the logic analyser to the Gotek and try to make some sense of the signals. Do you have good link where I can find diagrams for the expected data flow?

keirf commented 6 years ago

@bstrobel I would expect to see approx 400ms or so spent on each track during write. 200ms with WGATE asserted and activity on WDATA. Then 200ms with RDATA active as the Amiga reads the track back to verify..

The "missed write messages" seem to occur at the end of your track writes. Perhaps on your system X-Copy is glitching the WGATE line at the end of a track write? A logic analyser trace would determine what is going on there.

The log shows track 54 being written twice. That's weird. It either means X-Copy deliberately made two attempts on that track (unlikely you would get a verify fail on two consecutive writes though, would expect fails to be randomly distributed across tracks). Or perhaps a head-step signal was missed. Worth getting a logic analyser trace for the STEP and SIDE signals too.

You don't have a speaker mod on the STEP line I assume?

I am around tomorrow, doing some other jobs including an A2000 repair, but will have time to advise you if you want to communicate on this ticket or via email (my email is in the Git repo commits). I may also have time to try X-Copy myself and see if I can repro the issue.

bstrobel commented 6 years ago

It appears I will not have as much time as I thought today. Anyway I'm going to connect the LA and set everything up now until I have to leave for some other stuff. I found the Amiga Service Manual provides all information that is needed. I downloaded it from here http://amiga-manuals.xiik.net/amiga.php.

I use the Gotek as an external floppy drive. The external connector and logic board is from Ebay. Not sure I can trust it fully. I have another external "real" 3.5" floppy drive as well.

I'll play around a bit and see if can find something noticeable. By end of tomorrow I should be able to post a few screenshots from the LA. I will also check the signal quality directly at the Gotek connector with my oscilloscope. Maybe there are some unwanted glitches on the signal lines. I'll keep you posted.

bstrobel commented 6 years ago

I was able to spend some time on this. I connected the LA and used X-Copy 6.01 to do a DOSCOPY+. It failed after the second sector: xcopy-la

This is what the LA shows: x-copy_la1

As you can see WGATE is active (low) for 2 times of about 200msec length with a very short break. I assume this is the first sector. That one is verified okby x-copy. For the second sector WGATE is also active for two 200msec periods but with a break of about 200msec in between. May this be the issue?

Nothing happens after what you can see in the screenshot. I triggered on DS0 because this is what the drive sees (although it is DS2 in fact.)

This is the full capture: x-copy-LA1.zip

You can download the LA software from the Saleae website and drill into it yourself. But I guess you know this already. :)

keirf commented 6 years ago

Thanks I use Saleae myself so this is useful. SIDE signal would have been more interesting to sample than, say, TRACK00. Then we can tell when it is trying to read/write same track back-to-back, vs switching side and doing other track on same cylinder (if you see what I mean).

I will also see about trying this myself. I know last time I tested X-Copy it was a different version (X-Copy (1993)(Cachet).adf). Can you provide me a link to exactly which 1992 version you tested (eg. a link on this site: http://jope.fi/xcopy/masterdisk.html). It is possible that different versions behave differently, also I may have tested DOSCOPY rather than DOSCOPY+, but not sure that should make much difference...

bstrobel commented 6 years ago

May the INDEX signal from the Gotek be the problem? Isn't supposed be indicating the beginning or end of a revolution of the disc and should be very stable with the same period? There is one missing between the first 2 writes. And measuring its period width I noticed it seems to jittering between exactly 200msec and 192msec. This seems to me a considerable difference if - as I assume - this is the reference point for de-serializing the data. x-copy-index-jitter

bstrobel commented 6 years ago

I will do another trace with SIDE signal instead of TRACK00. I thought with TRACK00 we could reference with the STEP and DIRECTION signal which track is actually written. But we are stuck on track 0 already, so it doesn't make sense.

bstrobel commented 6 years ago

Here is the sample: x-copy-LA2 - SIDE signal.zip

Regarding the X-Copy version, I use a disk that I got with my Amiga. When I boot from it I get this screen: xcopy-version

Then I press F6 for version 6.01.

keirf commented 6 years ago

Discussed this with @bstrobel offline and it seems there are possibly two issues:

  1. X-Copy issues writes back-to-back for the two sides of each cylinder, then does the verifies back-to-back. i.e., side-0 write, side-1 write, side-0 verify, side-1 verify, step to next cylinder and repeat. The time between the back-to-back writes is approx 3ms so unless you have a reliably very fast stick for 512-byte writes you will generally miss the second write for each cylinder. The only sticks I know are fast enough in my own testing are some Lexar models (Lexar S45/S75). The solution sadly is to implement multi-buffered writes. It's a chunk of work. I will raise a separate issue to track it.

  2. We observed the INDEX pulse inexplicably occurs some milliseconds early after the first write to each cylinder. That is, the first INDEX pulse that occurs after FF reverts to read mode (because it missed the start of the second write) is earlier than expected, with no serial log message to warn of this. It is probably not related at all to the ADF write corruptions, but I will try to reproduce this and see if there are bugs in my INDEX timing management. I may raise an issue for this also. [EDIT: I realise now that this behaviour is expected, as INDEX gets resynced after a write. Furthermore, to fix #44 I am probably going to even more deliberately break INDEX sync across writes.]

keirf commented 6 years ago

FYI I will work on pipelined/buffered writes after I release v0.9.7a next week. With luck it will be in v0.9.8a.

bstrobel commented 6 years ago

Thanks Keir! Great news!

2018-01-28 11:36 GMT+01:00 Keir Fraser notifications@github.com:

FYI I will work on pipelined/buffered writes after I release v0.9.7a next week. With luck it will be in v0.9.8a.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/keirf/FlashFloppy/issues/32#issuecomment-361053129, or mute the thread https://github.com/notifications/unsubscribe-auth/AFX9MBHkcCJwsXDyI-TOBAMMfDhJVE8Kks5tPE21gaJpZM4Q8eXI .

keirf commented 6 years ago

Please can you try the following patched firmware. Ignore the version number (it is v0.9.7a with patches on top). Pipelines writes. Should fix your issues.

Alternatively you can build tip of current master (this contains the same patches).

ff_32_1.zip

bstrobel commented 6 years ago

Thanks Keirf! I tried your ZIP. Unfortunately it doesn't work. It always fails on the track 0, side 1: pipeline-0 9 7_verify_error_doscopy

I've also run the LA once with DOSCOPY and once with DOSCOPY+. Both recordings are in this zip file: pipeline-0.9.7_LA1.zip