Proof of Concept for Snapshot saving and loading.

acf commented 3 years ago

Fixes #8 (sort of) Some notes.

1/ No tests, this is very much hacked together just to see if I could make it work. I used Amiga types for this code rather than Amiga-Smalltalk types, to distinguish "inside the VM" from "straight amiga code". made sense in my head when I did it even if I can't quite articulate now. So look for UWORDS and LONGS and BYTEs instead of Words and Bytes.

2/ Its pretty much all in the one file ObjectMemory_snapshot (.c and .h) apart from the addition of a hasObject method in ObjectMemory_storage

3/ I somehow accidentally enabled C99 on the compiler and by the time I discovered it, it was too much of a pain to unpick, just for the sake of this little POC. so I adjusted the Makefile.vbcc to have a -c99 just for snapshot.

4/ The ObjectMemory_snapshot file contains a main which simple loaded SYS:image.st and the. saves it again. I included the ST80 VMImage as image.st in the commit. copy this to SYS: or change the main to load elsewhere.

5/ load is pretty quick, relatively speaking. Save is taking 2 hours and counting. :-( Some part of this is the printf logging, but without that you're just staring blankly for 30-40 minutes wondering what's going on. It could be that we need to speed up save by building a big memory buffer and writing in one go. writing a WORD at a time unbuffered to disk is kind of dumb :-). Might also be worth changing main to just load and skip the save before you run it the first time.

6/ I have done very little testing of what is loaded. The objects an object table all seem to make sense but I guess some of the bigger ones (64K words in some cases) are compiled methods... I'm cargo-culting at that point.

I'm keen to see if its useful to you. Its obviously not in shape to be distributed very far given the lack of tests etc, but I was pleased to load that image file successfully :-)

acf commented 3 years ago

sob something is clearly wrong as many of these objects are listed as having 64K words and the image which went in was 516K and I just stopped the save with the image out being 150Mb. I have screwed something up between being delighted it was working and now :-/

Have a look anyway, but while load seems to work fine, I think its the same problem I had with the basic first-52, somehow reading an object is screwing with the previous object. sigh

acf commented 3 years ago

changed the main to just load and not save... maybe you can tell what's happening there.

iamleeg commented 3 years ago

nice! I'll have a look…at some point, though the laptop I was developing Amiga-Smalltalk on died last year and I bought an M1 instead of getting it fixed so need to set up an environment again :(. Some initial observations:

I somehow accidentally enabled C99 on the compiler

I didn't even know that was a thing! Leave it on, please!

It could be that we need to speed up save by building a big memory buffer and writing in one go.

What I would have thought would work is pausing the VM, and writing out the heap segments wholesale (and probably the registers, so we know where to start from on loading the image). Is that insufficient?

many of these objects are listed as having 64K words

Sounds like something is being subtracted from something else and wrapping round: all of the "WORDS" are 16-bit so 65535==-1.

acf commented 3 years ago

its just a -c99 flag on the compiler, its on for snapshot, can also turn it on for ast_tests :-)
not directly, the way the snapshotting works is the written object table has the segment/location replaced with essentially a WORD index into the disk image and then then when its loaded up again that index is replaced with the new segment/location. If we just dump out the segments as is we aren't able to write an object table that can find them again, especially as the objects may not be contiguous in the segment if its not recently been compacted. So we do need to go object by object, counting up the sizes written, but probably not word-by-word. If an objects words are always contiguous real memory, we can write the whole object in a single write. If not, we can AllocVec a real memory buffer and then write that.
yup. I will find time for more digging. Its one of these things where it all loaded successfully without error and the right number of objects, it was only when I tried to save to back out again that I spotted some of the data was garbled.

acf commented 3 years ago

its 65534, interestingly.

iamleeg commented 3 years ago

65534 is -2, and 2 is the length of the object header, so make sure that sizes are balanced with sizes and word length with word length: the word length of an object is (size - 2). Objects should always be contiguous in memory because the allocator finds a fragment big enough to hold the entire object on allocation, and fails if it can't do that.

acf commented 3 years ago

I was distracted for a while by my little git thing which can now init, add files and commit (all somewhat clumsily). I shall pause there and turn my attention back to smalltalk to see if I can fix this. I'll then get back to do remove and status :-). I'm saving pushing and pulling for a rainy day.

acf commented 3 years ago

Everything is fine up till OOP 242 and then something goes very wrong on load :-). All the sizes and lengths go nuts which suggests I'm somehow reading the wrong bytes from the disk.

I guess more caveman debugging required :-)

acf commented 3 years ago

OK, stopping now for a bit. Here's what I see....

The first bit is debug out from loading the objecttable. This (as you'll recall) has the actual word address on disk encoded in the segment and location fields. Here we see OOPs from 232 to 248, all looks good.

The second bit is where we load the objects. We iterator through the object table, and do

UWORD objectImageWordAddress = (ObjectMemory_segmentBitsOf(iterator) << 16) + ObjectMemory_locationBitsOf(iterator);
UWORD objectImageByteAddress = objectImageWordAddress * 2;

to build up the absolute byte address to seek to. ie we use the previously loaded segment/location.

What you can see here is that in-between loading the OT and loading the objects, the location values for 244, 246 and 248 are all corrupted. So something is writing over them accidentally. I saw this before when I tried with the raw first-fifty and hoped loading "real objects" might work better, but no. SO... I'm fairly sure there's a bug somewhere that loading sone field is overwriting another field.

Sadly my command line debugging skills are essentially non existent so its going to take a while.

LOADED OT FOR OBJECT 232  [0580 - 065c]
SEGMENT BITS 0000
LOCATION BITS 065c
LOADED OT FOR OBJECT 234  [2580 - 066d]
SEGMENT BITS 0000
LOCATION BITS 066d
LOADED OT FOR OBJECT 236  [1180 - 0675]
SEGMENT BITS 0000
LOCATION BITS 0675
LOADED OT FOR OBJECT 238  [0a80 - 067c]
SEGMENT BITS 0000
LOCATION BITS 067c
LOADED OT FOR OBJECT 240  [3400 - 0681]
SEGMENT BITS 0000
LOCATION BITS 0681
LOADED OT FOR OBJECT 242  [0480 - 0686]
SEGMENT BITS 0000
LOCATION BITS 0686
LOADED OT FOR OBJECT 244  [0380 - 0695]
SEGMENT BITS 0000
LOCATION BITS 0695
LOADED OT FOR OBJECT 246  [0480 - 06a5]
SEGMENT BITS 0000
LOCATION BITS 06a5
LOADED OT FOR OBJECT 248  [0480 - 06ac]
SEGMENT BITS 0000
LOCATION BITS 06ac

OBJECT 232 WORD ADDRESS 1628  BYTE ADDRESS 3256]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 065c
OBJECT 234 WORD ADDRESS 1645  BYTE ADDRESS 3290]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 066d
OBJECT 236 WORD ADDRESS 1653  BYTE ADDRESS 3306]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 0675
OBJECT 238 WORD ADDRESS 1660  BYTE ADDRESS 3320]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 067c
OBJECT 240 WORD ADDRESS 1665  BYTE ADDRESS 3330]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 0681
OBJECT 242 WORD ADDRESS 1670  BYTE ADDRESS 3340]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 0686
OBJECT 244 WORD ADDRESS 15  BYTE ADDRESS 30]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 000f
OBJECT 246 WORD ADDRESS 7230  BYTE ADDRESS 14460]
 -- SEGMENT BITS 0000
 -- LOCATION BITS 1c3e
OBJECT 248 WORD ADDRESS 16385  BYTE ADDRESS 32770]
 -- SEGMENT BITS 0004
 -- LOCATION BITS 4001

iamleeg commented 3 years ago

Sadly my command line debugging skills are essentially non existent so its going to take a while.

Pro tip for this kind of bug: on a platform that actually has watch points (e.g. x86_64), use watch points. Set a watch point on the memory address of the first OOP that gets corrupted, and see what's changing it.

acf commented 3 years ago

a lot of printf later, putting size for OOP 12 results in a change on the location bits for 244 :-)

acf commented 3 years ago

Yeah confirmed by more printfs in the segment_word_put.

WE ARE WRITING LOCATION OF 244 to 0, 245

WE ARE WRITING SIZE OF 12 to 0, 245

so some math somewhere is off.

I'll strip out all the caveman debugging and start figuring out how I calculate where in RealWordMemory to store an object and why OOP 12 thinks it starts at position 245 when that's clearly in the OT for 244

acf commented 3 years ago

Found the problem.

https://github.com/iamleeg/Amiga-Smalltalk/blob/master/ObjectMemory_Constants.h#L7

vs

https://github.com/dbanay/Smalltalk/blob/f647e589cfa84551899b61b48a638dae2d4e77d9/src/objmemory.h#L777

The dbanay code stores the OT in segment 15 and you store yours in segment 0. When I load the objects in using his code, it trashes the ObjectTable.

The blue book doesnt seem to specify so I can either:

leave ObjectTableSegment at 0, change FirstHeapSegment to 1. [OT][a l l t h e. h e a p]

or

change ObjectTableSegment to HeapSegmentCount-1 and change LastHeapSegment to HeapSegmentCount-2. [a l l t h e. h e a p] [OT]

Any preference ?

acf commented 3 years ago

I changed FirstHeapSegment to 1 and LOAD all works a charm. until we run out of memory around 5000 objects in. I pushed that as a default for now and then will go look and check how our memory layout matches the dbanay version (which loads the whole image fine)

iamleeg commented 3 years ago

Ah OK, the blue book isn't clear on that, but I can understand that the first/last heap segment shouldn't actually include the OT. I have a very weak preference for doing the same as dbanay just to avoid other surprises, but if you've got it working the other way around it doesn't matter enough to change it.

acf commented 3 years ago

I can change It to match his in. couple of lines of code so happy to try that. Also his ObjectTableSize is FFFF instead of 7FFF and it seems we need it for this image. Blue book suggests 64K for the table allowing 32K objects, which we need for this image.

acf commented 3 years ago

Notes for myself. On the way now. Can load and save the image and its the same size and layout, all well. On doing a DIFF with Hexffiend, however (cause what I write should be identical to what I read) there's a weirdness. Object 684, which is a free object with a defined seg/location of 0/0 and size of 32 when I read it in, ends up with a size of 32832 after this:

https://github.com/iamleeg/Amiga-Smalltalk/blob/acf/snapshots/ObjectMemory_Snapshot.c#L257-L261

Once that's wrong, all the location words in the OT after that are wrong... all the actual object content is correct.

on further caveman debugging, that code basically sets the location bits for 684 from 0 to the previous head (in this case 2), and those same location bits are then later used to get the size by way of ObjectMemory_heapChunkOf_word(684, 0).

So a change in location, could always result a change in size, especially since its not a real location.

In essence it doesnt matter what goes in that location since I never actually write a free object, but there is cumulative maths where I used the size of each object to determine what the location is going to be for the next object and the next, so I need to look again at how that is preserved in the dbanay code.

https://github.com/iamleeg/Amiga-Smalltalk/blob/acf/snapshots/ObjectMemory_Snapshot.c#L182-L194

that code writes the correct thing for the free pointer itself, but I suspect something has gone awry with calculating the next location when I extracted a function or two vs the dbanay code

Thats as far as I've gotten last night and today, so notes for myself for when I pick it back up again. The mechanism for the free pointer list seems sensible, so look at dbanay and discover how the size of a free object is calculated and if its different from how we do it.

Update, yup.

https://github.com/iamleeg/Amiga-Smalltalk/blob/acf/snapshots/ObjectMemory_Snapshot.c#L164. I do this for every object and the bogus size bits are throwing it out.

he does it only in a non-free pointer https://github.com/dbanay/Smalltalk/blob/6a7619cba3d7a3bd3d9807e429bf00d6c6886061/src/objmemory.cpp#L335

FACEPALM #3

acf commented 3 years ago

Clearly typing it into a gh comment rubber duck was the answer. All done. Works. Yay.

Its mergeable as is, though I need to update the other Makefiles as I only worked on VBCC. Its all untested of course, but that's why its all in a completely separate file with its own main..

iamleeg commented 3 years ago

Nice work!

iamleeg commented 3 years ago

I'd consider a round-trip to be an e2e test, btw ;)

acf commented 3 years ago

:-) Sure, but I don't think we want to add that to the CI build as it takes quite a long time. In other news, it suddenly occurs to me that its all predicated on a BigEndian machine so works fine on an actual (or emulated) classic amiga. I don't know what your AROS CI is running on and that's another reason why its off to one side in a file of its own. :-)

I chatted with Dan Banay and he byteswapped the image file so the "standard" code kept working. That would make it hard to use the same image across 68K an intel...

iamleeg commented 3 years ago

Ah, I hadn't thought of that. Should RealWordMemory use htons and ntohs uniformly?

acf commented 3 years ago

Well My own personal take is that AmigaSmalltalk should run on an amiga and that's BE which matches the Alto and the ST80 image so that's all good, and selfishly, all I care about :-P. I suppose we'd maybe have a problem if we tried to save an image on intel and then load it on 68K. There is, however, a couple of interchange format words in the image spec. 0x0000 is the standard format the image came in and I got working. We could ensure an intel machine writes out something else so we can tell the difference when we load ? The code that turns an on-disk location into OM+location might be fiddly.

https://github.com/iamleeg/Amiga-Smalltalk/blob/master/ObjectMemory_Snapshot.c#L194-L195

acf commented 3 years ago

https://github.com/dbanay/Smalltalk/issues/9#issuecomment-750512430

iamleeg / Amiga-Smalltalk

Proof of Concept for Snapshot saving and loading. #25