Address-to-line mappings should give a memory address instead of a ROM offset

SolraBizna commented 3 years ago

Okay, I know I said it looked great, but here's ooooooooooone more of these... sorry!

The second field of an address-to-line mapping currently gives an offset from the beginning of the ROM. This is quite awkward for me to process, since in my target system the relationship between intended memory address and ROM offset can be... complicated. I may just not be thinking clearly about it, but I think in one of my actual cases it's even ambiguous*. In addition, WLA appears to truncate the offset to 16 bits, so even if I do manage to parse the address correctly, ROMs larger than 64K will have problems. Labels, on the other hand, are currently given as memory addresses, which carry no ambiguity and make it easy for my debugger. It would be easier for me if both were given as memory addresses, not to mention appealing to my sense of consistency.

Would this be complicated to set up? For that matter, are there existing applications that this would break? How long has the address-to-line mapping feature even been here? (I only found it this weekend, minutes after wishing it existed, so I've been choosing to assume this feature sprang into being as a direct result of my wish, rather than the boring explanation that it's been there for a while and I'm just not up to date on new features.)

*Specifically, some cartridges use something called "wide range hack" where the slot number affects which ROM chip is selected, and that chip is hooked up to the A pins in a different way. It's a long story... but if each .sym file consistently gives the memory addresses instead of ROM offsets, the debugger doesn't need to know about the "wide range hack" or any other screwy wiring of the cartridge in order to get it right.

vhelin commented 3 years ago

The first field is bank (+ base)?

I'm not sure if symbol file writes has the address information when writing addr-to-line mapping, but I'll check it out today evening.

I don't remember when this feature was added as I think someone else than me added it. :) But we can change the WLA symbol file format as this is now going to be v10.0 release, a new major number, and thus we can change some things a little.

SolraBizna commented 3 years ago

I was calling each number a separate field, I'll call them numbers instead to avoid confusion. My testing indicates that the second number actually is the offset from the start of the ROM, not from the start of the bank (which might have been originally intended?).

It looks like @CypherSignal added this feature originally. If they added it, they probably needed it, and if they need it, they might be annoyed if it changes now. (Thoughts, @CypherSignal?) I'd guess it was probably mainly intended to target the SNES, whose ROM address mappings are pretty straightforward...

But we can change the WLA symbol file format as this is now going to be v10.0 release, a new major number, and thus we can change some things a little.

That's exciting. WLA-DX is growing up before our eyes...

vhelin commented 3 years ago

I was able to add more data to [addr-to-line mapping]:

[addr-to-line mapping] 00000000 00:0000:0000 0002:0002:00000003 00000002 00:0002:0002 0001:0002:00000003 00000004 00:0004:0004 0001:0003:00000003 00000006 00:0006:0006 0002:0001:00000017 00000008 00:0008:0008 0001:0001:00000017 0000000a 00:000a:000a 0002:0003:00000003 00002000 01:0000:a000 0003:0001:00000017 0001e000 0f:0000:a000 0003:0001:0000001f 0001fff0 0f:1ff0:fff0 0003:0001:00000027

First is the full address in ROM, next bank (+ base), next offset in ROM bank, next memory address, and then object file ID, source file ID (in the object file), and line number.

I hope they all work, I had some trouble understanding my old code that calculated the addresses... Perhaps I should rename some structure fields in WLALINK...

vhelin commented 3 years ago

Sorry, tweaked it a bit, the new format is

[addr-to-line mapping] 00000000 00:0000 0000 0002:0002:00000003

(separated memory address from ROM bank and offset)... That should be it, if you think it's ok.

SolraBizna commented 3 years ago

Well, that should cover everybody's needs. And any existing application that depended on the old format only needs a small change to detect the new format and still extract the same information. It looks okay to me.

And now I promise not to file any new issues asking for changes in the symbol file format for at least another 12 hours. ;)

Edit: I did test it and all the values appeared correct on my end. I forgot to mention that.

CypherSignal commented 3 years ago

Sorry, I know you already closed this, but you asked for my opinion and I would like to give it:

This is a, major, breaking, non-backwards compatible, change to an already-published file format/specification. An existing parser of [addr-to-line mapping] data will completely fail to parse new files generated from this version forward, and there's no adequate way to determine which way the parser should go ahead-of-time. Also, it is inconsistent with, say, an implementation of it in asar that I introduced there, in their existing wla-symbol generator, as well (see: https://github.com/RPGHacker/asar/pull/125 ).

If you want extra definitions of RAM address to lines, that should be a separate section that wla-symbol parsers can choose to ignore or parse, as needed (as they are currently able to do if they are not aware of, say, the [addr-to-line mapping] section and others). If the existing section is insufficient due to wla-dx now targeting other machines/architectures that don't use 8+16-bit addressing, then perhaps a new section definition will have to be created, as well.

Note that all of this applies to the recent commit for object-file indices, as well, as that will also break existing parsers of the [source files] section, and also there's no way to clearly disambiguate what "version" of the section should be considered or not: https://github.com/vhelin/wla-dx/commit/8a3a9dd723b87e2b4a47dd0686cf24fe52ea0687

SolraBizna commented 3 years ago

WLA-DX's usual policy is not to worry about breaking changes, but you raise a good point that the symbol file is a public format, not something that can be fixed by doing make clean all to rebuild everything with the same version of WLA.

The original way the addr-to-line mappings were written was partly broken unless your banks happened to be 64K in size. The parsing application must mask by the bank size. What do you think we should do about that? Since you added the feature, you can speak to the original intent, and you're clearly aware of existing usages of this.

Unfortunately, at least one existing parser will still break if the existing format is kept and extra data added "on the right"... all of my own symbol file parsing is done by regexes with a $ on the end.

... there's no adequate way to determine which way the parser should go ahead-of-time.

With a regex-based parser like the one advocated in the documentation, it's a simple matter of trying both the old and the new regex, and whichever one matches is used. This would even work in the nightmare scenario of two versions of the symbol format being mixed into one file. Hand-rolled parsers can use a similar approach. Am I missing something here? Neither of the recent changes to the format change the meaning without also changing the structure.

vhelin commented 3 years ago

Note that all of this applies to the recent commit for object-file indices, as well, as that will also break existing parsers of the [source files] section, and also there's no way to clearly disambiguate what "version" of the section should be considered or not: 8a3a9dd

The old code didn't create unique IDs so it had to be fixed as it was broken. The fix changed the format. That's ok by me as we are now moving to v10.0 from v9.12 - major number is changing and we can thus break some old things to make them better than before.

I'm against breaking compability in general with old projects, but I don't see much harm in this case. File formats change sometimes. Updating file parsers that parse these symbol files is a big issue? Perhaps we should introduce version number in the symbol files like we have in object and library files?

SolraBizna commented 3 years ago

If a version number is ever going to be added, now is the best time to add it. As far as I know, this is the first time a change is being made to the symbol format that breaks existing parsers. This time, determining which version to use is decidable with a little pain. Next time it'll probably be harder. And the sooner a version number is added, the more consumers will use it.

Edit one second after posting: Aha. I see we have one now. Well then.

CypherSignal commented 3 years ago

What do you think we should do about that? Since you added the feature, you can speak to the original intent, and you're clearly aware of existing usages of this.

Well, the intent as mentioned was for dealing with machines that with use 8-bit bank address and 16-bit bank offsets. The primary use case was for 65816 and 6502 machines, and what I recall was that the other machines that wla-dx could emit assembly for had similar constraints, so there wasn't a problem there: either the machine could only address 16-bits of data, or it was 16-bits plus 8-bit banks. The original name of the section may have been implicitly restrictive in retrospect, since it doesn't specifically say "addr-to-line mapping-24b" but for the scenarios being considered it was sufficient at the time.

With a regex-based parser like the one advocated in the documentation, it's a simple matter of trying both the old and the new regex, and whichever one matches is used. This would even work in the nightmare scenario of two versions of the symbol format being mixed into one file.

Then the documentation should reflect that, then, as a minimum: "There could be two formats of data in this section: format A and format B; parse a line, decide which fits, and then continue for the rest of the section". Even with that, there are two problems:

1) You're still breaking parsers that are expecting only 1 of those formats without awareness of what the other could be. 2) It's also just awkward to write a parser for, because every other section knows exactly what data it has to look for, and can parse things blindly, without inspecting the data itself. This, will.

My point is, you're talking about different data, to solve a different problem, and the file format already has a solution for that that doesn't break anything else: (well, shouldn't, as per the spec):

Make a new section name and definition.

That action is not that big of a deal, but changing the definition of an existing section is.

CypherSignal commented 3 years ago

Also, yes, as mentioned, this kind of file-spec breakage is not something that can be fixed by doing make clean because it not only affects parsers, but also generators of those files. Those other programs (which, admittedly, may just be asar, and maybe that's easy to fix up) are now generating files that don't match the published spec. What should a parser work with? Should something like asar just say "Oh we don't support generation of wla-dx symbols anymore, we have our own fork of the file format"?

vhelin commented 3 years ago

Uh, pressed the wrong button...

Those other programs (which, admittedly, may just be asar, and maybe that's easy to fix up) are now generating files that don't match the published spec.

I've updated doc/symbols.rst so the published spec is up to date.

SolraBizna commented 3 years ago

Well, the intent as mentioned was for dealing with machines that with use 8-bit bank address and 16-bit bank offsets. The primary use case was for 65816 and 6502 machines, and what I recall was that the other machines that wla-dx could emit assembly for had similar constraints, so there wasn't a problem there: either the machine could only address 16-bits of data, or it was 16-bits plus 8-bit banks. The original name of the section may have been implicitly restrictive in retrospect, since it doesn't specifically say "addr-to-line mapping-24b" but for the scenarios being considered it was sufficient at the time.

But if the banks aren't 64K, then the offset-within-bank in the old code is incorrect. What I'm asking is, how should that be fixed?

Giving an example, the project I'm working on now (and which triggered me to look into address-to-line mappings in the first place) uses 8K banks and has only two of them. A line number at the beginning of the second bank ($81, because of reasons), mapped into the second slot, gave 81:2000. $2000 is too large to strictly be a bank offset. My own needs would have been met if it gave 81:E000 (WLA using its information on bank and slot mapping to set the correct memory address). Changing it to give 81:0000 (bank and offset-in-bank) would still have left me having to plumb in some extra address decoding logic*, but it would have been correct to the intent. In either case, an implementation that masks by bank size, or one that assumed 64K banks to begin with (like with SNES HiROM), would still Just Work. Leaving aside the issue of format breakage, working within the confines of the old format, which do you think should have been done?

*And massive headaches to properly deal with the "wide range hack", but it's not like there weren't already going to be headaches there, and I don't support it yet anyway...

Make a new section name and definition.

B-but that will waste so much memory and disk space! Kilobytes and kilobytes of it! Think of the children! (I joke, but I'm old enough I still have trouble shaking this attitude.)

Something like [addr-to-line mapping v2] and [source files v2]? That would still require a separate fix for #416 if we continue generating the old [source files] section, and if we don't, then old parsers will not see source file data at all instead of choking on the new format. I personally would rather have my parser choke so I know there's a problem instead of having my address-to-line mapping code silently and mysteriously stop working on recently-built artifacts, but I could see one's preferences going the other way.

vhelin commented 3 years ago

I think this is getting too complex.

WLA DX v10.0 is currently under work, it's not been published
Some breaking changes have been made, and the major number has been increased because of this. Not many, but some.
v10.0 can still change until it's been released
WLA symbol file version is now 2 and the version number can be read from the WLA symbol file so all WLA symbol file parsers have a good chance of working from now on
doc/symbols.rst has the latest specifications of the v10.0's symbol file format. Just like the same file in WLA DX v9.12 described its symbol file format.
The new WLA symbol file format fixes two big issues/bugs that the previous file format / WLA symbol file writer had.

I don't see any issues. If you guys need more information in the WLA symbol file then now's a good time to say that. :)

PS. I think version 2 of WLA symbol file format should also support 64K banks...

CypherSignal commented 3 years ago

Something like [addr-to-line mapping v2] and [source files v2]?

Yes, exactly. Or, something along those lines if other, specific, semantics are required (such as a separate object-file section if the source file data is not relevant). In fact, I would submit that this would be far more preferable than the

[information]
version 2

section added at the top of the file now.

Which, by the way, (was writing while your post was submitted, @vhelin) assuming you want to stay on this course, the definition of that section also needs to have specific restrictions/guarantees indicating where that section is permitted to exist, so that parsers that scan the file in a purely linear fashion don't have to change versions mid-stream and so that alternative generators know where to put that section. As is, every other section can exist in any order, with all data being collated after-the-fact, but this changes that rule.

That would still require a separate fix for #416 if we continue generating the old [source files] section, and if we don't, then old parsers will not see source file data at all instead of choking on the new format. I personally would rather have my parser choke so I know there's a problem instead of having my address-to-line mapping code silently and mysteriously stop working on recently-built artifacts, but I could see one's preferences going the other way.

I'm not saying that the old sections should continue to still be generated by wla-dx, though. What I'm saying is that support for those sections -- i.e. not changing their original published format, and not breaking the existing definition -- should be upheld, so that parsers can know what to expect, just as a philosophical tenet of maintaining a public file format.

CypherSignal commented 3 years ago

One other note, if you're going to go the route of having specific 'file versions' to inform the format of other sections, it would be also be beneficial to at least have hard links back to those in the git history, or something, so that if someone goes "Oh, I only expect 'version 1'/'versionless' wla symbol files", they know exactly what to refer back to.

SolraBizna commented 3 years ago

Tacking v2 on the ends of the affected section names seems simple enough to do. If no longer emitting the old versions at all is satisfactory to you, then I'm satisfied too.

Edit to add:

PS. I think version 2 of WLA symbol file format should also support 64K banks...

The old version supported 64K banks just fine in my experience, and the new one hasn't done anything that would break that.

vhelin commented 3 years ago

I'm not removing the new [information] section as we can add useful information to that, and just by checking the version number in the future parsers can easily detect breaking changes.

vhelin commented 3 years ago

Damn I press the wrong button all the time. :D

SolraBizna commented 3 years ago

Keeping the new [information] section as well is still a good idea. I think @CypherSignal and I would both be happier if it came with documented guarantees. Maybe just the following sentence in the documentation: "[information], if present, will always the be present before any other section or data, and its first line will always be the format version."

And, having the "close with comment" and "comment" buttons so close together is bad UI design. Maybe we should complain to GitHub about it. :)

I'll leave it to @CypherSignal to close when all are satisfied with the outcome.

vhelin commented 3 years ago

Keeping the new [information] section as well is still a good idea. I think @CypherSignal and I would both be happier if it came with documented guarantees. Maybe just the following sentence in the documentation: "[information], if present, will always the be present before any other section or data, and its first line will always be the format version."

Done.

And, having the "close with comment" and "comment" buttons so close together is bad UI design. Maybe we should complain to GitHub about it. :)

Also having a couple of pints of your own 8.3% IPA doesn't help with that... :D

I'll leave it to @CypherSignal to close when all are satisfied with the outcome.

CypherSignal commented 3 years ago

As mentioned, it would still be good to have links to specific commits or tags for later reference, when needing to take different version numbers into account.

vhelin commented 3 years ago

As mentioned, it would still be good to have links to specific commits or tags for later reference, when needing to take different version numbers into account.

I have to admit that I don't know how to do that properly. All pull requests are welcome. Or information about how to do that in this case...

vhelin commented 3 years ago

I assume that this issue is now solved. I'm not going to release v10.0 yet, but waiting for some time if people report other issues. Perhaps release v10.0 in late June 2021?

vhelin / wla-dx

Address-to-line mappings should give a memory address instead of a ROM offset #417