mathics / Mathics

This repository is for archival. Please see https://github.com/Mathics3/mathics-core
https://mathics.org

Revise code to allow emitting unicode when desired instead of WL-specific codes #1075

Open rocky opened 3 years ago

rocky commented 3 years ago

Front-ends should be able to declare that Unicode is handled (as opposed to WL special characters) and formatting should respect that rather than require WL-specific fonts.

To do that, each operator, like the one for pattern, may need to specify its Unicode symbol just as it does its ASCII representation, e.g. -> for pattern, which it has now.

See also #206

GarkGarcia commented 3 years ago

Front-ends should be able to declare that Unicode is handled (as opposed to WL special characters) and formatting should respect that rather than require WL-specific fonts.

I can't imagine a scenario where WL special characters are desired (to the detriment of Unicode); shouldn't we simply replace them with Unicode altogether? Also, as noted by @rocky in other comments, this is an issue that affects multiple clients of ours, so it should be dealt with here (instead of in individual clients).

rocky commented 3 years ago

I can't imagine a scenario where WL special characters are desired

Then you don't have a good imagination :-)

mmatera commented 3 years ago

Another reason is about structure: if you want to copy the output of an evaluation and use it again as input, it is important to interpret properly whether "d" means a variable name or a symbol. It would also be important for compatibility if we want to use WL packages written in MMA. For all these reasons (and maybe several more), I think that the Mathics kernel should use these special codes, and the job of translating them to something readable/writable for a client UI should be the responsibility of the client. What we could provide in the kernel distribution are certain auxiliary routines (at the level of the library, but maybe not in the kernel itself) that translate expressions to a plain ASCII representation (for instance, with named characters, \[ ... ], or simply a text representation).
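A minimal sketch of the kind of auxiliary routine described above, one that rewrites special characters as WL named characters in plain ASCII. The table and the function name `to_named_ascii` are hypothetical and not part of Mathics; the only mapping taken from this thread is U+F800 for \[FormalA].

```python
# Hypothetical lookup table: character -> WL named-character name.
# U+F800 for \[FormalA] comes from this thread; U+03B1 is the
# standard Greek small letter alpha.
NAMED_CHARACTERS = {
    "\uf800": "FormalA",
    "\u03b1": "Alpha",
}


def to_named_ascii(text: str) -> str:
    """Replace each character found in the table with its \\[Name] form."""
    pieces = []
    for ch in text:
        name = NAMED_CHARACTERS.get(ch)
        if name is not None:
            pieces.append(f"\\[{name}]")
        else:
            # Characters without an entry are passed through unchanged.
            pieces.append(ch)
    return "".join(pieces)


print(to_named_ascii("\uf800 + \u03b1"))
```

A client that cannot render the special characters could call such a routine on kernel output before displaying it, while the kernel itself keeps using the WL codes.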

rocky commented 3 years ago

@mmatera I like the idea of having some sort of interoperability routines. To start out, what was added in the PR (and is duplicated with that additional symbol) in mathicsscript could be moved somewhere, and then imported in those two places. (@mmatera I imagine you have something equivalent or better in iwolfram which might be contributed to the common location.)

GarkGarcia commented 3 years ago

I can't imagine a scenario where WL special characters are desired

Then you don't have a good imagination :-)

  • In #206 vcat had suggested installing WL fonts, so I can imagine that this person has done this

  • I ran across this issue in mathicsscript. I think I was using the copy button from one of the WL examples and it came across that way and was accepted as input just fine. Although many terminals support Unicode, at least one I tried on GNU/Linux didn't. And I think one of the Windows terminals didn't support Unicode either.

  • It is not clear that all versions of TeX support Unicode without jiggering.

Thanks for clearing this up! And yes, terminals in Windows usually don't support Unicode.

What we could provide in the kernel distribution are certain auxiliary routines (at the level of the library, but maybe not in the kernel itself) that translate expressions to a plain ASCII representation (for instance, with named characters, \[ ... ], or simply a text representation)

This seems quite elegant indeed, I think we should go for it.

@mmatera I like the idea of having some sort of interoperability routines. To start out, what was added in the PR (and is duplicated with that additional symbol) in mathicsscript could be moved somewhere in the core, and then called from those two places. (@mmatera I imagine you have something equivalent or better in iwolfram which might be contributed to the common location.)

I agree, I'll work on a PR for this today. Is there a place on the web where I can get a list of all special characters used by Mathematica?

rocky commented 3 years ago

You can look at https://github.com/Mathics3/mathicsscript/tree/master/mathicsscript the inputrc files and termshell.py

I gave up in boredom trying to fill out the parameter letters with dots under them.

GarkGarcia commented 3 years ago

I gave up in boredom trying to fill out the parameter letters with dots under them.

Ok, so I should map α to \[Alpha] and vice versa, right?

I gave up in boredom trying to fill out the parameter letters with dots under them.

I can fill the missing ones, but how can I figure out what's their WL equivalent? For example, how did you find out that \uf81a maps to Ạ?

rocky commented 3 years ago

I can fill the missing ones, but how can I figure out what's their WL equivalent? For example, how did you find out that \uf81a maps to Ạ?

https://reference.wolfram.com/language/guide/ListingOfNamedCharacters.html has a list of characters. If you click on any one, like M with the dot under it, it shows its Unicode as F826. Enter that in your favorite tool to show Unicode and it doesn't look like an M with a dot under it.

Google for "M with dot underneath" and https://www.compart.com/en/unicode/U+1E43 is the first thing that comes up.

Ok, so I should map α to \[Alpha] and vice versa, right?

For that one in particular the Unicode value is the expected value. I believe inside the tokenizer that kind of stuff is already handled.

However, for symbols like the derivative "d" and directed arrows, WL chose "private" Unicode values, which I am told it is technically allowed to do, but these mappings aren't expected by most programs that render Unicode without some sort of tweaking. Almost all of these symbols have reasonable alternatives, such as the ones for a dot under a letter. I have found that these alternatives do get rendered correctly by programs that show Unicode.
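The "private" values mentioned above fall in Unicode's private use areas, which standard tooling can detect. A small sketch using Python's `unicodedata` module; the helper name `is_private_use` is made up for illustration, and the sample code points (U+F800 and U+1EA1) are the ones discussed in this thread.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if ch has Unicode general category 'Co' (private use)."""
    return unicodedata.category(ch) == "Co"


# U+F800 is the WL code point for \[FormalA]; U+1EA1 is the standard
# "a with dot below"; U+03B1 is the standard Greek alpha.
for ch in ("\uf800", "\u1ea1", "\u03b1"):
    print(f"U+{ord(ch):04X} private use: {is_private_use(ch)}")
```

A check like this could flag which characters in an output string need a replacement before they are sent to a front-end without WL fonts.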

GarkGarcia commented 3 years ago

I can fill the missing ones, but how can I figure out what's their WL equivalent? For example, how did you find out that \uf81a maps to Ạ?

https://reference.wolfram.com/language/guide/ListingOfNamedCharacters.html has a list of characters. If you click on any one, like M with the dot under it, it shows its Unicode as F826. Enter that in your favorite tool to show Unicode and it doesn't look like an M with a dot under it.

Google for "M with dot underneath" and https://www.compart.com/en/unicode/U+1E43 is the first thing that comes up.

Thanks! I've scraped the page to extract the information into CSV format: https://pastebin.com/u69Z49j7. There are some mistakes in there, but it's a good starting point.

rocky commented 3 years ago

1,000+ entries. That's a lot!

When I look at this there is something that seems missing. "Unicode" I think is the Unicode representation of Code-point. In some cases though, and this is what the whole issue is about, what appears in the 1st or left-most column, doesn't look like what the Plain text says it is.

In particular, for \[FormalA] and Unicode U+F800, I see an "a" with no dot under it. Instead, U+1EA1 (ạ) is a choice that most displays (such as the one you are probably looking at) will show more correctly.

GarkGarcia commented 3 years ago

1,000+ entries. That's a lot!

When I look at this there is something that seems missing. "Unicode" I think is the Unicode representation of Code-point. In some cases though, and this is what the whole issue is about, what appears in the 1st or left-most column, doesn't look like what the Plain text says it is.

In particular, for \[FormalA] and Unicode U+F800, I see an "a" with no dot under it. Instead, U+1EA1 (ạ) is a choice that most displays (such as the one you are probably looking at) will show more correctly.

Yeah, I basically extracted the information from the website you sent me, so I think the "Unicode" column is the Unicode character used by Mathematica to represent the characters from the "Plain text" column. We still need to fill in the Unicode equivalents.

GarkGarcia commented 3 years ago

I'll go over the characters I know the Unicode equivalent of. @rocky we should probably add this spreadsheet to the developer documentation or something.

GarkGarcia commented 3 years ago

Ok, filled in most of the Unicode equivalents (about 70% of them): https://pastebin.com/QBn1dK3n

A lot of the missing ones simply don't exist (there is no Unicode character that I could find that represents the given symbols). It's unclear to me how we should deal with those. Perhaps we could just replace them with their plain text representation as @mmatera suggested at some point.

We're still missing the translations for characters in certain mathematical fonts (such as the letters with a dot underneath them that @rocky mentioned). There are also some errors in the "ESC-alias" column; I guess we'll have to fix them by hand at some point.

My idea is that this spreadsheet could be used to procedurally generate the translation tables used in https://github.com/mathics/Mathics/pull/1077, as well as the inputrc files used by mathicsscript.

I'd appreciate it if anyone could take the time to fill in the missing items. I've spent about 3 hours on this Unicode hellhole, and I'm too tired to continue for now. Again, I've done about 70% of the work already, and I estimate 20% of it can't be done (because there are no Unicode counterparts to the WL symbols), so the work that's left isn't as scary as it looks at first glance.
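The idea of generating translation tables from the spreadsheet could look roughly like the sketch below. The column names (`name`, `wl_code_point`, `unicode_equivalent`) and the sample rows are assumptions made for illustration; they are not the layout of the pastebin files. The \[FormalA] and U+F81A rows follow the values quoted in this thread, and the last row is a made-up placeholder for a symbol with no standard equivalent.

```python
import csv
import io

# Hypothetical excerpt of the spreadsheet; the real column layout may differ.
SAMPLE_CSV = """name,wl_code_point,unicode_equivalent
FormalA,F800,1EA1
FormalCapitalA,F81A,1EA0
Alpha,03B1,03B1
PlaceholderNoEquivalent,E000,
"""


def build_tables(csv_text: str):
    """Build (name -> WL character, WL character -> standard character)."""
    named_to_wl = {}
    wl_to_unicode = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        wl_char = chr(int(row["wl_code_point"], 16))
        named_to_wl[row["name"]] = wl_char
        # Rows with an empty equivalent have no standard counterpart
        # and are left out of the translation table.
        if row["unicode_equivalent"]:
            wl_to_unicode[wl_char] = chr(int(row["unicode_equivalent"], 16))
    return named_to_wl, wl_to_unicode


named_to_wl, wl_to_unicode = build_tables(SAMPLE_CSV)
print(len(named_to_wl), "named characters,", len(wl_to_unicode), "with equivalents")
```

Keeping the spreadsheet as the single source and regenerating the tables from it would avoid maintaining the same mappings by hand in several clients.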

GarkGarcia commented 3 years ago

PS: \[DifferentialD] is one of the characters I filled in, so https://github.com/mathics/Mathics/issues/206 should be dealt with by this.

rocky commented 3 years ago

What I'd love to see is a table of just the cases where there are standard Unicode equivalents that are usable where the WL Unicode is not usable. Those are precisely the ones that front-ends want to know about and use. And having the Unicode value for them would be nice too.

Other than that, thanks for the good and hard work. Yes, you can spend hours on this stuff. But hopefully having done so will save others countless hours of doing the same and avoid a lot of confusion.

If you want to add this to the developer docs, sure, go ahead. (But again I suggest that the most needed, most unique and most useful part is the part about the mismatches.)

On an unrelated note: for the site redesign, I mentioned this on the newsgroup and on Slack the other day, as I said I would do.

So far no comments. If we don't hear anything, maybe New Years we can change the home page - ring out the old and ring in the new.

GarkGarcia commented 3 years ago

What I'd love to see is a table of just the cases where there are standard Unicode equivalents that are usable where the WL Unicode is not usable. Those are precisely the ones that front-ends want to know about and use.

Makes sense. As soon as the "complete" table is done we can very easily filter the specific rows that fit your description. I still think having the complete table somewhere is useful too. I don't have any specific use case in mind, but it took us a lot of work to extract this information and we definitely don't want to have to do it again at some other point. Of course, the "complete" table probably wouldn't be used by the interpreter or the front-ends, but it would be nice to have it somewhere in mathics-developer-guide.

And having the Unicode value for them would be nice too.

Sure, this could be extracted from the "complete" table by a script. Again, as soon as I'm done with the "complete" table I'll write a script to take care of this.
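Such a filtering script might be as simple as the following sketch, which keeps only the rows where a standard equivalent exists and differs from the WL code point. The row format and the placeholder entry are assumptions for illustration; the \[FormalA] and U+F81A values follow the ones quoted in this thread.

```python
# Each row: (name, WL code point, standard Unicode code point or None).
# The tuple layout is hypothetical, not the format of the actual table.
ROWS = [
    ("FormalA", 0xF800, 0x1EA1),
    ("FormalCapitalA", 0xF81A, 0x1EA0),
    ("Alpha", 0x03B1, 0x03B1),  # WL value already is the standard one
    ("PlaceholderNoEquivalent", 0xE000, None),  # made-up row, no equivalent
]


def mismatches(rows):
    """Keep rows that have a standard equivalent different from the WL value."""
    return [
        (name, wl, std)
        for name, wl, std in rows
        if std is not None and std != wl
    ]


for name, wl, std in mismatches(ROWS):
    print(f"\\[{name}]: U+{wl:04X} -> U+{std:04X} ({chr(std)})")
```

The result is the subset front-ends care about: characters they should substitute, together with the Unicode value to substitute.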

On an unrelated note: for the site redesign, I mentioned this on the newsgroup and on Slack the other day, as I said I would do.

Thanks!

So far no comments. If we don't hear anything, maybe New Years we can change the home page - ring out the old and ring in the new.

Sounds reasonable.

GarkGarcia commented 3 years ago

Finally, the complete table is done! https://pastebin.com/jS9NerSL

There are still some symbols missing that might have a Unicode equivalent, but the current table covers all reasonable symbols (it turns out there's an awful lot of obscure symbols in WL).

@rocky I'll go ahead and generate the table you've asked for.

GarkGarcia commented 3 years ago

What I'd love to see is a table of just the cases where there are standard Unicode equivalents that are usable where the WL Unicode is not usable. Those are precisely the ones that front-ends want to know about and use. And having the Unicode value for them would be nice too.

Here's the table you asked for: https://pastebin.com/TF0qUk1v

GarkGarcia commented 3 years ago

PS: Happy (upcoming) new year everybody!

rocky commented 3 years ago

What I'd love to see is a table of just the cases where there are standard Unicode equivalents that are usable where the WL Unicode is not usable. Those are precisely the ones that front-ends want to know about and use. And having the Unicode value for them would be nice too.

Here's the table you asked for: https://pastebin.com/TF0qUk1v

Thanks! It doesn't look like there will be a problem using the composites, as far as I can tell.

The last remaining step to finish this is to turn this into a couple of Python dictionaries that we can use. Thanks.
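As a rough illustration of what a couple of Python dictionaries could look like in use, here is a sketch with one mapping in each direction, applied with `str.translate`. The dictionary and function names are hypothetical, and only two entries are shown, taken from values quoted in this thread (U+F800 to U+1EA1 and U+F81A to U+1EA0).

```python
# WL private-use character -> standard Unicode character.
WL_TO_UNICODE = {
    "\uf800": "\u1ea1",  # \[FormalA] -> a with dot below
    "\uf81a": "\u1ea0",  # U+F81A -> A with dot below
}

# Reverse direction. Inverting like this is only safe while the
# forward mapping is one-to-one.
UNICODE_TO_WL = {std: wl for wl, std in WL_TO_UNICODE.items()}

_TO_UNICODE = str.maketrans(WL_TO_UNICODE)
_TO_WL = str.maketrans(UNICODE_TO_WL)


def wl_to_unicode(text: str) -> str:
    """Replace WL-specific characters with standard Unicode ones."""
    return text.translate(_TO_UNICODE)


def unicode_to_wl(text: str) -> str:
    """Replace standard Unicode characters with WL-specific ones."""
    return text.translate(_TO_WL)


sample = "f[\uf800]"
print(wl_to_unicode(sample))
print(unicode_to_wl(wl_to_unicode(sample)) == sample)
```

One table serves front-ends that want standard Unicode output, and the reverse table lets pasted Unicode input be converted back to the codes the kernel expects.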

GarkGarcia commented 3 years ago

The last remaining step to finish this is to turn this into a couple of Python dictionaries that we can use.

Yep. I'll wait for your and @mmatera's input on the issues mentioned in https://github.com/mathics/Mathics/pull/1077#issuecomment-753089576 before converting the data to Python dictionaries.

Thanks.

No problem 😁️

GarkGarcia commented 3 years ago

@mathics/maintainers I guess this could be closed since #1077 was merged (?)

#1077 is more of a quick and dirty patch in my mind though; the bigger issue is still present.

rocky commented 3 years ago

@GarkGarcia feel free to keep going and/or work on whatever bigger issues bother you.

You can keep this issue open or keep the #1077 branch around if that helps. (Or delete it if it doesn't help).

In the meantime we have progress. And it is probably close enough to anything anyone will need in the foreseeable future.

What we've seen time and time again is that if a branch doesn't get merged in a few months there is a chance that branch will be useless or a big hassle to try to get it not to have merge conflicts.