limetext / lime

Open source API-compatible alternative to the text editor Sublime Text
http://limetext.github.io
BSD 2-Clause "Simplified" License

Any support for custom UNICODE encodings? #473

Open milindsmart opened 9 years ago

milindsmart commented 9 years ago

Is anybody here interested in allowing custom encodings that work on Unicode? To clarify, I mean an alternative Unicode encoding alongside UTF-8, UTF-16, etc.

erbridge commented 9 years ago

Yes, that would be a good feature to have. Pull requests would be welcome!

milindsmart commented 9 years ago

Good to know. I'm still new to Go, so as I ramp up and reach some level of comfort with it, can someone give me a quick overview of how Unicode is handled in limetext?

quarnster commented 9 years ago

It's just using Go native utf8 string and []rune literals. Other encodings would have to be converted to/from the Go native format upon load and save.

IIRC there's currently no single point that all IO operations go through, so creating one might be a good place to start. Not sure what all the different file IO requirements are, but it might be possible to satisfy them with just a ReadWriteCloser interface.

Each encoding would then satisfy that very same IO interface, so it can simply wrap its corresponding Reader and Writer: each accepts data in one encoding, converts it to the other encoding, and forwards it to the next Reader/Writer in the chain.
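As a rough illustration of that chaining idea, here is a minimal sketch using only the standard library. The `utf16leReader` type and the `decodeUTF16LE` helper are hypothetical names made up for this example and are not part of lime. For brevity the decoder loads the whole input on the first `Read`; a real implementation would presumably stream.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"unicode/utf16"
)

// utf16leReader is a hypothetical decoding stage: it pulls UTF-16LE bytes
// from src and serves the same text to its caller as UTF-8.
type utf16leReader struct {
	src    io.Reader
	out    bytes.Buffer // decoded UTF-8 not yet handed to the caller
	loaded bool
}

func (r *utf16leReader) Read(p []byte) (int, error) {
	if !r.loaded {
		// Simplification: decode everything at once instead of streaming.
		raw, err := io.ReadAll(r.src)
		if err != nil {
			return 0, err
		}
		units := make([]uint16, len(raw)/2) // a trailing odd byte is ignored
		for i := range units {
			units[i] = binary.LittleEndian.Uint16(raw[2*i:])
		}
		// utf16.Decode combines surrogate pairs into single runes.
		r.out.WriteString(string(utf16.Decode(units)))
		r.loaded = true
	}
	return r.out.Read(p)
}

// decodeUTF16LE builds a two-stage chain (raw bytes -> decoder) and drains it.
func decodeUTF16LE(raw []byte) (string, error) {
	var r io.Reader = bytes.NewReader(raw) // stage 1: the encoded source
	r = &utf16leReader{src: r}             // stage 2: UTF-16LE to UTF-8
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	// "hi" followed by U+1F600 encoded as a surrogate pair.
	text, err := decodeUTF16LE([]byte{0x68, 0x00, 0x69, 0x00, 0x3d, 0xd8, 0x00, 0xde})
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(text)
}
```

Because every stage is just an `io.Reader`, the consumer at the end of the chain does not need to know which encoding the file on disk used.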

quarnster commented 9 years ago

Or if you were hoping to be able to work directly in an encoded format (mmap-ed files come to mind), you'd have to implement an InnerBufferInterface dealing with that.

milindsmart commented 9 years ago

Go native strings are utf8, but []rune slices are utf32... To be really neutral to all encodings, utf32 seems like the ideal.

And yeah, I read the utf8 and utf16 codec packages for Go. That's exactly the kind of package/module I envision could be created to enable a new encoding. In fact, I'm thinking of an encoding designing GUI tool, where people map what they need to different binary ranges, and the code is automatically generated..

About working directly on encoded representation, I guess it'll be useful only in case of very large files. Is that also one of the aims for Limetext?

quarnster commented 9 years ago

> Go native strings are utf8, but []rune slices are utf32... To be really neutral to all encodings, utf32 seems like the ideal.

Type conversion from/to string and []rune is built into the language though. I.e. string([]rune(string([]rune("hello world")))) would work.
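A small sketch of that built-in conversion, with `roundTrip` being a helper name invented for this example. It also shows why the two representations differ in length for non-ASCII text. One caveat: the round trip only reproduces the input when the string is valid UTF-8, since invalid bytes become U+FFFD during conversion to []rune.

```go
package main

import "fmt"

// roundTrip converts a string to []rune and back using only the
// conversions built into the language.
func roundTrip(s string) string {
	return string([]rune(s))
}

func main() {
	s := "héllo"
	r := []rune(s)

	// len(s) counts UTF-8 bytes, len(r) counts code points.
	fmt.Println(len(s), len(r)) // 6 5

	fmt.Println(roundTrip(s) == s) // true
}
```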

> In fact, I'm thinking of an encoding designing GUI tool, where people map what they need to different binary ranges, and the code is automatically generated..

I was doing something similar for reverse engineering purposes. Might put that up later today or in the weekend if I get a chance if you'd like to use any of it. It pretty much allows you to specify that a range of memory is in a certain format and it'll use QML to display it.

> About working directly on encoded representation, I guess it'll be useful only in case of very large files. Is that also one of the aims for Limetext?

Not personally, but I wouldn't stop someone who finds that useful from submitting a pull request enabling that use case.

milindsmart commented 9 years ago

No I get that conversion is built into the language. But from the point of view of LimeText, the in-memory representation of the content of a text file is what I'm trying to understand... Is it UTF-8 or []rune? From what I saw, it's UTF-8. I'm only saying that from a purist/neutral position, it appears UTF-32 is a better encoding...

> I was doing something similar for reverse engineering purposes. Might put that up later today or in the weekend if I get a chance if you'd like to use any of it. It pretty much allows you to specify that a range of memory is in a certain format and it'll use QML to display it.

Sounds interesting... what do you mean by "is in a certain format" ?

Since I'm still learning the ropes, I think I'll defer working on the very-very-large-file use case for later.

milindsmart commented 9 years ago

Just to confirm, this enhancement would be entirely implemented within the limetext/text repository, and not involving the limetext/lime repo... right?

quarnster commented 9 years ago

> But from the point of view of LimeText, the in-memory representation of the content of a text file is what I'm trying to understand... Is it UTF-8 or []rune?

Neither actually. It's a rope-ish hierarchical data structure with individual nodes dealing with []rune slices.

But the details of that are hidden behind the InnerBufferInterface, which only deals with positions and []rune slices.

For ease of use there's also a Buffer interface which expands the InnerBufferInterface with helper functions that allow you to work with strings if that's preferred.
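To make the layering concrete, here is a simplified sketch. Everything in it is an assumption for illustration: `runeStore`, `sliceStore` and `stringBuffer` are made-up names, the method set is not the actual InnerBufferInterface or Buffer API, and a flat slice stands in for the rope-ish structure.

```go
package main

import "fmt"

// runeStore is a hypothetical stand-in for an inner buffer interface:
// it only knows about positions and []rune.
type runeStore interface {
	Size() int
	Substr(start, end int) []rune
	Insert(pos int, data []rune)
}

// sliceStore is a trivial flat implementation used only for this sketch.
type sliceStore struct{ data []rune }

func (s *sliceStore) Size() int { return len(s.data) }

func (s *sliceStore) Substr(start, end int) []rune {
	// Return a copy so callers cannot modify the stored runes.
	return append([]rune(nil), s.data[start:end]...)
}

func (s *sliceStore) Insert(pos int, data []rune) {
	// Copy the tail first, because appending may overwrite it in place.
	tail := append([]rune(nil), s.data[pos:]...)
	s.data = append(append(s.data[:pos], data...), tail...)
}

// stringBuffer layers string-based helpers on top of any runeStore.
type stringBuffer struct{ runeStore }

func (b stringBuffer) InsertString(pos int, s string) { b.Insert(pos, []rune(s)) }

func (b stringBuffer) String() string { return string(b.Substr(0, b.Size())) }

func main() {
	b := stringBuffer{&sliceStore{}}
	b.InsertString(0, "hello")
	b.InsertString(5, " wörld")
	fmt.Println(b.String(), b.Size()) // positions count runes, not bytes
}
```

The point of the split is that an alternative storage strategy only has to implement the small rune-level interface, while string convenience methods are written once on top.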

> Sounds interesting... what do you mean by "is in a certain format" ?

Well, this is getting off topic for this repo, but what I did was to have a Formatter interface that takes a []byte slice and returns a string. One formatter disassembles the data, one shows it in "hexdump" format, one displays it as if it was a PEM RSA public key, etc.
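A guess at what such an interface could look like, purely as a sketch: the `Formatter` method name and both concrete formatters below are assumptions made for this example, not the actual tool's code.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// Formatter is a hypothetical version of the interface described above:
// it takes raw bytes and returns a human-readable string.
type Formatter interface {
	Format(data []byte) string
}

// hexdumpFormatter renders the bytes in the classic hexdump layout.
type hexdumpFormatter struct{}

func (hexdumpFormatter) Format(data []byte) string { return hex.Dump(data) }

// quotedFormatter renders the bytes as a quoted, escaped Go string.
type quotedFormatter struct{}

func (quotedFormatter) Format(data []byte) string { return fmt.Sprintf("%q", data) }

func main() {
	data := []byte("hi\n")
	// The same byte range can be viewed through any formatter.
	for _, f := range []Formatter{hexdumpFormatter{}, quotedFormatter{}} {
		fmt.Println(f.Format(data))
	}
}
```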

> Just to confirm, this enhancement would be entirely implemented within the limetext/text repository, and not involving the limetext/lime repo... right?

Depends on what you mean with "this enhancement". If you intend to look into multiple text encodings it would be better to have your own repository for that, or if you'd like to donate it to the limetext org a separate repo here, as it'll likely be useful for others who are interested in dealing with different encodings but have no interest at all in limetext otherwise.

At the very least it should go into its own package IMO as it's not strictly related to any existing functionality but rather expands current functionality with a new dimension.

milindsmart commented 9 years ago

> Neither actually. It's a rope-ish hierarchical data structure with individual nodes dealing with []rune slices.

Yeah I finally got it, after also reading the source and the article linked w.r.t the rope-ish data structure. But it's clear that the internal encoding of the editor is UTF-32, which is good.

> Depends on what you mean with "this enhancement". If you intend to look into multiple text encodings it would be better to have your own repository for that, or if you'd like to donate it to the limetext org a separate repo here, as it'll likely be useful for others who are interested in dealing with different encodings but have no interest at all in limetext otherwise.

Yes, most definitely I would host the encoding maps separately. But the architecture/entry point that accepts a plugin with encode/decode functions has to be part of the editor itself. Each editor also needs its own implementation of the encode/decode functions. I don't think DLLs are a good option, since they can't be used cross-platform.

quarnster commented 9 years ago

Not sure why DLLs are mentioned. If they were considered, why not just use gconv?

quarnster commented 9 years ago

And for reference, https://github.com/qiniu/iconv enables one such approach

quarnster commented 9 years ago

Also, it looks like there are some pure Go interfaces and concrete implementations in the golang.org/x/text package. Nice!

milindsmart commented 9 years ago

Yeah, eventually I would add these encodings to gconv if/when they become widespread.

EDIT: After checking out iconv, it seems like a nice starting point. They have done much of the work I had in mind, including the codec source code generator that works from a character map. Maybe you could include a version of iconv in LimeText, in which case writing a new plugin would be tantamount to adding a new encoding to iconv? Since iconv is so solid and standard, I hope I can persuade more people to take that approach too. Perhaps with a separate directory for custom encodings.

What I'm trying to do is not invent a new encoding to use in programming languages, which can be done quite easily. I want to be able to use these new encodings with text files, which requires that the editor can be augmented with new encoding plugins. I mentioned DLLs because that's one way plugins are made. It's not ideal because it's tied to Windows. I would prefer each such encoding to be in the form of a spec, which is (manually/automatically) converted into Go source code in a form suitable to be plugged into LimeText. Right now I'm looking for some help on how to enable the plugging-in part in Lime. Then UTF-8 would be the first test case...

I hope you got the general order of how I think this should go. Please do point out any deficiencies, factors I have not considered, and your suggestions about what order I should implement these aspects.

milindsmart commented 9 years ago

Bump... I'd love to have some suggestions on where I should start.

quarnster commented 9 years ago

I'd suggest you create a github repo with your encoding and make it satisfy the https://godoc.org/golang.org/x/text/encoding#Encoding interface.

For hooking into lime, we'd need to make sure all our IO operations go via io.Reader and io.Writer, and if they don't, figure out why and whether they can be changed to do so. That way we can just use https://godoc.org/golang.org/x/text/transform#NewReader to support reading files in any encoding and https://godoc.org/golang.org/x/text/transform#NewWriter to support writing files in any encoding.