Subword detection of words which areallinlowercase

spadgos commented 6 years ago

Because of [reasons], a codebase I work on has a lot of identifiers where multiple words are concatenated in lowercase, somethinglikethis. I was thinking it might be nice if this extension supported navigating between the subwords in those cases. There's an example algorithm for achieving that here: https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words

If you think perhaps this is a good fit for this extension, I'm happy to put together a PR.

I imagine that this extension would only use the subword detection if the word consists entirely of lowercase letters [a-z], so no underscores or camelCase. Otherwise, the behaviour would remain the same as it is now. The algorithm prefers longer words, so "iteratorthing" would be interpreted as "iterator thing" and not "it er at or thing".
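To make the idea concrete, here is a minimal sketch in Python. It is not the algorithm from the linked answer: it simply picks the segmentation with the fewest dictionary words, which is one simple way to prefer longer words. The `WORDS` set and the `split_subwords` name are made up for illustration; a real version would load a proper word list.

```python
import re

# Hypothetical toy dictionary; a real one would be loaded from a word list.
WORDS = {"it", "er", "at", "or", "thing", "iterator",
         "something", "some", "like", "this"}


def split_subwords(identifier, words=WORDS):
    """Split an all-lowercase identifier into as few dictionary words as possible.

    Anything that is not purely [a-z], or that cannot be fully covered by
    dictionary words, is returned unchanged as a single-element list.
    """
    if not re.fullmatch(r"[a-z]+", identifier):
        return [identifier]  # underscores, camelCase, digits: leave alone

    n = len(identifier)
    max_len = max(map(len, words), default=0)

    # best[i] holds the fewest-word split of identifier[:i], or None if
    # that prefix cannot be built from dictionary words.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = identifier[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate

    return best[n] if best[n] is not None else [identifier]
```

With this toy dictionary, `split_subwords("iteratorthing")` gives `["iterator", "thing"]` rather than the five-piece split, while `split_subwords("iteratorThing")` is returned untouched so the existing camelCase handling would still apply.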

The downside I can see is that you need a decent dictionary for the results to be any good. A lot of programming terms aren't considered real words by some dictionaries (case in point: "iterator" isn't a word), but that could be alleviated by using the same sort of dictionary files that other extensions use.

Anyway, let me know if you think this might be interesting.

Cheers

ow-- commented 6 years ago

If we could get it to just work like magic, without swamping the extension in configuration and complexity or tanking the performance, it would be an awesome addition! It's certainly something I miss occasionally.

By all means, see where it leads. If nothing else you'd be able to keep sane in your unfortunate codebase. 😉

Thinking about a proper release, though: do you reckon it would feel manageable to keep a bundled dictionary? I don't really like the thought of forcing megabytes of word lists on people who aren't interested in the feature or who aren't actually coding in English. Looking over the linked dictionary infrastructure, it seems like quite some work to keep the dictionaries external. And the existing ones don't look ready to be reused right now, since they register with the mothership extension.

If the naive approach to dictionary lookup isn't performant enough, the trie/DAWG path could be investigated, but that would lead towards precompilation and maybe binary serialization to keep the size down, etc. The end result is pretty cool, but there could be some work there too.
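For reference, a rough sketch of what the trie path could look like, written in Python for brevity. It is a plain nested-dict trie, not a DAWG, and it does no precompilation or serialization; `build_trie` and `words_starting_at` are hypothetical names. The point is that a single walk from a given offset finds every dictionary word starting there, instead of testing each candidate substring separately.

```python
END = "$"  # end-of-word marker; safe because inputs are limited to [a-z]


def build_trie(words):
    """Build a nested-dict trie from an iterable of lowercase words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def words_starting_at(trie, text, start):
    """Return the end offsets of every dictionary word beginning at text[start]."""
    ends = []
    node = trie
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no dictionary word continues along this path
        if END in node:
            ends.append(i + 1)
    return ends
```

For example, with a trie built from "it", "iterator" and "thing", walking "iteratorthing" from offset 0 reports word ends at offsets 2 and 8, and walking from offset 8 reports an end at 13.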

spadgos commented 6 years ago

Good points there. I'd previously put together a proof of concept of this as a fun side project. I tried it out with a Scrabble dictionary as well as a list of the most common words on Wikipedia. The lists were ~1MB-1.5MB on disk, which might be a bit much. It's not terrible, but parsing that into a trie is far from instant. I definitely wouldn't want to do it on demand, since it would feel pretty sluggish. I don't know how bad it would be to keep that much data in memory the whole time 🙁

ow-- / vscode-subword-navigation

Subword detection of words which areallinlowercase #31