source-foundry / code-corpora

Source code corpora for text analysis

Add analysis of kerning pairs and triplets #4

Closed: burodepeper closed this issue 5 years ago

burodepeper commented 6 years ago

I'd like to have the ability to see sorted lists of all glyph pairs and triplets for kerning and ligature purposes.

I'm interested in two variations:

1) Combinations of alphanumeric characters, limited to tokens.
2) Combinations of any characters, limited to virtual tokens with whitespace as separators.

The input `echo "Hello world!";` would yield the following results for glyph pairs:

1) ec ch ho He el ll lo wo or rl ld
2) ec ch ho "H He el ll lo wo or rl ld d! !" ";

Notes:

Further improvements:

chrissimpkins commented 6 years ago

Doing some research into how we can get at these data, David. I don't currently have an approach for this, but we should be able to come up with something. I'm assuming that you want to examine multiple programming language types?

burodepeper commented 6 years ago

Maybe it is as simple as:

const input = "echo 'Hello world!';"
const pairs = {} // pair -> count
const words = input.split(" ")
words.forEach(word => {
  const numberOfPairs = word.length - 1
  for (let position = 0; position < numberOfPairs; position++) {
    // Two-character window; substring's end index is exclusive.
    const pair = word.substring(position, position + 2)
    if (pairs[pair] === undefined) pairs[pair] = 0
    pairs[pair]++
  }
})

// Keep only the pairs made up entirely of alphanumeric characters.
const alphanumericPairs = {}
for (const key in pairs) {
  if (key.match(/^[a-zA-Z0-9]+$/)) {
    alphanumericPairs[key] = pairs[key]
  }
}
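
And sorting the counts into the lists I described would then be something like this (untested sketch; top25 is just an illustrative name):

// Order pairs by count, descending, and keep the top 25.
const top25 = Object.entries(alphanumericPairs)
  .sort(([, countA], [, countB]) => countB - countA)
  .slice(0, 25)
top25.forEach(([pair, count]) => console.log(pair, count))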
chrissimpkins commented 6 years ago

You can make that comment more than hypothetical by trying it :) Does it work?

chrissimpkins commented 6 years ago

The regex approach that I'm currently using to tokenize skips the second letter of each pair as the starting character of the next pair, so the matches don't overlap. I think regex is going to be the way to go though...
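
Maybe a zero-width lookahead would let the matches overlap; something like this untested sketch:

// The lookahead matches at every position without consuming characters,
// so overlapping pairs are all captured. lastIndex must be advanced by
// hand because each match is zero-width.
const text = 'echo "Hello world!";'
const pairPattern = /(?=([a-zA-Z0-9]{2}))/g
const found = []
let match
while ((match = pairPattern.exec(text)) !== null) {
  found.push(match[1])
  pairPattern.lastIndex++
}
// found: ["ec", "ch", "ho", "He", "el", "ll", "lo", "wo", "or", "rl", "ld"]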

burodepeper commented 6 years ago

(screenshot, 2017-12-10 17:33:06)

chrissimpkins commented 6 years ago

👍 glad to be of help :)

burodepeper commented 6 years ago

I figured you wanted to write one of your fancy Python scripts for it ;)

chrissimpkins commented 6 years ago

NLTK will be good for looking at frequencies of "words" in the source. It may be helpful to break that analysis down further by frequencies of commonly used source code word tokens so that you know where those will be positioned in the source.

chrissimpkins commented 6 years ago

Worth adding tokenizer-style scripts like yours above to this repo so that they are available for future use?

burodepeper commented 6 years ago

That can be done in the same loop, basically. I'll PR in a little script. It will be JavaScript though. Can you live with that?

chrissimpkins commented 6 years ago

Yeah, any language is fine so long as we can provide instructions on how to execute it. Feel free to add any Node or other-language scripts that you like. I believe that you should have write access here, so feel free to commit them.

burodepeper commented 6 years ago

Top 25 of all pairs

Note: top 25 is the same as alphanumeric only

| Pair | Count |
| --- | --- |
| re | 535562 |
| in | 533661 |
| st | 480328 |
| er | 472948 |
| on | 437225 |
| te | 426735 |
| at | 371145 |
| es | 369706 |
| th | 367923 |
| en | 363620 |
| le | 342193 |
| se | 341243 |
| ti | 329740 |
| nt | 326269 |
| or | 301370 |
| et | 298788 |
| he | 284611 |
| de | 277091 |
| ar | 268935 |
| co | 264515 |
| tr | 252544 |
| al | 241992 |
| me | 233023 |
| ct | 232167 |
| is | 229123 |

Top 25 punctuation (sequences)

| Punctuation | Count |
| --- | --- |
| _ | 950644 |
| , | 521376 |
| . | 495027 |
| ( | 410880 |
| * | 284346 |
| = | 235834 |
| \ | 217952 |
| ) | 207338 |
| ; | 187490 |
| { | 160600 |
| } | 150793 |
| $ | 143448 |
| ); | 132177 |
| ' | 129155 |
| -> | 116950 |
| " | 110955 |
| : | 101162 |
| # | 85641 |
| - | 79425 |
| @ | 77677 |
| // | 65263 |
| < | 57416 |
| */ | 53220 |
| :: | 46384 |
| / | 46231 |

Top 25 words (alphanumeric sequences)

| Word | Count |
| --- | --- |
| the | 138461 |
| if | 106561 |
| 0 | 95045 |
| return | 79042 |
| to | 73527 |
| a | 58388 |
| 1 | 58288 |
| is | 54749 |
| of | 50873 |
| this | 47499 |
| struct | 43786 |
| int | 43458 |
| in | 41061 |
| for | 39853 |
| name | 38186 |
| end | 38093 |
| and | 34397 |
| void | 32227 |
| array | 31923 |
| public | 30924 |
| new | 30679 |
| 2 | 29669 |
| i | 29285 |
| s | 28773 |
| static | 28509 |
chrissimpkins commented 6 years ago

What source are you using?

burodepeper commented 6 years ago

That's from everything in the code corpora

burodepeper commented 6 years ago

I just committed the analysis script

chrissimpkins commented 6 years ago

Wow nice. How long did that analysis take to run?

burodepeper commented 6 years ago

14 missississsiissisisispis

burodepeper commented 6 years ago

The word 'google' appears 20,864 times, and 'Google' appears only 880 times.

chrissimpkins commented 6 years ago

Think we should manually (or automatically) remove comment blocks? They may interfere with analysis of 'pure' source tokens.
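
If we automate it, a rough regex pass might be a good enough first cut. A sketch (the stripComments helper is just illustrative, and it will trip over comment markers inside string literals):

// Naive C-style comment stripper. A rough sketch: it does not handle
// comment markers inside strings, or languages with other comment syntax.
function stripComments(source) {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\/\/[^\n]*/g, "")       // line comments
}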

burodepeper commented 6 years ago

Nah, I think comments are an essential part of source code.

chrissimpkins commented 6 years ago

Agreed, but they will interfere with your top-25 lists if you are trying to examine language syntax across the text here.

burodepeper commented 6 years ago

True, but that's only an issue if you're examining language syntax, isn't it? ;) I'm thinking the scope is anything you'd be reading in your editor.

burodepeper commented 6 years ago

I am a little worried about the discrepancy between opening (410,880) and closing (207,338) brackets. I hope that's not the result of over 200,000 frowny faces...

chrissimpkins commented 6 years ago

Want to create a set of kerning sheets as text files based on these analyses? Would be useful to have programming language specific glyph pair combinations based on what you find here.

burodepeper commented 6 years ago

Well, there's no point kerning a monospaced typeface, but for something proportionally spaced, that would be a good idea. My initial idea was to use these results to focus on the areas that are used most often.

Obviously, the results are interesting for monospaced typefaces as well, but more in the sense of getting an idea of the balance of certain combinations, and perhaps even creating subtle ligatures to correct awkwardness. I'm going to add an analysis of triplets for that.
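
It should be roughly the same loop with a wider window; something like this sketch, reusing words from the script above:

// Same counting pattern as for pairs, with a three-character window.
const triplets = {}
words.forEach(word => {
  for (let position = 0; position < word.length - 2; position++) {
    const triplet = word.substring(position, position + 3)
    if (triplets[triplet] === undefined) triplets[triplet] = 0
    triplets[triplet]++
  }
})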

chrissimpkins commented 6 years ago

While there is no "kerning" per se, there are definitely spacing adjustments that can be made in monospaced faces. I spent a couple of months working on this with kerning sheets in the early days of Hack. It would be interesting to view the glyphs side by side to see if it is necessary to nudge here or there. Sometimes interesting missed spacing issues arise...

chrissimpkins commented 6 years ago

And yes, I think you are correct: triplets would be the way to do this! I think I looked at all glyphs between sets of o, n, l, etc...

burodepeper commented 6 years ago

Undoubtedly. Here's a top 25 of triplets, uploading the full report now.

| Triplet | Count |
| --- | --- |
| ion | 244676 |
| tio | 209407 |
| the | 194167 |
| ing | 173320 |
| ent | 156050 |
| ect | 134280 |
| est | 129051 |
| str | 127096 |
| --- | 116794 |
| tur | 113763 |
| ate | 112084 |
| ter | 108514 |
| ret | 106761 |
| ati | 106678 |
| con | 101808 |
| etu | 101450 |
| ame | 100108 |
| urn | 99842 |
| int | 96101 |
| ons | 90699 |
| sta | 89024 |
| tri | 87968 |
| cti | 83096 |
| for | 82378 |
| com | 79264 |

I've also added results with pairs and triplets that contain at least one alphanumeric character and at least one punctuation character. It has some very interesting output.
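
The filter for those is just two tests per key; a sketch, again against the pairs object from the script above:

// Keep sequences that mix at least one alphanumeric character with at
// least one non-alphanumeric (punctuation) character.
const mixed = {}
for (const key in pairs) {
  if (/[a-zA-Z0-9]/.test(key) && /[^a-zA-Z0-9]/.test(key)) {
    mixed[key] = pairs[key]
  }
}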

chrissimpkins commented 6 years ago

👍

chrissimpkins commented 6 years ago

Top 10 languages on GitHub based upon pull request activity:

JavaScript, Python, Java, Ruby, PHP, C++, CSS, C#, Go, C

I added new issue reports for C# and Go. Thoughts about adding CSS?

burodepeper commented 6 years ago

> Thoughts about adding CSS?

I've looked into scraping HTML and CSS from the Alexa top 500 sites, but ran into some silly issues. I'll look again later.

chrissimpkins commented 6 years ago

HTML is a good idea too. I wonder if that analysis used CSS to account for web projects that are HTML/CSS/JS?

burodepeper commented 6 years ago

I basically scraped the entire source: start with a domain, download the HTML, search the source for linked stylesheets and scripts, and download those as well. The source wasn't very representative, though; everything is processed and minified, and that's not the source that developers see. I think manually collecting content will probably yield better results.
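
For reference, the scraper was roughly this (a simplified sketch; the scrape function is illustrative, assumes a runtime with a global fetch, and the real script also handled errors and deduplication):

// Fetch a page, then fetch every stylesheet and script it links to.
async function scrape(url) {
  const html = await (await fetch(url)).text()
  const assetUrls = [...html.matchAll(/<(?:link[^>]*\shref|script[^>]*\ssrc)="([^"]+)"/g)]
    .map(match => new URL(match[1], url).href) // resolve relative URLs
  const assets = await Promise.all(
    assetUrls.map(async assetUrl => (await fetch(assetUrl)).text())
  )
  return [html, ...assets]
}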

chrissimpkins commented 6 years ago

Might be worth adding the popular HTML/CSS frameworks as part of the HTML collection too...

chrissimpkins commented 6 years ago

One of the few Github projects that can say this :)

(screenshot, Dec 11 09:45:32: the repository's language bar)

The language rainbow...

burodepeper commented 6 years ago

I looked into that, but it's a different kind of source. It's not really an application of HTML/CSS, but rather a limited set of examples, basically. As far as HTML/CSS is concerned, I'm thinking of getting the content from a bunch of large news sites: not just the homepages, but perhaps everything one level deep.

chrissimpkins commented 6 years ago

I believe that when I was adding the initial source, I attempted to keep each language at around 1M tokens to try to achieve a balance across the languages.

chrissimpkins commented 6 years ago

Are there apps that "de-minify" CSS and JS?

burodepeper commented 6 years ago

> Are there apps that "de-minify" CSS and JS?

That is what source maps are for, but I've never seen them used automatically in a case like this, only integrated in IDEs/browsers. There are also tools like prettier that format code.

But I think it's more about the things minification removes that you can't get back. Running all JS and CSS through prettier would yield nice results, but it's still not what the developer saw, with whitespace (around punctuation) being the biggest missing item. It might be possible to target the /src directories of the project pages of large(r) open source projects. They tend to contain the raw source code from which the site is built, each following its own style guide.
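
The prettier pass itself would be tiny; a sketch against prettier 2.x, where format() returns a string synchronously:

// Reformat minified JavaScript with prettier's Node API (2.x).
const prettier = require("prettier")

const minified = "function add(a,b){return a+b}"
console.log(prettier.format(minified, { parser: "babel" }))
// function add(a, b) {
//   return a + b;
// }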

chrissimpkins commented 6 years ago

SASS/LESS still a thing? People still using these "transpiled" style syntaxes for web dev?

burodepeper commented 6 years ago

> SASS/LESS still a thing? People still using these "transpiled" style syntaxes for web dev?

Pretty much exclusively. The same goes for JavaScript. Almost everything is transpiled in some form, even the simple things, just for cross-browser compatibility. Practically none of the code I put in production is what I actually wrote. It doesn't even look like what I wrote.

chrissimpkins commented 6 years ago

That suggests that all of these pre-transpiled languages are what need to be here, then?

chrissimpkins commented 6 years ago

You web devs and your transpiling... Why not just create a language that works from the start ;)

chrissimpkins commented 6 years ago

Go source files are now available. I added the entire Golang repository with all test files removed (these included text, images, etc. for testing purposes).

Here is a list of the top fifty tokens in these files:

[',', '(', ')', '=', ':', '{', '}', "''", '//', '``', '[', ']', 'if', '&', '0', '!', '1', 'return', '%', '<', 'the', 'func', 'x', 'err', '==', '>', 'nil', ';', 'for', 'a', 's', '_', 'break', 'true', 'i', 'int', 'is', 'to', 't', 'd', 'v', 'y', 'v.Args', 'of', 'b', 'case', 'string', 'c', 'got', 'in']