source-foundry / code-corpora

Source code corpora for text analysis

Add analysis of kerning pairs and triplets #4

Closed: burodepeper closed this issue 5 years ago

burodepeper commented 6 years ago

I'd like to have the ability to see sorted lists of all glyph pairs and triplets for kerning and ligature purposes.

I'm interested in two variations:

1) Combinations of alphanumeric characters, limited to tokens.
2) Combinations of any characters, limited to virtual tokens with whitespace as separators.

The input `echo "Hello world!";` would yield the following results for glyph pairs:

1) ec ch ho He el ll lo wo or rl ld
2) ec ch ho "H He el ll lo wo or rl ld d! !" ";

Notes:

Further improvements:

chrissimpkins commented 6 years ago

Doing some research into how we can get at these data, David. I don't currently have an approach for this, but we should be able to come up with something. I'm assuming that you want to examine multiple programming language types?

burodepeper commented 6 years ago

Maybe it is as simple as:

const input = "echo 'Hello world!';"
const pairs = {} // pair -> count
const words = input.split(" ")
words.forEach(word => {
  const numberOfPairs = word.length - 1
  for (let position = 0; position < numberOfPairs; position++) {
    // Two-character window; substring's end index is exclusive.
    const pair = word.substring(position, position + 2)
    if (pairs[pair] === undefined) pairs[pair] = 0
    pairs[pair]++
  }
})

// Keep only the pairs made up entirely of alphanumeric characters.
const alphanumericPairs = {}
for (const key in pairs) {
  if (key.match(/^[a-zA-Z0-9]+$/)) {
    alphanumericPairs[key] = pairs[key]
  }
}
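
And sorting the counts into the lists I described would then be something like this (untested sketch; top25 is just an illustrative name):

// Order pairs by count, descending, and keep the top 25.
const top25 = Object.entries(alphanumericPairs)
  .sort(([, countA], [, countB]) => countB - countA)
  .slice(0, 25)
top25.forEach(([pair, count]) => console.log(pair, count))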
chrissimpkins commented 6 years ago

You can make that comment more than hypothetical by trying it :) Does it work?

chrissimpkins commented 6 years ago

The regex approach that I'm currently using to tokenize skips the second letter of each pair as the starting character of the next pair, so the matches don't overlap. I think regex is going to be the way to go though...
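
Maybe a zero-width lookahead would let the matches overlap; something like this untested sketch:

// The lookahead matches at every position without consuming characters,
// so overlapping pairs are all captured. lastIndex must be advanced by
// hand because each match is zero-width.
const text = 'echo "Hello world!";'
const pairPattern = /(?=([a-zA-Z0-9]{2}))/g
const found = []
let match
while ((match = pairPattern.exec(text)) !== null) {
  found.push(match[1])
  pairPattern.lastIndex++
}
// found: ["ec", "ch", "ho", "He", "el", "ll", "lo", "wo", "or", "rl", "ld"]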

burodepeper commented 6 years ago

(screenshot, 2017-12-10 17:33:06)

chrissimpkins commented 6 years ago

👍 glad to be of help :)

burodepeper commented 6 years ago

I figured you wanted to write one of your fancy Python scripts for it ;)

chrissimpkins commented 6 years ago

NLTK will be good for looking at frequencies of "words" in the source. It may be helpful to break that analysis down further by frequencies of commonly used source code word tokens so that you know where those will be positioned in the source.

chrissimpkins commented 6 years ago

Worth adding tokenizer-style scripts like yours above to this repo so that they are available for future use?

burodepeper commented 6 years ago

That can be done in the same loop, basically. I'll PR in a little script. It will be JavaScript though. Can you live with that?

chrissimpkins commented 6 years ago

Yeah, any language is fine so long as we can provide instructions on how to execute it. Feel free to add any Node or other-language scripts that you like. I believe that you should have write access here, so feel free to commit them.

burodepeper commented 6 years ago

Top 25 of all pairs

Note: top 25 is the same as alphanumeric only

| Pair | Count |
| --- | --- |
| re | 535562 |
| in | 533661 |
| st | 480328 |
| er | 472948 |
| on | 437225 |
| te | 426735 |
| at | 371145 |
| es | 369706 |
| th | 367923 |
| en | 363620 |
| le | 342193 |
| se | 341243 |
| ti | 329740 |
| nt | 326269 |
| or | 301370 |
| et | 298788 |
| he | 284611 |
| de | 277091 |
| ar | 268935 |
| co | 264515 |
| tr | 252544 |
| al | 241992 |
| me | 233023 |
| ct | 232167 |
| is | 229123 |

Top 25 punctuation (sequences)

| Punctuation | Count |
| --- | --- |
| _ | 950644 |
| , | 521376 |
| . | 495027 |
| ( | 410880 |
| * | 284346 |
| = | 235834 |
| \ | 217952 |
| ) | 207338 |
| ; | 187490 |
| { | 160600 |
| } | 150793 |
| $ | 143448 |
| ); | 132177 |
| ' | 129155 |
| -> | 116950 |
| " | 110955 |
| : | 101162 |
| # | 85641 |
| - | 79425 |
| @ | 77677 |
| // | 65263 |
| < | 57416 |
| */ | 53220 |
| :: | 46384 |
| / | 46231 |

Top 25 words (alphanumeric sequences)

| Word | Count |
| --- | --- |
| the | 138461 |
| if | 106561 |
| 0 | 95045 |
| return | 79042 |
| to | 73527 |
| a | 58388 |
| 1 | 58288 |
| is | 54749 |
| of | 50873 |
| this | 47499 |
| struct | 43786 |
| int | 43458 |
| in | 41061 |
| for | 39853 |
| name | 38186 |
| end | 38093 |
| and | 34397 |
| void | 32227 |
| array | 31923 |
| public | 30924 |
| new | 30679 |
| 2 | 29669 |
| i | 29285 |
| s | 28773 |
| static | 28509 |
chrissimpkins commented 6 years ago

What source are you using?

burodepeper commented 6 years ago

That's from everything in the code corpora

burodepeper commented 6 years ago

I just committed the analysis script

chrissimpkins commented 6 years ago

Wow nice. How long did that analysis take to run?

burodepeper commented 6 years ago

14 missississsiissisisispis

burodepeper commented 6 years ago

The word 'google' appears 20,864 times, and 'Google' appears only 880 times.

chrissimpkins commented 6 years ago

Think we should manually (or automatically) remove comment blocks? They may interfere with analysis of 'pure' source tokens.
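
If we automate it, a rough regex pass might be a good enough first cut. A sketch (the stripComments helper is just illustrative, and it will trip over comment markers inside string literals):

// Naive C-style comment stripper. A rough sketch: it does not handle
// comment markers inside strings, or languages with other comment syntax.
function stripComments(source) {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\/\/[^\n]*/g, "")       // line comments
}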

burodepeper commented 6 years ago

Nah, I think comments are an essential part of source code.

chrissimpkins commented 6 years ago

Agreed, but they will interfere with your top-25 lists if you are trying to examine language syntax across the text here.

burodepeper commented 6 years ago

True, but that's only an issue if you're examining language syntax, isn't it? ;) I'm thinking the scope is anything you'd be reading in your editor.

burodepeper commented 6 years ago

I am a little worried about the discrepancy between opening (410,880) and closing (207,338) brackets. I hope that's not the result of over 200,000 frowny faces...

chrissimpkins commented 6 years ago

Want to create a set of kerning sheets as text files based on these analyses? Would be useful to have programming language specific glyph pair combinations based on what you find here.

burodepeper commented 6 years ago

Well, there's no point kerning a monospaced typeface, but for something proportionally spaced, that would be a good idea. My initial idea was to use these results to focus on the areas that are used most often.

Obviously, the results are interesting for monospaced typefaces as well, but more in the sense of getting an idea of the balance of certain combinations, and perhaps even creating subtle ligatures to correct awkwardness. I'm going to add an analysis of triplets for that.
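
It should be roughly the same loop with a wider window; something like this sketch, reusing words from the script above:

// Same counting pattern as for pairs, with a three-character window.
const triplets = {}
words.forEach(word => {
  for (let position = 0; position < word.length - 2; position++) {
    const triplet = word.substring(position, position + 3)
    if (triplets[triplet] === undefined) triplets[triplet] = 0
    triplets[triplet]++
  }
})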

chrissimpkins commented 6 years ago

While there is no "kerning" per se, there are definitely spacing adjustments that can be made in monospaced faces. I spent a couple of months working on this with kerning sheets in the early days of Hack. It would be interesting to view the glyphs side by side to see if it is necessary to nudge here or there. Sometimes interesting missed spacing issues arise...

chrissimpkins commented 6 years ago

And yes, I think you are correct: triplets would be the way to do this! I think I looked at all glyphs between sets of o, n, l, etc...

burodepeper commented 6 years ago

Undoubtedly. Here's a top 25 of triplets, uploading the full report now.

| Triplet | Count |
| --- | --- |
| ion | 244676 |
| tio | 209407 |
| the | 194167 |
| ing | 173320 |
| ent | 156050 |
| ect | 134280 |
| est | 129051 |
| str | 127096 |
| --- | 116794 |
| tur | 113763 |
| ate | 112084 |
| ter | 108514 |
| ret | 106761 |
| ati | 106678 |
| con | 101808 |
| etu | 101450 |
| ame | 100108 |
| urn | 99842 |
| int | 96101 |
| ons | 90699 |
| sta | 89024 |
| tri | 87968 |
| cti | 83096 |
| for | 82378 |
| com | 79264 |

I've also added results with pairs and triplets that contain at least one alphanumeric character and at least one punctuation character. It has some very interesting output.
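
The filter for those is just two tests per key; a sketch, again against the pairs object from the script above:

// Keep sequences that mix at least one alphanumeric character with at
// least one non-alphanumeric (punctuation) character.
const mixed = {}
for (const key in pairs) {
  if (/[a-zA-Z0-9]/.test(key) && /[^a-zA-Z0-9]/.test(key)) {
    mixed[key] = pairs[key]
  }
}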

chrissimpkins commented 6 years ago

👍

chrissimpkins commented 6 years ago

Top 10 languages on GitHub based upon pull request activity:

JavaScript, Python, Java, Ruby, PHP, C++, CSS, C#, Go, C

I added new issue reports for C# and Go. Thoughts about adding CSS?

burodepeper commented 6 years ago

> Thoughts about adding CSS?

I've looked into scraping HTML and CSS from the Alexa top 500 sites, but ran into some silly issues. I'll look again later.

chrissimpkins commented 6 years ago

HTML is a good idea too. I wonder if that analysis used CSS to account for web projects that are HTML/CSS/JS?

burodepeper commented 6 years ago

I basically scraped the entire source: start with a domain, download the HTML, search the source for linked stylesheets and scripts, and download those as well. The source wasn't very representative, though; everything is processed and minified, and that's not the source that developers see. I think manually collecting content will probably yield better results.
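
For reference, the scraper was roughly this (a simplified sketch; the scrape function is illustrative, assumes a runtime with a global fetch, and the real script also handled errors and deduplication):

// Fetch a page, then fetch every stylesheet and script it links to.
async function scrape(url) {
  const html = await (await fetch(url)).text()
  const assetUrls = [...html.matchAll(/<(?:link[^>]*\shref|script[^>]*\ssrc)="([^"]+)"/g)]
    .map(match => new URL(match[1], url).href) // resolve relative URLs
  const assets = await Promise.all(
    assetUrls.map(async assetUrl => (await fetch(assetUrl)).text())
  )
  return [html, ...assets]
}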

chrissimpkins commented 6 years ago

Might be worth adding the popular HTML/CSS frameworks as part of the HTML collection too...

chrissimpkins commented 6 years ago

One of the few Github projects that can say this :)

(screenshot, Dec 11 09:45:32: the repository's language bar)

The language rainbow...

burodepeper commented 6 years ago

I looked into that, but it's a different kind of source. It's not really an application of HTML/CSS, but rather a limited set of examples, basically. As far as HTML/CSS is concerned, I'm thinking of getting the content from a bunch of large news sites: not just the homepages, but perhaps everything one level deep.

chrissimpkins commented 6 years ago

I believe that when I was adding the initial source, I attempted to keep each language at around 1M tokens to try to achieve a balance across the languages.

chrissimpkins commented 6 years ago

Are there apps that "de-minify" CSS and JS?

burodepeper commented 6 years ago

> Are there apps that "de-minify" CSS and JS?

That is what source maps are for, but I've never seen them used automatically in a case like this, only integrated in IDEs/browsers. There are also tools like prettier that format code.

But I think it's more about the things minification removes that you can't get back. Running all JS and CSS through prettier would yield nice results, but it's still not what the developer saw, with whitespace (around punctuation) being the biggest missing item. It might be possible to target the /src directories of the project pages of large(r) open source projects. They tend to contain the raw source code from which the site is built, each following its own style guide.
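
The prettier pass itself would be tiny; a sketch against prettier 2.x, where format() returns a string synchronously:

// Reformat minified JavaScript with prettier's Node API (2.x).
const prettier = require("prettier")

const minified = "function add(a,b){return a+b}"
console.log(prettier.format(minified, { parser: "babel" }))
// function add(a, b) {
//   return a + b;
// }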

chrissimpkins commented 6 years ago

SASS/LESS still a thing? People still using these "transpiled" style syntaxes for web dev?

burodepeper commented 6 years ago

> SASS/LESS still a thing? People still using these "transpiled" style syntaxes for web dev?

Pretty much exclusively. The same goes for JavaScript. Almost everything is transpiled in some form, even the simple things, just for cross-browser compatibility. Practically none of the code I put in production is what I actually wrote. It doesn't even look like what I wrote.

chrissimpkins commented 6 years ago

That suggests that all of these pre-transpiled languages are what need to be here, then?

chrissimpkins commented 6 years ago

You web devs and your transpiling... Why not just create a language that works from the start ;)

chrissimpkins commented 6 years ago

Go source files are now available. I added the entire Golang repository with all test files removed (these included text, images, etc. for testing purposes).

Here is a list of the top fifty tokens in these files:

[',', '(', ')', '=', ':', '{', '}', "''", '//', '``', '[', ']', 'if', '&', '0', '!', '1', 'return', '%', '<', 'the', 'func', 'x', 'err', '==', '>', 'nil', ';', 'for', 'a', 's', '_', 'break', 'true', 'i', 'int', 'is', 'to', 't', 'd', 'v', 'y', 'v.Args', 'of', 'b', 'case', 'string', 'c', 'got', 'in']