Doing some research into how we can get at these data, David. I don't currently have an approach for this, but we should be able to come up with something. I'm assuming that you want to examine multiple programming language types?
Maybe it is as simple as:
```js
const input = "echo 'Hello world!';"
const pairs = {} // map of pair -> count
const words = input.split(" ")
words.forEach(word => {
  const numberOfPairs = word.length - 1
  for (let position = 0; position < numberOfPairs; position++) {
    // each pair is the two characters starting at `position`
    const pair = word.substring(position, position + 2)
    if (pairs[pair] === undefined) pairs[pair] = 0
    pairs[pair]++
  }
})

// keep only the pairs made up entirely of alphanumeric characters
const alphanumericPairs = {}
for (const key in pairs) {
  if (key.match(/^[a-zA-Z0-9]+$/)) {
    alphanumericPairs[key] = pairs[key]
  }
}
```
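And turning the counts into a ranked list (untested, but `Object.entries` should do it) could be as simple as:

```js
// sort the collected counts descending and keep the top 25
const top25 = Object.entries(pairs)
  .sort(([, a], [, b]) => b - a)
  .slice(0, 25)
top25.forEach(([pair, count]) => console.log(`${pair} | ${count}`))
```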
You can make that comment more than hypothetical by trying it :) does it work?
The regex approach I'm currently using to tokenize skips the second letter of each pair, treating it as the starting character of the next pair sequence. I think regex is going to be the way to go though...
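A zero-width lookahead might sidestep that, since it matches at every position without consuming the overlap. A rough, untested sketch:

```js
// capture overlapping alphanumeric pairs: "echo" yields ec, ch, ho
const source = "echo 'Hello world!';"
const pairs = [...source.matchAll(/(?=([a-zA-Z0-9]{2}))/g)].map(m => m[1])
console.log(pairs) // ["ec", "ch", "ho", "He", "el", "ll", "lo", "wo", "or", "rl", "ld"]
```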
👍 glad to be of help :)
I figured you wanted to write one of your fancy Python scripts for it ;)
NLTK will be good for looking at frequencies of "words" in the source. It may be helpful to break that analysis down further by frequencies of commonly used source code word tokens, so that you know where those will be positioned in the source.
Worth adding tokenizer-style scripts like yours above to this repo so that they are available for future use?
That can be done in the same loop, basically. I'll PR in a little script. It will be JavaScript though. Can you live with that?
Yeah, any language is fine so long as we can provide instructions on how to execute it. Feel free to add any Node or other-language scripts that you like. I believe that you should have write access here. Feel free to commit them.
Note: the top 25 is the same as the alphanumeric-only results
Pair | Count |
---|---|
re | 535562 |
in | 533661 |
st | 480328 |
er | 472948 |
on | 437225 |
te | 426735 |
at | 371145 |
es | 369706 |
th | 367923 |
en | 363620 |
le | 342193 |
se | 341243 |
ti | 329740 |
nt | 326269 |
or | 301370 |
et | 298788 |
he | 284611 |
de | 277091 |
ar | 268935 |
co | 264515 |
tr | 252544 |
al | 241992 |
me | 233023 |
ct | 232167 |
is | 229123 |
Punctuation | Count |
---|---|
_ | 950644 |
, | 521376 |
. | 495027 |
( | 410880 |
* | 284346 |
= | 235834 |
\ | 217952 |
) | 207338 |
; | 187490 |
{ | 160600 |
} | 150793 |
$ | 143448 |
); | 132177 |
' | 129155 |
-> | 116950 |
" | 110955 |
: | 101162 |
# | 85641 |
- | 79425 |
@ | 77677 |
// | 65263 |
< | 57416 |
*/ | 53220 |
:: | 46384 |
/ | 46231 |
Word | Count |
---|---|
the | 138461 |
if | 106561 |
0 | 95045 |
return | 79042 |
to | 73527 |
a | 58388 |
1 | 58288 |
is | 54749 |
of | 50873 |
this | 47499 |
struct | 43786 |
int | 43458 |
in | 41061 |
for | 39853 |
name | 38186 |
end | 38093 |
and | 34397 |
void | 32227 |
array | 31923 |
public | 30924 |
new | 30679 |
2 | 29669 |
i | 29285 |
s | 28773 |
static | 28509 |
What source are you using?
That's from everything in the code corpora
I just committed the analysis script
Wow nice. How long did that analysis take to run?
14 mississippis
The word 'google' appears 20,864 times, and 'Google' appears only 880 times.
Think we should manually remove (or automate removal of) comment blocks? They may interfere with analysis of 'pure' source tokens.
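If we try it, a naive first pass for C-style sources could be a regex strip along these lines (hypothetical `stripComments` helper; it would misfire on comment markers inside string literals, so the output is approximate):

```js
// naive comment stripper: removes /* ... */ blocks and // line comments
function stripComments(source) {
  return source
    .replace(/\/\*[\s\S]*?\*\//g, "") // block comments
    .replace(/\/\/[^\n]*/g, "")       // line comments
}
```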
Nah, I think comments are an essential part of source code.
Agree, but they will interfere with your top 25 lists if you are trying to examine language syntax across the text here.
True, but that's only an issue if you're examining language syntax, isn't it? ;) I'm thinking the scope is about anything that you are reading in your editor.
I am a little worried about the discrepancy between opening (410,880) and closing (207,338) brackets. I hope that's not the result of over 200,000 frowny faces...
Want to create a set of kerning sheets as text files based on these analyses? Would be useful to have programming language specific glyph pair combinations based on what you find here.
Well, there's no point kerning a monospaced typeface, but for something proportionally spaced, that would be a good idea. My initial idea was to use these results to focus on the areas that are used most often.
Obviously, the results are interesting for monospaced typefaces as well, but more in the sense to get an idea of the balance of certain combinations, and perhaps even to create subtle ligatures to correct awkwardness. I'm going to add an analysis of triplets for that.
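The pair loop above generalizes directly; roughly something like this, with `n = 3` for triplets (sketch, hypothetical `countNgrams` helper):

```js
// count all n-character sequences within whitespace-separated tokens
function countNgrams(input, n) {
  const counts = {}
  for (const word of input.split(/\s+/)) {
    for (let i = 0; i + n <= word.length; i++) {
      const gram = word.substring(i, i + n)
      counts[gram] = (counts[gram] || 0) + 1
    }
  }
  return counts
}
```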
While there is no "kerning" per se there are definitely spacing adjustments that can be made in monospaced faces. I spent a couple of months working on this with kerning sheets in the early days of Hack. It would be interesting to view the glyphs side by side to see if it is necessary to nudge here or there. Sometimes interesting missed spacing issues arise...
and yes, I think you are correct, triples would be the way to do this! I think that I looked at all glyphs between sets of `o`, `n`, `l`, etc...
Undoubtedly. Here's a top 25 of triplets, uploading the full report now.
Triplet | Count |
---|---|
ion | 244676 |
tio | 209407 |
the | 194167 |
ing | 173320 |
ent | 156050 |
ect | 134280 |
est | 129051 |
str | 127096 |
--- | 116794 |
tur | 113763 |
ate | 112084 |
ter | 108514 |
ret | 106761 |
ati | 106678 |
con | 101808 |
etu | 101450 |
ame | 100108 |
urn | 99842 |
int | 96101 |
ons | 90699 |
sta | 89024 |
tri | 87968 |
cti | 83096 |
for | 82378 |
com | 79264 |
I've also added results with pairs and triplets that contain both at least one alphanumeric and punctuation character. It has some very interesting output.
:+1:
Top 10 languages on GitHub based upon pull request activity:
JavaScript, Python, Java, Ruby, PHP, C++, CSS, C#, Go, C
I added new issue reports for C# and Go. Thoughts about adding CSS?
> Thoughts about adding CSS?
I've looked into scraping HTML and CSS from the Alexa top 500 sites, but ran into some silly issues. Will look again later.
HTML is a good idea too. Wonder if that analysis used CSS to account for web projects that are HTML/CSS/JS?
I basically scraped the entire source: start with a domain, download the HTML, search the source for linked stylesheets and scripts, and download those as well. The source wasn't very representative though; everything is processed and minified, and that's not the source that developers see. I think manually collecting content will probably yield better results.
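In sketch form, the scrape was roughly this (not the actual script; the href/src extraction is regex-based and approximate, and it assumes Node 18+ for global `fetch`):

```js
// download a page plus its linked stylesheets and scripts
async function scrapeSite(domain) {
  const base = `https://${domain}/`
  const html = await (await fetch(base)).text()
  const assets = [...html.matchAll(/<(?:link[^>]+href|script[^>]+src)="([^"]+)"/g)]
    .map(m => new URL(m[1], base).href) // resolve relative URLs
  const sources = [html]
  for (const url of assets) {
    sources.push(await (await fetch(url)).text())
  }
  return sources
}
```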
Might be worth adding the popular HTML/CSS frameworks as part of the HTML collection too...
One of the few GitHub projects that can say this :)
The language rainbow...
Looked into that, but it is a different kind of source. It's not really an application of html/css, but rather a limited set of examples, basically. As far as html/css is concerned, I'm thinking of getting the content from a bunch of large news sites, and not just the homepages, but perhaps everything one level deep.
I believe that when I was adding the initial source, I attempted to keep each language around 1M tokens to try to achieve a balance across the languages.
Are there apps that "de-minify" CSS and JS?
> Are there apps that "de-minify" CSS and JS?
That is what source maps are for, but I've never seen them used automatically in a way such as this, only integrated in IDEs/browsers. There are also tools like prettier that format code.
But I think it is more about the things that minification removes that you can't undo. Running all JS and CSS through prettier would yield nice results, but it is still not what the developer saw, with whitespace (around punctuation) being the biggest missing item. It might be possible to target the /src directories of the project pages of large(r) open source projects. They tend to contain the raw source code from which the site is built, each using its own styleguide.
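For illustration, the prettier pass could look something like this (hypothetical `normalize` helper; `prettier.format` returns a Promise in v3, and awaiting it also works with the older synchronous API):

```js
const prettier = require("prettier")

// reformat scraped JS/CSS into a consistent style before analysis
async function normalize(source, language) {
  const parser = language === "css" ? "css" : "babel"
  return await prettier.format(source, { parser })
}
```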
SASS/LESS still a thing? People still using these "transpiled" style syntaxes for web dev?
> SASS/LESS still a thing? People still using these "transpiled" style syntaxes for web dev?
Pretty much exclusively. The same goes for JavaScript. Almost everything is transpiled in some form, even the simple things, just for cross-browser compatibility. Practically none of the code I put in production is what I actually wrote. It doesn't even look like what I wrote.
That suggests all of these pre-transpiled languages are what need to be here, then?
You web devs and your transpiling... Why not just create a language that works from the start ;)
Go source files are now available. I added the entire Golang repository with all test files removed (they included text, images, etc. for testing purposes).
Here is a list of the top fifty tokens in these files:
```
[',', '(', ')', '=', ':', '{', '}', "''", '//', '``', '[', ']', 'if', '&', '0', '!', '1', 'return', '%', '<', 'the', 'func', 'x', 'err', '==', '>', 'nil', ';', 'for', 'a', 's', '_', 'break', 'true', 'i', 'int', 'is', 'to', 't', 'd', 'v', 'y', 'v.Args', 'of', 'b', 'case', 'string', 'c', 'got', 'in']
```
I'd like to have the ability to see sorted lists of all glyph pairs and triplets for kerning and ligature purposes.
I'm interested in two variations:
1) Combinations of alphanumeric characters, limited to tokens.
2) Combinations of any characters, limited to virtual tokens with whitespace as separators.

The input

```
echo "Hello world!";
```

would yield the following results for glyph pairs:

1) `ec ch ho He el ll lo wo or rl ld`
2) `ec ch ho "H He el ll lo wo or rl ld d! !" ";`
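A sketch of how both variations could be implemented (hypothetical `glyphPairs` helper, untested against edge cases):

```js
// variation 1: only pairs of alphanumeric characters
// variation 2: any pair inside a whitespace-separated "virtual token"
function glyphPairs(input) {
  const alnumPairs = []
  const anyPairs = []
  for (const token of input.split(/\s+/).filter(Boolean)) {
    for (let i = 0; i + 2 <= token.length; i++) {
      const pair = token.substring(i, i + 2)
      anyPairs.push(pair)
      if (/^[a-zA-Z0-9]{2}$/.test(pair)) alnumPairs.push(pair)
    }
  }
  return { alnumPairs, anyPairs }
}

console.log(glyphPairs('echo "Hello world!";'))
```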
Notes:
Further improvements: