If we're gonna define identifiers and tokens, we might as well also add literals, graphlets and ~quantification~ quantization. I think we could divide the glossary into:
- terms that mean something more specific than would usually be the case, or that are vague to start with, e.g. "model" meaning a modelforge model, "words" in BOW being any feature extracted from a document, "document" meaning a repo/file or function, etc.
- terms that we use in the way they are intended but that may not be well known. Of course people have Google, but we might as well drop a couple of lines to explain the concept, e.g. COOC, quantization, topics, TF-IDF.
We constantly confuse terms ourselves, so what can we expect from other developers? I do not want to make the list complete, just to have a start.
Here is a list of terms to explain in the first iteration:
Googleable terms we may comment on:
@src-d/machine-learning please take a look and add any confusing terms you remember.