karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Alternative to "Vector Representation Pre-training" possible? #30

Open Heavy02011 opened 9 months ago

Heavy02011 commented 9 months ago

In the paper Driving with LLMs, a "Vector Representation Pre-training" is used to interact with llama-7b as follows: "In our framework, we aim to convert vector representations into language using a structured language generator to facilitate the grounding of the vector representation into LLMs. Since our object-level vectors contain semantically significant attributes, such as the number of cars and pedestrians, their respective locations, orientations, speeds, bounding boxes and other attributes, we employ a structured language generator (lanGen) function to craft pseudo-language labels derived from the vector space…"
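For intuition, a lanGen-style generator might look roughly like the sketch below. The field names and phrasing here are made up for illustration and are not taken from the paper's code:

```python
# Hypothetical sketch of a lanGen-style structured language generator.
# Field names and phrasing are illustrative, not the paper's actual code.

def lan_gen(scene: dict) -> str:
    """Turn object-level vector attributes into a pseudo-language label."""
    parts = [f"There are {scene['num_cars']} cars and "
             f"{scene['num_pedestrians']} pedestrians."]
    for obj in scene["objects"]:
        parts.append(
            f"A {obj['kind']} at ({obj['x']:.1f}, {obj['y']:.1f}) m, "
            f"heading {obj['heading_deg']:.0f} degrees, "
            f"moving at {obj['speed_mps']:.1f} m/s."
        )
    return " ".join(parts)

scene = {
    "num_cars": 2,
    "num_pedestrians": 1,
    "objects": [
        {"kind": "car", "x": 12.0, "y": -3.5, "heading_deg": 90, "speed_mps": 8.3},
    ],
}
print(lan_gen(scene))
```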

Question: can we apply or modify minbpe to achieve the same? Or is this a silly question…

marcov-dart commented 8 months ago

Sounds interesting. I should read the paper first, but just going off the text you provided, I am unsure how you are relating it to BPE. They are going from vector representations to pseudo-language. BPE encodes text as a sequence of integers, where the integers are then used to look up the vector representations. So the other way around, basically. But maybe you are thinking about reversing the principle?
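For reference, the standard BPE pipeline goes the other way, roughly like this (using minbpe's BasicTokenizer; the embedding matrix below is a random stand-in for the model's learned one):

```python
import numpy as np
from minbpe import BasicTokenizer

# Train a small BPE vocabulary and encode text to integer token ids.
tok = BasicTokenizer()
tok.train("some long training text ..." * 100, vocab_size=300)
ids = tok.encode("the bus stops here")

# Each id indexes a row of the (normally learned) embedding matrix.
vocab_size, d_model = 300, 64
embedding = np.random.randn(vocab_size, d_model)  # stand-in for trained weights
vectors = embedding[ids]                          # text -> integers -> vectors
print(ids, vectors.shape)
```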

It is something I have been thinking about, and I tried to get some discussion started in another post. Not much response as of yet, unfortunately. I was thinking of a scheme where the tokeniser doesn't just give a sequence of tokens but also a set of modifiers for the tokens. The integer we would use as normal to look up the token embedding, but the modifiers we map onto synthetic (non-trained) dimensions of the token embedding.
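A rough sketch of that idea, with hypothetical dimension names (the modifier set and sizes are made up for illustration): the first dimensions of each token vector come from a learned table, and the last few are set directly from the modifiers rather than trained.

```python
import numpy as np

# Hypothetical sketch: token id -> learned embedding, modifiers -> synthetic dims.
VOCAB, LEARNED_DIMS = 1000, 60

learned = np.random.randn(VOCAB, LEARNED_DIMS)  # stand-in for trained embeddings

def embed(token_id: int, modifiers: dict) -> np.ndarray:
    # Synthetic, non-trained dimensions encode the variation directly.
    synth = np.array([
        modifiers.get("capitalised", 0.0),     # "Bus" vs "bus"
        modifiers.get("plural", 0.0),          # "busses" vs "bus"
        modifiers.get("leading_space", 0.0),   # " bus" vs "bus"
        modifiers.get("language", 0.0),        # e.g. 0 = English, 1 = German
    ])
    return np.concatenate([learned[token_id], synth])

v = embed(token_id=42, modifiers={"capitalised": 1.0, "plural": 1.0})
print(v.shape)  # (64,)
```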

The tokeniser would have to be very different from just doing BPE. Instead, we would attempt to build a tokeniser using a priori knowledge of different languages. Just as an example:

" bus", "Bus", "busses" for instance would all map to the same integer and the synthetic dimensions would provide the information of the specific variation. Maybe language could be a modifier too. So that the same word in different languages also have the same integer.