rtkclouds / fast-js-language-model

language model implementation for JS
MIT License

Paper to support the current project #1

Open iSevenDays opened 1 year ago

iSevenDays commented 1 year ago

Hi @rtkclouds Thank you for your interesting project!

Do you have any paper or references where I can read more about your idea?

rtkclouds commented 1 year ago

> Hi @rtkclouds Thank you for your interesting project!
>
> Do you have any paper or references where I can read more about your idea?

I'm developing this myself. Basically, it is a structure of hierarchical embedding vectors. I haven't posted anything else yet because I'm finishing the final version to publish as a release; I expect to post the rest this week. I tried to talk to some people in the field, but they didn't pay attention, so I decided to finish it and post it here for those who are looking.

Currently, we work with structures of the following form:

Words: Question: [word1][word2][word3][word4][word5] Answer: [word3][word4][word1][word6][word2]

Identifiers: Question: 1 2 3 4 5 Answer: 3 4 1 6 2

I propose the following:

Words: Question: [word1][word2][word3][word4][word5] Answer: [word3][word4][word1][word6][word2]

Identifiers: Question: 1a 2a 3a 4a 5a Answer: 3b 4b 1b 6b 2b

Imagine that, using a simple tool like w2v, you create classes derived from the grouping:

1 2 3 4 5 3 4 1 6 2

Note that it is harder to relate an identifier to the question or the answer: distance alone cannot provide certainty about this, and neither can position or the words themselves. However, if we add a question/answer check digit to the identifier, such as "a" for question and "b" for answer, we would have:

1a 2a 3a 4a 5a 3b 4b 1b 6b 2b

By generating derived classes and replacing each value with its corresponding class in the prediction, it is possible to reduce the dimension of the common classes. Doing this recurrently, and then reversing it, generates hierarchies between questions and answers in a simple and robust way.
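
To make the check-digit idea concrete, here is a minimal sketch in plain JS. The function name `tagSequence` and the `a`/`b` suffixes are my own illustration of what is described above, not the repository's actual API.

    // Append a role suffix ("a" = question, "b" = answer) to each token identifier,
    // so a grouping tool like w2v can tell question and answer usages apart.
    function tagSequence(questionIds, answerIds) {
      const tagged = [];
      for (const id of questionIds) tagged.push(`${id}a`);
      for (const id of answerIds) tagged.push(`${id}b`);
      return tagged;
    }

    // Question: 1 2 3 4 5   Answer: 3 4 1 6 2
    console.log(tagSequence([1, 2, 3, 4, 5], [3, 4, 1, 6, 2]));
    // -> [ '1a', '2a', '3a', '4a', '5a', '3b', '4b', '1b', '6b', '2b' ]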

rtkclouds commented 1 year ago

I believe I can train models with up to 1.4 billion parameters in this version https://github.com/ggerganov/llama.cpp/discussions/1646

rtkclouds commented 1 year ago

> Hi @rtkclouds Thank you for your interesting project!
>
> Do you have any paper or references where I can read more about your idea?

I don't have experience with papers, nor am I in the field; I work in trade. If you want to write one, just add a note so I'm not left out of the story. This model can learn up to 4096 tokens forward. I realized that learning was impaired when training was done randomly from the beginning, so it has a batch-size controller that works the way we control the page of a book when reading: when the batch increases, the model does not get lost. In other versions I regulate learning entirely by perplexity, always keeping it low.
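
As a rough illustration of the pacing described above (read the corpus like a book and only turn the page while perplexity stays low), here is a small sketch in JS. The class name `PageController` and the threshold value are assumptions for the example, not the actual implementation.

    // Hypothetical "book page" batch controller: advance through the token stream
    // sequentially, and only move forward while perplexity stays below a threshold.
    class PageController {
      constructor({ pageSize = 4096, maxPerplexity = 20 } = {}) {
        this.pageSize = pageSize;           // tokens visible per training "page"
        this.maxPerplexity = maxPerplexity; // arbitrary threshold for this sketch
        this.offset = 0;                    // current position in the corpus
      }

      // Token window for the next training step.
      nextPage(tokens) {
        return tokens.slice(this.offset, this.offset + this.pageSize);
      }

      // After a step, turn the page only if the model is not "lost".
      update(perplexity, corpusLength) {
        if (perplexity <= this.maxPerplexity) {
          this.offset = Math.min(
            this.offset + this.pageSize,
            Math.max(0, corpusLength - this.pageSize)
          );
        }
        // Otherwise stay on the same page until perplexity drops.
      }
    }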

iSevenDays commented 1 year ago

@rtkclouds wow, those are some impressive results!

My current config is

cfg = {
    'sequenceSize': 256,
    'dimension': 384,
    'arrayDimension': 8,
    'predictSteps': 8,
    'batchSize': 4096 * 5
}

learning_rate = 0.0005. It takes up to 10.22 GB of VRAM. Loss has been decreasing and accuracy increasing, although I still need to add an actual validation loss metric.

[screenshots: training loss and accuracy curves]
rtkclouds commented 1 year ago

> @rtkclouds wow, those are some impressive results!
>
> My current config is
>
>     cfg = {
>         'sequenceSize': 256,
>         'dimension': 384,
>         'arrayDimension': 8,
>         'predictSteps': 8,
>         'batchSize': 4096 * 5
>     }
>
> learning_rate = 0.0005. It takes up to 10.22 GB of VRAM. Loss has been decreasing and accuracy increasing, although I still need to add an actual validation loss metric.

Nice! I'm doing my best to finish the whole version today and upload it. With more people like you, it is easy to get something of quality quickly. I already have all the code to get started: it includes the trainer, the hierarchical w2v system, and the decision trees. I realized that it is easier to get the transformers to converge thanks to the reduced number of words; the original tokens were too complex, so I decided to segment the vector representations into simpler classes: 55k GPT-2 tokens, 16k classes, 1024 classes, 512 classes...

That is, the classes are dependent: they are created by redistributing the classes that a k-means generates from the vectors, substituting each class in at the position of its vector. Then w2v is run again and a new, smaller set of classes is generated, creating a kind of U-Net logic but over tokens. That way we can use the classes as tags to train in a generalized way; a rough sketch of this hierarchy construction follows. I don't know if you are interested, but if you want to write an article and just mention me as a co-author, it would help me a lot.
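
For concreteness, here is a rough sketch (plain JS, no external libraries) of how such a token-class hierarchy could be built under my own assumptions: embed the current IDs with a word2vec-like step, cluster the vectors with k-means, relabel the corpus with the cluster IDs, and repeat with a smaller k (e.g. 55k GPT-2 tokens, then 16k, 1024, 512). The `kmeans`, `buildHierarchy`, and `embed` names are illustrative, not the repository's code.

    // Minimal k-means over an array of equal-length vectors; returns a cluster
    // label per vector. Deterministic init from the first k vectors, for brevity.
    function kmeans(vectors, k, iterations = 20) {
      let centroids = vectors.slice(0, k).map(v => v.slice());
      let labels = new Array(vectors.length).fill(0);
      for (let it = 0; it < iterations; it++) {
        // Assignment step: each vector goes to its nearest centroid.
        labels = vectors.map(v => {
          let best = 0, bestDist = Infinity;
          for (let c = 0; c < centroids.length; c++) {
            let d = 0;
            for (let j = 0; j < v.length; j++) {
              const diff = v[j] - centroids[c][j];
              d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
          }
          return best;
        });
        // Update step: move each centroid to the mean of its assigned vectors.
        const sums = centroids.map(c => new Array(c.length).fill(0));
        const counts = new Array(centroids.length).fill(0);
        vectors.forEach((v, i) => {
          counts[labels[i]]++;
          v.forEach((x, j) => { sums[labels[i]][j] += x; });
        });
        centroids = sums.map((s, c) =>
          counts[c] > 0 ? s.map(x => x / counts[c]) : centroids[c]);
      }
      return labels;
    }

    // Build successively coarser class sequences from a token-ID corpus.
    // `embed(corpus, ids)` is assumed to return one vector per distinct ID.
    function buildHierarchy(corpus, levels, embed) {
      const hierarchy = [corpus];
      let current = corpus;
      for (const k of levels) {               // e.g. [16000, 1024, 512]
        const ids = [...new Set(current)];
        const vectors = embed(current, ids);  // word2vec-like step, rerun per level
        const labels = kmeans(vectors, Math.min(k, ids.length));
        const classOf = new Map(ids.map((id, i) => [id, labels[i]]));
        current = current.map(id => classOf.get(id)); // relabel with coarser classes
        hierarchy.push(current);
      }
      return hierarchy;
    }

Each level's sequence can then serve as the coarser "tag" stream described above, and the levels can be traversed back down, giving the U-Net-like structure over tokens.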