state-spaces / mamba

Mamba SSM architecture
Apache License 2.0

Wikitext pipeline #8

Open elephantmipt opened 10 months ago

elephantmipt commented 10 months ago

Hi, could you please share the pipeline for the wikitext dataset? I found reported results of 16.3 perplexity for Mamba and 18 for the transformer baseline (vs. 18.6 reported everywhere else), and I cannot reproduce them. Maybe there is something different in the preprocessing. Could you provide any details on preprocessing steps or hyperparameters that may differ from the defaults? Understanding those differences would help me reproduce the results.

albertfgu commented 9 months ago

The pipeline should be similar to the one from the H3/Hyena repo, which was forked off our internal repo. A couple of noteworthy hyperparameters seem to differ from their base settings:

elephantmipt commented 9 months ago

Thank you for the detailed reply! I found that the smallest mamba-130m model uses 24 layers instead of 12, according to the config. Is this also the case for the wikitext experiment? Additionally, I was able to reproduce the results based on the Hyena repo, achieving a best test perplexity of 15.9 without using dropout. This was done using a 24-layer model, which is quite interesting.

[image: test_perp]

albertfgu commented 9 months ago

You should be able to parameter-match an equivalent Transformer baseline by doubling the number of layers. Each Mamba block has 6D^2 parameters, where D = d_model is the model width. Each Transformer block has 4D^2 parameters in the MHA and 8D^2 parameters in the MLP.
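To make that arithmetic concrete, here is a quick back-of-the-envelope check (my own illustration, not code from the repo), ignoring embeddings, norms, biases, and the SSM-specific parameters:

```python
# Rough per-block parameter counts, assuming d_model = 768 (the 130m-scale width).
D = 768

mamba_block = 6 * D**2                    # ~3.54M parameters per Mamba block
transformer_block = 4 * D**2 + 8 * D**2   # MHA + MLP = 12 D^2, ~7.08M per block

print(mamba_block, transformer_block)              # 3538944 7077888
print(24 * mamba_block == 12 * transformer_block)  # True: 24 Mamba blocks match 12 Transformer blocks
```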

Dropout of 0.2-0.3 should help significantly on this task. Wikitext-103 is very small, so it's more about overfitting than actually modeling the data well, which is why we avoid using it as a benchmark when possible. We normally don't use dropout at all, so I don't think it's even included in the simple Mamba block example in this codebase. You might have to modify the block yourself and experiment with where to place the dropouts (as a start, after the "gate" multiplication and before the final linear might be sensible). There's probably plenty of room for improvement here; we literally didn't tune it at all.
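For reference, a simplified, hypothetical sketch of that placement (this is not the actual mamba_ssm block; the `mixer` stand-in and module names are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBlockWithDropout(nn.Module):
    """Toy gated block showing one possible dropout placement."""
    def __init__(self, d_model, d_inner, dropout=0.25):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_inner)  # produces x and the "gate" z
        self.mixer = nn.Identity()                       # stand-in for the conv + SSM scan
        self.drop = nn.Dropout(dropout)
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, u):
        x, z = self.in_proj(u).chunk(2, dim=-1)
        y = self.mixer(x)        # sequence mixing (SSM in the real block)
        y = y * F.silu(z)        # "gate" multiplication
        y = self.drop(y)         # dropout here, before the final linear
        return self.out_proj(y)
```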

elephantmipt commented 9 months ago

Thank you for clarifying! I found it confusing that the README mentions a 12-layer model (https://github.com/state-spaces/mamba/blob/2ee7fd287a8f5c826af6f69ae3aad4682c4afd15/README.md?plain=1#L85), while on Hugging Face there is a 24-layer model.

tridao commented 9 months ago

Thanks, we'll fix that in the README. Mamba 130m has 24 layers, matching Transformers with 12 layers.

tridao commented 9 months ago

> Thank you for the detailed reply! I found that the smallest mamba-130m model uses 24 layers instead of 12, according to the config. Is this also the case for the wikitext experiment? Additionally, I was able to reproduce the results based on the Hyena repo, achieving a best test perplexity of 15.9 without using dropout. This was done using a 24-layer model, which is quite interesting.
>
> [image: test_perp]

Did you use the hparams that Albert mentioned, or something else? As Albert said, we didn't really tune for wt103; pretty cool that you're getting better numbers on wt103.

albertfgu commented 9 months ago

The README mentions the doubled layer count right below the table; do you have a suggestion for a presentation that would be clearer?


elephantmipt commented 9 months ago

> The README mentions the doubled layer count right below the table; do you have a suggestion for a presentation that would be clearer?

I think 96ec4e4 solved all my concerns.

> Did you use the hparams that Albert mentioned, or something else? As Albert said, we didn't really tune for wt103; pretty cool that you're getting better numbers on wt103.

I used the data-loading pipeline from the S5 repo and adapted the training loop to work with PyTorch. Long story short, here are my hyperparameters (batch_size is global):

```yaml
# Training
lr: 0.001
weight_decay: 0.25

dtype: "bf16"

lr_schedule: "cosine"
total_steps: 115000
warmup_steps: 1000
save_interval: 200000
log_interval: 150

# Data
dataset: "wikitext103"
vocab_size: 50257
l_max: 1024

data_kwargs:
    batch_size: 16
    batch_size_eval: 16
    num_workers: 16
    pin_memory: False
    data_dir: cache

# Model
d_model: 768
n_layer: 24
```
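For anyone trying to reproduce this, below is a rough sketch of what the data side of such a config implies: the vocab_size of 50257 suggests GPT-2 BPE tokenization, with the corpus concatenated and chunked into l_max = 1024 blocks. This is my own illustrative reconstruction (using Hugging Face `datasets`/`transformers`), not the actual S5/H3 pipeline:

```python
import torch
from datasets import load_dataset
from transformers import GPT2TokenizerFast

L_MAX = 1024  # sequence length from the config above

raw = load_dataset("wikitext", "wikitext-103-raw-v1")
tok = GPT2TokenizerFast.from_pretrained("gpt2")  # 50257-token BPE vocab

def tokenize(batch):
    return {"ids": tok(batch["text"], add_special_tokens=False)["input_ids"]}

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

def make_blocks(split):
    # Concatenate all documents and slice into fixed-length blocks.
    ids = [i for doc in tokenized[split]["ids"] for i in doc]
    n = (len(ids) // L_MAX) * L_MAX
    return torch.tensor(ids[:n]).view(-1, L_MAX)

# Inputs for language modeling; targets are the usual one-token shift.
train_blocks = make_blocks("train")
train_loader = torch.utils.data.DataLoader(train_blocks, batch_size=16, shuffle=True)
```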

I should mention that the model actually has 130M parameters rather than 125M, so the better results could partly be attributed to these additional 5M parameters. If you are interested, I can share my pipeline with you.

Thank you once again for all your replies.

albertfgu commented 9 months ago

I don't think the parameter increase matters much. The bigger difference is probably a longer training time.

Sawyer117 commented 7 months ago

Hi @elephantmipt, sorry if this is a naive question, but are you fine-tuning on wiki103 here or pretraining on it? Thanks!