paperswithcode / galai

Model API for GALACTICA
Apache License 2.0

30b Checkpoint pickle is published with half precision, no bias tensors and no final layers #37

Closed Jackmin801 closed 1 year ago

Jackmin801 commented 1 year ago

The 30b model pickles seem to have no biases.

from tqdm import tqdm
import torch
from pathlib import Path
import pickle

# Local Hugging Face cache blobs for the 30b checkpoint
blob_path = Path.home() / Path('.cache/huggingface/hub/models--facebook--galactica-30b/blobs')

keys2blob = {}  # maps each state-dict key to the shard (blob) that contains it
errors = {}
blobs = [blob for blob in blob_path.glob('./*') if blob.is_file()]

for blob in tqdm(blobs):
    try:
        keys2blob.update({k: blob for k in torch.load(blob).keys()})
    except pickle.UnpicklingError as e:
        # Non-pickle blobs (config, tokenizer files, etc.) end up here
        errors[blob] = e

print(f"Num_weights: {len([i for i in keys2blob.keys() if 'weight' in i])}")
print(f"Num_biases: {len([i for i in keys2blob.keys() if 'bias' in i])}")
100%|██████████| 12/12 [00:50<00:00,  4.19s/it]
Num_weights: 290
Num_biases: 0
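The title also mentions half precision. A minimal sketch to tally tensor dtypes, reusing the blobs list built above (map_location='cpu' is added here so the shards load without a GPU; this is not part of the original script):

from collections import Counter

# Tally the dtype of every tensor across the shard pickles.
dtype_counts = Counter()
for blob in blobs:
    try:
        for tensor in torch.load(blob, map_location='cpu').values():
            if torch.is_tensor(tensor):
                dtype_counts[tensor.dtype] += 1
    except pickle.UnpicklingError:
        pass  # non-pickle blobs (config, tokenizer files, etc.)

print(dtype_counts)

For the 30b blobs this should report only torch.float16, consistent with the dtype counts in the table further down.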

This is in contrast to the 6.7b model, which contains many bias tensors (same script, pointed at the 6.7b blobs):

from tqdm import tqdm
import torch
from pathlib import Path
import pickle

blob_path = Path.home() / Path('.cache/huggingface/hub/models--facebook--galactica-6.7b/blobs')

keys2blob = {}
errors = {}
blobs = [blob for blob in blob_path.glob('./*') if blob.is_file()]

for blob in tqdm(blobs):
    try:
        keys2blob.update({k: blob for k in torch.load(blob).keys()})
    except pickle.UnpicklingError as e:
        errors[blob] = e

print(f"Num_weights: {len([i for i in keys2blob.keys() if 'weight' in i])}")
print(f"Num_biases: {len([i for i in keys2blob.keys() if 'bias' in i])}")
50%|█████     | 4/8 [00:14<00:14,  3.57s/it]
Num_weights: 260
Num_biases: 257

I do not believe I am missing any pickles, because the disk usage of the cloned repository tallies with what is displayed on the Hugging Face site (note that du reports gibibytes, 1 GiB = 2^30 bytes ≈ 1.074 GB, which likely explains the slight discrepancy in the raw numbers).

❯ du -csh ./models--facebook--galactica-30b/blobs/*
785M    ./models--facebook--galactica-30b/blobs/0379c39b5a0cb59453b14738ef1d4924e93599aba4e57f2599036e76f36532f6
9.2G    ./models--facebook--galactica-30b/blobs/05db345d4fcca580bed2c6e9d0fe8feead207c2c2fa8384c27c94cbd4ed0e0bf
4.0K    ./models--facebook--galactica-30b/blobs/0967ef424bce6791893e9a57bb952f80fd536e93
9.2G    ./models--facebook--galactica-30b/blobs/0d6ce164b560f4601d48f61c2a8d598106faa9f4b89c39334a712429649b75c8
4.0K    ./models--facebook--galactica-30b/blobs/28e11da7e191492f3f23d2aa35e9b60f8e9becf6
9.2G    ./models--facebook--galactica-30b/blobs/30a274571d49a30bb4d6872e69b96ad191fa22c92427d160c74ce225a566bd71
24K     ./models--facebook--galactica-30b/blobs/98d10d1a52ab2b70f1deff472512cbaa6065e317
9.2G    ./models--facebook--galactica-30b/blobs/aa79446f17da0f3b9f8815a3628c2b1935936ec819f09a5865ce4e3c4ee51aa7
9.2G    ./models--facebook--galactica-30b/blobs/b919005245e2b77d57bf3a73ac18415083aa32b6e2e4e89c96b8d988453a0e7f
4.0K    ./models--facebook--galactica-30b/blobs/bc97f8a9458a1fe096bec5d8ec938a02647bc4bb
9.2G    ./models--facebook--galactica-30b/blobs/c1cad10954e544c44aabd29f31e67292d1bc819d2e7b9842f14fdcef88d58f93
2.1M    ./models--facebook--galactica-30b/blobs/e18054f92dc016b43c940dd1c4a1c5da884539c0
56G     total
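The "no final layers" part of the title could be checked the same way, by diffing the key sets from the two runs. A minimal sketch, assuming the two mappings built above were kept under the hypothetical names keys2blob_30b and keys2blob_6b7 instead of reusing one keys2blob variable:

def pattern(key):
    # Replace numeric layer indices with '*' so models of different depths compare cleanly.
    return '.'.join('*' if part.isdigit() else part for part in key.split('.'))

# Hypothetical variables: the key-to-blob mappings from the 30b and 6.7b runs above.
missing_in_30b = {pattern(k) for k in keys2blob_6b7} - {pattern(k) for k in keys2blob_30b}
print(sorted(missing_in_30b))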


mkardas commented 1 year ago

Yes, the models use no biases in general and no element-wise affine transformations in layer norms by design. Can you check if the biases are present in the 6.7B checkpoints that we published or if the biases in your checkpoints are non-zero?

Jackmin801 commented 1 year ago

All the biases in the 6.7B checkpoint are zero.
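A minimal sketch of such a check, reusing the keys2blob mapping from the 6.7b run above (not necessarily the exact script used; reloading a shard per key is slow but simple):

# Confirm that every bias tensor in the checkpoint is exactly zero.
all_zero = True
for key, blob in keys2blob.items():
    if 'bias' not in key:
        continue
    tensor = torch.load(blob, map_location='cpu')[key]
    if not torch.all(tensor == 0):
        all_zero = False
        print(f"Non-zero bias: {key}")

print(f"All biases zero: {all_zero}")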

Jackmin801 commented 1 year ago

I've checked the other model checkpoints. All of them have bias tensors that are all zero, except 30b, which has no bias tensors at all.

This is the info I have about the checkpoints so far, with some notable differences in the 30b checkpoint:

| Size | Parameters | Disk usage | Bytes / parameter | Sum(layer.numels) | Tensor dtypes (count) |
|------|------------|------------|-------------------|-------------------|-----------------------|
| mini | 125 M | 480M | 4.0265 | 163,430,400 | {torch.float32: 197} |
| base | 1.3 B | 5.0G | 4.1298 | 1,417,601,024 | {torch.float32: 389} |
| standard | 6.7 B | 26G | 4.1667 | 6,862,159,872 | {torch.float32: 517} |
| large | 30 B | 56G | 2.0043 | 29,968,103,424 | {torch.float16: 290} |
| huge | 120 B | 453G | 4.0534 | 121,853,747,200 | {torch.float32: 1541} |
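The 30b row stands out at roughly 2 bytes per parameter instead of roughly 4, which matches float16 storage: 29,968,103,424 parameters × 2 bytes ≈ 59.9 GB ≈ 55.8 GiB, consistent with the 56G du total above. A minimal sketch of how these per-checkpoint numbers can be computed, using the same blob-loading pattern as above (the summarize helper is hypothetical, not from the original scripts):

import pickle
from collections import Counter
from pathlib import Path
import torch

def summarize(blob_dir):
    """Tally parameter count, bytes per parameter, and tensor dtypes for one cached checkpoint."""
    num_params = 0
    disk_bytes = 0
    dtypes = Counter()
    for blob in blob_dir.glob('./*'):
        if not blob.is_file():
            continue
        disk_bytes += blob.stat().st_size
        try:
            state_dict = torch.load(blob, map_location='cpu')
        except pickle.UnpicklingError:
            continue  # config / tokenizer blobs
        for tensor in state_dict.values():
            if torch.is_tensor(tensor):
                num_params += tensor.numel()
                dtypes[tensor.dtype] += 1
    return num_params, disk_bytes / num_params, dict(dtypes)

print(summarize(Path.home() / '.cache/huggingface/hub/models--facebook--galactica-30b/blobs'))
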
mkardas commented 1 year ago

The checkpoint is now fixed as part of https://huggingface.co/facebook/galactica-30b/discussions/6. All the checkpoints are now fully compatible with the OPT architecture and use float16 weights, with layer norm weights set to ones and all biases set to zero.
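For reference, a minimal sketch of loading the repaired checkpoint through transformers, assuming the standard OPTForCausalLM path (the 30b model needs roughly 60 GB of memory in float16; device placement is omitted):

import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
# The fixed checkpoint ships float16 weights, so load it in float16 directly.
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", torch_dtype=torch.float16)

inputs = tokenizer("The Transformer architecture", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))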