paperswithcode / galai

Model API for GALACTICA
Apache License 2.0

30b Checkpoint pickle is published with half precision, no bias tensors and no final layers #37

Closed Jackmin801 closed 1 year ago

Jackmin801 commented 1 year ago

The 30b model pickles seem to have no biases.

from tqdm import tqdm
import torch
from pathlib import Path
import pickle

# Local Hugging Face cache blobs for the 30b checkpoint
blob_path = Path.home() / Path('.cache/huggingface/hub/models--facebook--galactica-30b/blobs')

keys2blob = {}  # maps each state-dict key to the shard (blob) that contains it
errors = {}
blobs = [blob for blob in blob_path.glob('./*') if blob.is_file()]

for blob in tqdm(blobs):
    try:
        keys2blob.update({k: blob for k in torch.load(blob).keys()})
    except pickle.UnpicklingError as e:
        # Non-pickle blobs (config, tokenizer files, etc.) end up here
        errors[blob] = e

print(f"Num_weights: {len([i for i in keys2blob.keys() if 'weight' in i])}")
print(f"Num_biases: {len([i for i in keys2blob.keys() if 'bias' in i])}")
100%|██████████| 12/12 [00:50<00:00,  4.19s/it]
Num_weights: 290
Num_biases: 0
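The title also mentions half precision. A minimal sketch to tally tensor dtypes, reusing the blobs list built above (map_location='cpu' is added here so the shards load without a GPU; this is not part of the original script):

from collections import Counter

# Tally the dtype of every tensor across the shard pickles.
dtype_counts = Counter()
for blob in blobs:
    try:
        for tensor in torch.load(blob, map_location='cpu').values():
            if torch.is_tensor(tensor):
                dtype_counts[tensor.dtype] += 1
    except pickle.UnpicklingError:
        pass  # non-pickle blobs (config, tokenizer files, etc.)

print(dtype_counts)

For the 30b blobs this should report only torch.float16, consistent with the dtype counts in the table further down.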

This is in contrast to the 6.7b model, which contains many bias tensors (same script, pointed at the 6.7b blobs):

from tqdm import tqdm
import torch
from pathlib import Path
import pickle

blob_path = Path.home() / Path('.cache/huggingface/hub/models--facebook--galactica-6.7b/blobs')

keys2blob = {}
errors = {}
blobs = [blob for blob in blob_path.glob('./*') if blob.is_file()]

for blob in tqdm(blobs):
    try:
        keys2blob.update({k: blob for k in torch.load(blob).keys()})
    except pickle.UnpicklingError as e:
        errors[blob] = e

print(f"Num_weights: {len([i for i in keys2blob.keys() if 'weight' in i])}")
print(f"Num_biases: {len([i for i in keys2blob.keys() if 'bias' in i])}")
50%|█████     | 4/8 [00:14<00:14,  3.57s/it]
Num_weights: 260
Num_biases: 257

I do not believe I am missing any pickles, because the disk usage of the cloned repository tallies with what is displayed on the Hugging Face site (note that du reports gibibytes, 1 GiB = 2^30 bytes ≈ 1.074 GB, which likely explains the slight discrepancy in the raw numbers).

❯ du -csh ./models--facebook--galactica-30b/blobs/*
785M    ./models--facebook--galactica-30b/blobs/0379c39b5a0cb59453b14738ef1d4924e93599aba4e57f2599036e76f36532f6
9.2G    ./models--facebook--galactica-30b/blobs/05db345d4fcca580bed2c6e9d0fe8feead207c2c2fa8384c27c94cbd4ed0e0bf
4.0K    ./models--facebook--galactica-30b/blobs/0967ef424bce6791893e9a57bb952f80fd536e93
9.2G    ./models--facebook--galactica-30b/blobs/0d6ce164b560f4601d48f61c2a8d598106faa9f4b89c39334a712429649b75c8
4.0K    ./models--facebook--galactica-30b/blobs/28e11da7e191492f3f23d2aa35e9b60f8e9becf6
9.2G    ./models--facebook--galactica-30b/blobs/30a274571d49a30bb4d6872e69b96ad191fa22c92427d160c74ce225a566bd71
24K     ./models--facebook--galactica-30b/blobs/98d10d1a52ab2b70f1deff472512cbaa6065e317
9.2G    ./models--facebook--galactica-30b/blobs/aa79446f17da0f3b9f8815a3628c2b1935936ec819f09a5865ce4e3c4ee51aa7
9.2G    ./models--facebook--galactica-30b/blobs/b919005245e2b77d57bf3a73ac18415083aa32b6e2e4e89c96b8d988453a0e7f
4.0K    ./models--facebook--galactica-30b/blobs/bc97f8a9458a1fe096bec5d8ec938a02647bc4bb
9.2G    ./models--facebook--galactica-30b/blobs/c1cad10954e544c44aabd29f31e67292d1bc819d2e7b9842f14fdcef88d58f93
2.1M    ./models--facebook--galactica-30b/blobs/e18054f92dc016b43c940dd1c4a1c5da884539c0
56G     total
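The "no final layers" part of the title could be checked the same way, by diffing the key sets from the two runs. A minimal sketch, assuming the two mappings built above were kept under the hypothetical names keys2blob_30b and keys2blob_6b7 instead of reusing one keys2blob variable:

def pattern(key):
    # Replace numeric layer indices with '*' so models of different depths compare cleanly.
    return '.'.join('*' if part.isdigit() else part for part in key.split('.'))

# Hypothetical variables: the key-to-blob mappings from the 30b and 6.7b runs above.
missing_in_30b = {pattern(k) for k in keys2blob_6b7} - {pattern(k) for k in keys2blob_30b}
print(sorted(missing_in_30b))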


mkardas commented 1 year ago

Yes, the models use no biases in general and no element-wise affine transformations in layer norms by design. Can you check if the biases are present in the 6.7B checkpoints that we published or if the biases in your checkpoints are non-zero?

Jackmin801 commented 1 year ago

All the biases in the 6.7B checkpoint are zero.
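A minimal sketch of such a check, reusing the keys2blob mapping from the 6.7b run above (not necessarily the exact script used; reloading a shard per key is slow but simple):

# Confirm that every bias tensor in the checkpoint is exactly zero.
all_zero = True
for key, blob in keys2blob.items():
    if 'bias' not in key:
        continue
    tensor = torch.load(blob, map_location='cpu')[key]
    if not torch.all(tensor == 0):
        all_zero = False
        print(f"Non-zero bias: {key}")

print(f"All biases zero: {all_zero}")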

Jackmin801 commented 1 year ago

I've checked the other model checkpoints. All of them have bias tensors that are all zero, except 30b, which has no bias tensors at all.

This is the info I have about the checkpoints so far, with some notable differences in the 30b checkpoint:

| Size | Parameters | Disk usage | Bytes / parameter | Sum(layer.numels) | Tensor dtypes (count) |
|------|------------|------------|-------------------|-------------------|-----------------------|
| mini | 125 M | 480M | 4.0265 | 163,430,400 | {torch.float32: 197} |
| base | 1.3 B | 5.0G | 4.1298 | 1,417,601,024 | {torch.float32: 389} |
| standard | 6.7 B | 26G | 4.1667 | 6,862,159,872 | {torch.float32: 517} |
| large | 30 B | 56G | 2.0043 | 29,968,103,424 | {torch.float16: 290} |
| huge | 120 B | 453G | 4.0534 | 121,853,747,200 | {torch.float32: 1541} |
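The 30b row stands out at roughly 2 bytes per parameter instead of roughly 4, which matches float16 storage: 29,968,103,424 parameters × 2 bytes ≈ 59.9 GB ≈ 55.8 GiB, consistent with the 56G du total above. A minimal sketch of how these per-checkpoint numbers can be computed, using the same blob-loading pattern as above (the summarize helper is hypothetical, not from the original scripts):

import pickle
from collections import Counter
from pathlib import Path
import torch

def summarize(blob_dir):
    """Tally parameter count, bytes per parameter, and tensor dtypes for one cached checkpoint."""
    num_params = 0
    disk_bytes = 0
    dtypes = Counter()
    for blob in blob_dir.glob('./*'):
        if not blob.is_file():
            continue
        disk_bytes += blob.stat().st_size
        try:
            state_dict = torch.load(blob, map_location='cpu')
        except pickle.UnpicklingError:
            continue  # config / tokenizer blobs
        for tensor in state_dict.values():
            if torch.is_tensor(tensor):
                num_params += tensor.numel()
                dtypes[tensor.dtype] += 1
    return num_params, disk_bytes / num_params, dict(dtypes)

print(summarize(Path.home() / '.cache/huggingface/hub/models--facebook--galactica-30b/blobs'))
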
mkardas commented 1 year ago

The checkpoint is now fixed as part of https://huggingface.co/facebook/galactica-30b/discussions/6. All the checkpoints are now fully compatible with the OPT architecture and use float16 weights, with layer norm weights set to ones and all biases set to zero.
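For reference, a minimal sketch of loading the repaired checkpoint through transformers, assuming the standard OPTForCausalLM path (the 30b model needs roughly 60 GB of memory in float16; device placement is omitted):

import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
# The fixed checkpoint ships float16 weights, so load it in float16 directly.
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", torch_dtype=torch.float16)

inputs = tokenizer("The Transformer architecture", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))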