stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

fix: properly load config from huggingface #284

Closed bclavie closed 9 months ago

bclavie commented 9 months ago

@okhat Currently there's a pretty major usability issue with loading the config from Hugging Face:

When loading a pre-trained model, the config is fetched like this (from BaseColBERT for this code example):

self.colbert_config = ColBERTConfig.from_existing(ColBERTConfig.load_from_checkpoint(name_or_path), colbert_config)

Which calls this function:

    @classmethod
    def load_from_checkpoint(cls, checkpoint_path):
        if checkpoint_path.endswith('.dnn'):
            dnn = torch_load_dnn(checkpoint_path)
            config, _ = cls.from_deprecated_args(dnn.get('arguments', {}))

            # TODO: FIXME: Decide if the line below will have any unintended consequences. We don't want to overwrite those!
            config.set('checkpoint', checkpoint_path)

            return config

        loaded_config_path = os.path.join(checkpoint_path, 'artifact.metadata')
        if os.path.exists(loaded_config_path):
            loaded_config, _ = cls.from_path(loaded_config_path)
            loaded_config.set('checkpoint', checkpoint_path)

            return loaded_config

        # (falls through and implicitly returns None when neither branch matches)
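To make that fall-through concrete, here's a minimal illustration replaying the two branch conditions for a hub id (the repo id is real; the snippet is just for demonstration):

    import os

    checkpoint_path = "colbert-ir/colbertv2.0"  # a hub id, not a local path

    # Branch 1 does not fire: this is not a legacy .dnn checkpoint
    print(checkpoint_path.endswith('.dnn'))  # False

    # Branch 2 does not fire either: nothing has been downloaded yet,
    # so there is no local artifact.metadata under that path
    print(os.path.exists(os.path.join(checkpoint_path, 'artifact.metadata')))  # False

    # -> load_from_checkpoint returns None and the caller falls back to defaults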

However, at this point the model's files haven't been fetched from Hugging Face, so artifact.metadata doesn't exist yet, and the resulting all-defaults config is then passed to HF_ColBERT to generate the model. This results in the following situation:

In [1]: from colbert.modeling.base_colbert import BaseColBERT

In [2]: model = BaseColBERT("colbert-ir/colbertv2.0")

In [3]: model.colbert_config
Out[3]: ColBERTConfig(query_token_id='[unused0]', doc_token_id='[unused1]', query_token='[Q]', doc_token='[D]', ncells=None, centroid_score_threshold=None, ndocs=None, load_index_with_mmap=False, index_path=None, nbits=1, kmeans_niters=4, resume=False, similarity='cosine', bsize=32, accumsteps=1, lr=3e-06, maxsteps=500000, save_every=None, warmup=None, warmup_bert=None, relu=False, nway=2, use_ib_negatives=False, reranker=False, distillation_alpha=1.0, ignore_scores=False, model_name=None, query_maxlen=32, attend_to_mask_tokens=False, interaction='colbert', dim=128, doc_maxlen=220, mask_punctuation=True, checkpoint=None, triples=None, collection=None, queries=None, index_name=None, overwrite=False, root='/Users/bclavie/experiments', experiment='default', index_root=None, name='2023-12/30/12.20.33', rank=0, nranks=1, amp=True, gpus=0)

Here the config is all default values, not the config of the model actually downloaded from the hub, which can be a fairly big problem (e.g. JaColBERT uses different query_token/doc_token values).

An in-depth fix could be to have a proper Hugging Face config class (with ColBERTConfig inheriting from it) and to fetch the config with AutoConfig before initialising the model, but that requires more extensive modifications to the codebase.
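A rough sketch of what that direction could look like. This is hypothetical: the class name and field subset are made up, and it assumes hub checkpoints would ship a config.json declaring a matching model_type (existing ColBERT checkpoints declare a plain BERT one), so it is not a drop-in change:

    from transformers import AutoConfig, PretrainedConfig

    class ColBERTHFConfig(PretrainedConfig):
        # Hypothetical registration key; existing checkpoints would need
        # their config.json updated to declare model_type "colbert".
        model_type = "colbert"

        def __init__(self, query_token="[Q]", doc_token="[D]", dim=128,
                     query_maxlen=32, doc_maxlen=220, **kwargs):
            super().__init__(**kwargs)
            self.query_token = query_token
            self.doc_token = doc_token
            self.dim = dim
            self.query_maxlen = query_maxlen
            self.doc_maxlen = doc_maxlen

    # Registering the class lets AutoConfig resolve and download the config
    # from the hub *before* the model weights are instantiated:
    AutoConfig.register("colbert", ColBERTHFConfig)
    # config = AutoConfig.from_pretrained("some-org/some-colbert-checkpoint")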

I have implemented a quick fix in the library I'm working on and am upstreaming it here. When loading a pretrained config, we now attempt to manually download the artifact.metadata file:

        try:
            checkpoint_path = hf_hub_download(
                repo_id=checkpoint_path, filename="artifact.metadata"
            ).split("artifact")[0]
        except RepositoryNotFoundError:
            pass
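For reference, the snippet relies on two imports from huggingface_hub (import paths as in recent versions of the library):

    from huggingface_hub import hf_hub_download
    from huggingface_hub.utils import RepositoryNotFoundError

(One caveat: depending on the huggingface_hub version, a local filesystem path may fail repo-id validation with HFValidationError rather than RepositoryNotFoundError, which may also be worth catching.)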

This is a completely transparent quick fix -- if you've passed a local path to the model loader, the download attempt simply fails and nothing is modified -- but it ensures we properly load the config from the hub, making models easier to share (see the output post-fix below):

In [1]: from colbert.modeling.base_colbert import BaseColBERT

In [2]: model = BaseColBERT("colbert-ir/colbertv2.0")
In [3]: model.colbert_config
Out[3]: ColBERTConfig(query_token_id='[unused0]', doc_token_id='[unused1]', query_token='[Q]', doc_token='[D]', ncells=None, centroid_score_threshold=None, ndocs=None, load_index_with_mmap=False, index_path=None, nbits=1, kmeans_niters=20, resume=False, similarity='cosine', bsize=8, accumsteps=1, lr=1e-05, maxsteps=400000, save_every=None, warmup=20000, warmup_bert=None, relu=False, nway=64, use_ib_negatives=True, reranker=False, distillation_alpha=1.0, ignore_scores=False, model_name=None, query_maxlen=32, attend_to_mask_tokens=False, interaction='colbert', dim=128, doc_maxlen=180, mask_punctuation=True, checkpoint='/Users/bclavie/.cache/huggingface/hub/models--colbert-ir--colbertv2.0/snapshots/051f6791624c62edf834cf07edd10563ae17f579/', triples='/future/u/okhattab/root/unit/experiments/2021.10/downstream.distillation.round2.2_score/round2.nway6.cosine.ib/examples.64.json', collection='/future/u/okhattab/data/MSMARCO/collection.tsv', queries='/future/u/okhattab/data/MSMARCO/queries.train.tsv', index_name=None, overwrite=False, root='/future/u/okhattab/root/unit/experiments', experiment='2021.10', index_root=None, name='kldR2.nway64.ib', rank=0, nranks=4, amp=True, gpus=8)
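A quick way to confirm the fix took effect is to check a few values that are visible in both outputs above (a hypothetical sanity check, not part of the PR):

    # These values come from colbertv2.0's artifact.metadata and differ
    # from the library defaults shown in the pre-fix output:
    assert model.colbert_config.doc_maxlen == 180    # default was 220
    assert model.colbert_config.nway == 64           # default was 2
    assert model.colbert_config.kmeans_niters == 20  # default was 4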

(apologies for the noisy formatting commit -- I thought I'd set up my formatter properly for the repo, but I was keen to get this fix pushed ASAP)

okhat commented 9 months ago

Thanks so much @bclavie! This makes a lot of sense. This code never considered downloading models from the hub.