instadeepai / nucleotide-transformer

đŸ§¬ Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2

Issues with embedding with 2B5 models #4

Closed · frederikkemarin closed 1 year ago

frederikkemarin commented 1 year ago

Hello, I am trying to use the provided code to embed a sequence (I am simply copy-pasting the code from the README). With the two smaller models I have no issue, but both 2B5 models fail, each with a different error.
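For reference, this is essentially the snippet I am running. The lines up to `hk.transform` appear in the tracebacks below; the tokenization and inference lines that follow are reproduced from the README quick-start as best I can here, with `sequences` being placeholder inputs of my own:

```python
import haiku as hk
import jax
import jax.numpy as jnp
from nucleotide_transformer.pretrained import get_pretrained_model

# Get pretrained model
parameters, forward_fn, tokenizer, config = get_pretrained_model(
    model_name="2B5_1000G",
    mixed_precision=False,
    embeddings_layers_to_save=(20,),
    max_positions=32,
)
forward_fn = hk.transform(forward_fn)

# Get data and tokenize it (placeholder sequences)
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = [b[1] for b in tokenizer.batch_tokenize(sequences)]
tokens = jnp.asarray(tokens_ids, dtype=jnp.int32)

# Initialize a random key and run inference
random_key = jax.random.PRNGKey(0)
outs = forward_fn.apply(parameters, random_key, tokens)

# Embeddings from layer 20, as requested via embeddings_layers_to_save
embeddings = outs["embeddings_20"]
```

2B5_1000G: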

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
Input In [31], in <cell line: 7>()
      4 from nucleotide_transformer.pretrained import get_pretrained_model
      6 # Get pretrained model
----> 7 parameters, forward_fn, tokenizer, config = get_pretrained_model(
      8     model_name="2B5_1000G",
      9     mixed_precision=False,
     10     embeddings_layers_to_save=(20,),
     11     max_positions=32,
     12 )
     13 forward_fn = hk.transform(forward_fn)
     15 # Get data and tokenize it

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/nucleotide_transformer/pretrained.py:197, in get_pretrained_model(model_name, mixed_precision, embeddings_layers_to_save, attention_maps_to_save, max_positions)
    192     raise NotImplementedError(
    193         f"Unknown {model_name} model. " f"Supported models are {supported_models}"
    194     )
    196 # Download weights and hyperparams
--> 197 parameters, hyperparams = download_ckpt_and_hyperparams(model_name)
    199 tokenizer = FixedSizeNucleotidesKmersTokenizer(
    200     k_mers=hyperparams["k_for_kmers"],
    201     fixed_length=max_positions,
    202     prepend_cls_token=True,
    203 )
    205 # Get config

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/nucleotide_transformer/pretrained.py:99, in download_ckpt_and_hyperparams(model_name)
     96         hyperparams = json.load(f)
     98     with open(params_save_dir, "rb") as f:
---> 99         params = joblib.load(f)
    101     return params, hyperparams
    103 else:

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle.py:648, in load(filename, mmap_mode)
    646     filename = getattr(fobj, 'name', '')
    647     with _read_fileobject(fobj, filename, mmap_mode) as fobj:
--> 648         obj = _unpickle(fobj)
    649 else:
    650     with open(filename, 'rb') as f:

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle.py:577, in _unpickle(fobj, filename, mmap_mode)
    575 obj = None
    576 try:
--> 577     obj = unpickler.load()
    578     if unpickler.compat_mode:
    579         warnings.warn("The file '%s' has been generated with a "
    580                       "joblib version less than 0.10. "
    581                       "Please regenerate this pickle file."
    582                       % filename,
    583                       DeprecationWarning, stacklevel=3)

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/pickle.py:1213, in _Unpickler.load(self)
   1211             raise EOFError
   1212         assert isinstance(key, bytes_types)
-> 1213         dispatch[key[0]](self)
   1214 except _Stop as stopinst:
   1215     return stopinst.value

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle.py:415, in NumpyUnpickler.load_build(self)
    413 if isinstance(array_wrapper, NDArrayWrapper):
    414     self.compat_mode = True
--> 415 self.stack.append(array_wrapper.read(self))

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle.py:252, in NumpyArrayWrapper.read(self, unpickler)
    250     array = self.read_mmap(unpickler)
    251 else:
--> 252     array = self.read_array(unpickler)
    254 # Manage array subclass case
    255 if (hasattr(array, '__array_prepare__') and
    256     self.subclass not in (unpickler.np.ndarray,
    257                           unpickler.np.memmap)):
    258     # We need to reconstruct another subclass

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle.py:177, in NumpyArrayWrapper.read_array(self, unpickler)
    175 read_count = min(max_read_count, count - i)
    176 read_size = int(read_count * self.dtype.itemsize)
--> 177 data = _read_bytes(unpickler.file_handle,
    178                    read_size, "array data")
    179 array[i:i + read_count] = \
    180     unpickler.np.frombuffer(data, dtype=self.dtype,
    181                             count=read_count)
    182 del data

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle_utils.py:243, in _read_bytes(fp, size, error_template)
    238 while True:
    239     # io files (default in python3) return None or raise on
    240     # would-block, python2 file will truncate, probably nothing can be
    241     # done about that.  note that regular files can't be non-blocking
    242     try:
--> 243         r = fp.read(size - len(data))
    244         data += r
    245         if len(r) == 0 or len(data) == size:

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/compressor.py:464, in BinaryZlibFile.readinto(self, b)
    459 """Read up to len(b) bytes into b.
    460 
    461 Returns the number of bytes read (0 for EOF).
    462 """
    463 with self._lock:
--> 464     return io.BufferedIOBase.readinto(self, b)

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/compressor.py:456, in BinaryZlibFile.read(self, size)
    454     return self._read_all()
    455 else:
--> 456     return self._read_block(size)

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/compressor.py:429, in BinaryZlibFile._read_block(self, n_bytes, return_data)
    426 self._buffer_offset = 0
    428 blocks = []
--> 429 while n_bytes > 0 and self._fill_buffer():
    430     if n_bytes < len(self._buffer):
    431         data = self._buffer[:n_bytes]

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/compressor.py:393, in BinaryZlibFile._fill_buffer(self)
    391         return False
    392     else:
--> 393         self._buffer = self._decompressor.decompress(rawblock)
    394     self._buffer_offset = 0
    395 return True

error: Error -3 while decompressing data: invalid code lengths set

2B5_multi_species:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [30], in <cell line: 7>()
      4 from nucleotide_transformer.pretrained import get_pretrained_model
      6 # Get pretrained model
----> 7 parameters, forward_fn, tokenizer, config = get_pretrained_model(
      8     model_name="2B5_multi_species",
      9     mixed_precision=False,
     10     embeddings_layers_to_save=(20,),
     11     max_positions=32,
     12 )
     13 forward_fn = hk.transform(forward_fn)
     15 # Get data and tokenize it

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/nucleotide_transformer/pretrained.py:197, in get_pretrained_model(model_name, mixed_precision, embeddings_layers_to_save, attention_maps_to_save, max_positions)
    192     raise NotImplementedError(
    193         f"Unknown {model_name} model. " f"Supported models are {supported_models}"
    194     )
    196 # Download weights and hyperparams
--> 197 parameters, hyperparams = download_ckpt_and_hyperparams(model_name)
    199 tokenizer = FixedSizeNucleotidesKmersTokenizer(
    200     k_mers=hyperparams["k_for_kmers"],
    201     fixed_length=max_positions,
    202     prepend_cls_token=True,
    203 )
    205 # Get config

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/nucleotide_transformer/pretrained.py:99, in download_ckpt_and_hyperparams(model_name)
     96         hyperparams = json.load(f)
     98     with open(params_save_dir, "rb") as f:
---> 99         params = joblib.load(f)
    101     return params, hyperparams
    103 else:

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle.py:648, in load(filename, mmap_mode)
    646     filename = getattr(fobj, 'name', '')
    647     with _read_fileobject(fobj, filename, mmap_mode) as fobj:
--> 648         obj = _unpickle(fobj)
    649 else:
    650     with open(filename, 'rb') as f:

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/site-packages/joblib/numpy_pickle.py:577, in _unpickle(fobj, filename, mmap_mode)
    575 obj = None
    576 try:
--> 577     obj = unpickler.load()
    578     if unpickler.compat_mode:
    579         warnings.warn("The file '%s' has been generated with a "
    580                       "joblib version less than 0.10. "
    581                       "Please regenerate this pickle file."
    582                       % filename,
    583                       DeprecationWarning, stacklevel=3)

File ~/lib/software/miniconda3/envs/torch-p310/lib/python3.10/pickle.py:1213, in _Unpickler.load(self)
   1211             raise EOFError
   1212         assert isinstance(key, bytes_types)
-> 1213         dispatch[key[0]](self)
   1214 except _Stop as stopinst:
   1215     return stopinst.value

KeyError: 188
adsodemelk commented 1 year ago

Which architecture are you running on, @frederikkemarin?

frederikkemarin commented 1 year ago

I've tried both 2B5 models on GPU.

adsodemelk commented 1 year ago

I was curious whether you are using Windows, macOS, or Linux, in case it is a library version issue. I also suggest deleting the cache where the models are stored (for example root/.cache/nucleotide_transformer/2B5_multi_species/) and downloading again, in case the file is somehow corrupted.
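Something like this should do it (a minimal sketch; I am assuming the default cache location under the home directory, so adjust `cache_dir` to wherever the checkpoint actually lives on your system):

```python
import shutil
from pathlib import Path

# Assumed default cache location; adjust if your checkpoints are
# stored elsewhere (e.g. under /root inside a container).
cache_dir = Path.home() / ".cache" / "nucleotide_transformer" / "2B5_multi_species"

if cache_dir.exists():
    shutil.rmtree(cache_dir)  # drop the possibly corrupted checkpoint
    print(f"Removed {cache_dir}")

# The next call to get_pretrained_model(model_name="2B5_multi_species", ...)
# will re-download the weights and hyperparameters from scratch.
```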

frederikkemarin commented 1 year ago

Ah, okay. I am using Linux. Deleting and re-downloading resolved the issue.

ranzenTom commented 1 year ago

Great! Closing the issue as it has been resolved.