sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Weird logic in T5? #122

Closed sacdallago closed 3 years ago

sacdallago commented 3 years ago

If I initialize the model with decoder=True, the decoder is not, in fact, on: https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/embed/prottrans_t5_embedder.py#L42 .

I'd expect the opposite to happen: I set `decoder` to `False`, and thus the decoder won't be loaded; the default would be `True`.
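
For illustration, a minimal sketch of the semantics I'd expect (the embedder class here is hypothetical; `T5Model` and `T5EncoderModel` are the transformers classes):

```python
from transformers import T5Model, T5EncoderModel

class ExpectedT5Embedder:
    """Hypothetical sketch: decoder=False (the default) loads only the encoder."""

    def __init__(self, model_directory: str, decoder: bool = False):
        if decoder:
            # Full model: encoder + decoder weights are loaded
            self._model = T5Model.from_pretrained(model_directory)
        else:
            # Encoder only: smaller memory footprint, faster
            self._model = T5EncoderModel.from_pretrained(model_directory)
```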

sacdallago commented 3 years ago

P.S.: since I'm working on something bigger, I can include this in my next PR if you agree that this is weird and should rather be the other way around :)

mheinzinger commented 3 years ago

This is indeed weird, and I'm not sure whether I'm just missing something in @konstin's logic, but the short version is:

When using the decoder, the model call looks as follows:

```python
with torch.no_grad():
    embedding_repr = model(input_ids, attention_mask=attention_mask, decoder_input_ids=input_ids)
```

And the embedding extraction is:

```python
# first element of the tuple is the decoder tensor of shape [batch_size, seq_len, embedding_dimension]
decoder_emb = embedding_repr[0][batch_idx, :s_len]
```
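
Put together, a self-contained version of that snippet might look like this (assuming the transformers library; the checkpoint id is illustrative):

```python
import torch
from transformers import T5Tokenizer, T5Model

model_name = "Rostlab/prot_t5_xl_bfd"  # illustrative checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5Model.from_pretrained(model_name).eval()

# ProtTrans models expect space-separated residues
encoded = tokenizer("P R O T E I N", return_tensors="pt")
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]

with torch.no_grad():
    # Passing decoder_input_ids runs the decoder on top of the encoder
    embedding_repr = model(
        input_ids, attention_mask=attention_mask, decoder_input_ids=input_ids
    )

batch_idx, s_len = 0, 7  # first sequence in the batch, 7 residues
decoder_emb = embedding_repr[0][batch_idx, :s_len]  # [seq_len, embedding_dimension]
```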

sacdallago commented 3 years ago

Logging: seems like `decoder: True` is not a recognized option in the pipeline.

[screenshot: pipeline log showing the unrecognized `decoder` option warning]

mheinzinger commented 3 years ago

Fair; I guess that's mostly down to a) slightly different embedding logic when using the decoder (= more work/a more complicated code structure) and b) me telling @konstin that the decoder is useless anyway (= potentially wasted time if no one uses it).

sacdallago commented 3 years ago

Also, seems to be relevant:

```
2021-03-29 11:10:00,174 INFO Created the file test/prottrans_t5/reduced_embeddings_file.h5
  0%|          | 0/1 [00:00<?, ?it/s]  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 280, in execute_pipeline_from_config
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 400, in run
    return embed_and_write_batched(embedder, file_manager, result_kwargs, kwargs.get("half_precision", False))
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 231, in embed_and_write_batched
    for sequence_id, original_id, embedding in zip(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interfaces.py", line 122, in embed_many
    yield self.embed(seq)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/prottrans_embedder.py", line 121, in embed
    [embedding] = self.embed_batch([sequence])
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interfaces.py", line 172, in embed_batch
    yield from self._embed_batch_impl(batch, self._model)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/prottrans_embedder.py", line 85, in _embed_batch_impl
    embeddings = model(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'decoder_input_ids'
```
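
The error is consistent with the pipeline having loaded the encoder-only model: presumably `T5EncoderModel.forward` has no `decoder_input_ids` parameter, while `T5Model.forward` does. A minimal sketch of the mismatch (transformers API; the checkpoint id is illustrative):

```python
import torch
from transformers import T5EncoderModel

model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_bfd")  # illustrative
input_ids = torch.tensor([[1, 2, 3]])

# Works: the encoder-only forward takes input_ids / attention_mask
model(input_ids)

# Raises TypeError: forward() got an unexpected keyword argument 'decoder_input_ids'
model(input_ids, decoder_input_ids=input_ids)
```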

Config used:

```yaml
global:
  sequences_file: sequences.fasta
  prefix: test
prottrans_t5:
  type: embed
  protocol: prottrans_t5_bfd
  reduce: True
  discard_per_amino_acid_embeddings: True
  decoder: True
  model_directory: /mnt/project/bio_embeddings/models/lms/t5
```
sacdallago commented 3 years ago

@mheinzinger well, to me (and I guess I'm in the loop :) so imagine external users) it's super confusing: mixed-up logic, little understanding of what's going on and why, and three options:

  1. ~decoder~ encoder only?
  2. Half model?
  3. Half precision?

I'm upgrading this to a bug, given the error above.

mheinzinger commented 3 years ago

Yup, that's why I said "slightly different embedding logic when using the decoder" ;) If you want to use the decoder (again: no idea why you should):

```python
model = T5Model()
with torch.no_grad():
    embedding_repr = model(input_ids, attention_mask=attention_mask, decoder_input_ids=input_ids)
# first element of the tuple is the decoder tensor of shape [batch_size, seq_len, embedding_dimension]
decoder_emb = embedding_repr[0][batch_idx, :s_len]
```

Decoder-only? - Doesn't exist. If you want to use the decoder, you always have to compute both sides (encoder first, then decoder). If you set `decoder=True`, the result could be that you only get decoder embeddings, but internally you would always have to compute both sides.

Half-model? - Instead of using the T5Model you can use the T5EncoderModel. That will only load the weights of the encoder (lower memory requirements, faster processing).

Half precision? - Loads the model weights in fp16 instead of fp32. This speeds up processing and lowers memory requirements (you can embed longer sequences).
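
A compact sketch of these three options in transformers terms (the checkpoint id is illustrative; "half model" here in the encoder-only sense used above):

```python
from transformers import T5Model, T5EncoderModel

checkpoint = "Rostlab/prot_t5_xl_bfd"  # illustrative

# Decoder embeddings: need the full model; the encoder always runs first
full_model = T5Model.from_pretrained(checkpoint)

# Encoder only: loads only the encoder weights
# (lower memory requirements, faster processing)
encoder_only = T5EncoderModel.from_pretrained(checkpoint)

# Half precision: weights in fp16 instead of fp32
# (faster, lower memory, longer sequences fit)
encoder_only_fp16 = T5EncoderModel.from_pretrained(checkpoint).half()
```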

Finally: I think it's easier for most users to just give them a single option (T5-EncoderModel with fp16) so I would not even give them any other option.

sacdallago commented 3 years ago

> Finally: I think it's easier for most users to just give them a single option (T5-EncoderModel with fp16) so I would not even give them any other option.

On this one I disagree, since I want to make no assumptions about what users want.

> Half-model? - Instead of using the T5Model you can use the T5EncoderModel. That will only load the weights of the encoder (lower memory requirements, faster processing).

Ah, here you're hitting the nail on the head. Currently, setting `decoder=True` will use T5EncoderModel. So you can see why I say this is extremely confusing. There's no semantic meaning to "half model": which half? Sure, you know, because you're an ML expert: but what about a biologist?

I'd prefer either an `encoder_only` option that defaults to `True`, or a `decoder` option that defaults to `False`; both lead to the model only using the encoder. No "half models".

> Half precision? - Loads the model weights in fp16 instead of fp32. This speeds up processing and lowers memory requirements (you can embed longer sequences).

Your "half precision" is rather "half model" in @konstin eyes: https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/embed/prottrans_t5_embedder.py#L51 . And I agree that this is the better nomenclature, in this case.

Half precision: for me this is external to the embedder; it's about how the embeddings get stored. You can compute the embeddings in half precision (using "half model") but store them in full precision (because you're a masochist, what do I know :D ).
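
As a sketch of that separation, compute in fp16 but store in fp32 (h5py assumed; the file and dataset names are illustrative):

```python
import h5py
import numpy as np
import torch

# Stand-in for an embedding computed by a model run with .half(): fp16
embedding = torch.randn(7, 1024, dtype=torch.float16)

# Storage precision is a separate decision: upcast to fp32 before writing
with h5py.File("embeddings.h5", "w") as f:
    f.create_dataset("PROTEIN", data=embedding.numpy().astype(np.float32))
```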

sacdallago commented 3 years ago

Maybe "half model" for t5 is an ambiguous name in either case, given that it can be "half precision" or "only one half (decoder/encoder) of the model". Maybe a better name for what we are trying to achieve in that case is half_precision_model. Opinions?

konstin commented 3 years ago

I'm afraid I messed up the options, and `decoder` meant the inverse of what it should. Sorry for all the confusion this caused.

I've made a merge request to fix that and ensure that decoder=False actually means only encoder, along with some other stuff that came up when fixing that.

I have manually confirmed that both `ProtTransT5UniRef50Embedder(decoder=False).embed("PROTEIN")` and `ProtTransT5UniRef50Embedder(decoder=True).embed("PROTEIN")` work and give different results. The output of `ProtTransT5UniRef50Embedder().embed("PROTEIN")` remains unchanged (the output of the encoder), though we now don't load or compute the decoder anymore.
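
That check, as a runnable sketch (`ProtTransT5UniRef50Embedder` is the actual bio_embeddings class; the assertions are mine):

```python
import numpy as np
from bio_embeddings.embed import ProtTransT5UniRef50Embedder

encoder_emb = ProtTransT5UniRef50Embedder(decoder=False).embed("PROTEIN")
decoder_emb = ProtTransT5UniRef50Embedder(decoder=True).embed("PROTEIN")
default_emb = ProtTransT5UniRef50Embedder().embed("PROTEIN")

# Encoder and decoder embeddings differ; the default matches decoder=False
assert not np.allclose(encoder_emb, decoder_emb)
assert np.allclose(default_emb, encoder_emb)
```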


> P.S.: since I'm working on something bigger, I can include this in my next PR if you agree that this is weird and should rather be the other way around :)

I fixed this separately now since I also had to fix the fallback model and a very verbose warning. I hope this doesn't cause any merge conflicts.

> Logging: seems like `decoder: True` is not a recognized option in the pipeline.

> TypeError: forward() got an unexpected keyword argument 'decoder_input_ids'

Sorry, both are fixed in the merge request.

> Half-model? - Instead of using the T5Model you can use the T5EncoderModel. That will only load the weights of the encoder (lower memory requirements, faster processing).

No, half model is `model.half()`, i.e. use fp16 to save memory!

Maybe "half model" for t5 is an ambiguous name in either case, given that it can be "half precision" or "only one half (decoder/encoder) of the model". Maybe a better name for what we are trying to achieve in that case is half_precision_model. Opinions?

I see how `half_precision_model` would have avoided Michael's confusion. However, in general I expect that no matter what name we choose, it will always be confusing to users that there are two slightly different ways to get fp16 embeddings, and I think that can only be tackled with verbose docs.

I tried to draw a picture that better explains what all those half options do. I think we should add something similar to the docs:

```
 ----------------  convert to number   ------------------               ---------
| sequence input | ---------------->  | embedder weights | ----------> | h5 file |
 ----------------                      ------------------               ---------
```

default:

```
 ----------------   convert to fp32    ------------------               ---------
|     fasta      | ---------------->  |   fp32           | ----------> |   fp32  |
 ----------------                      ------------------               ---------
```

half_precision:

```
 ----------------   convert to fp32    ------------------  "rounding"   ---------
|     fasta      | ---------------->  |   fp32           | ----------> |   fp16  |
 ----------------                      ------------------               ---------
```

half_model + half_precision (use this with T5):

```
 ----------------   convert to fp16    ------------------               ---------
|     fasta      | ---------------->  |   fp16           | ----------> |   fp16  |
 ----------------                      ------------------               ---------
```

half_model (don't use this without half_precision, it doesn't make sense):

```
 ----------------   convert to fp16    ------------------  "upscaling"  ---------
|     fasta      | ---------------->  |   fp16           | ----------> |   fp32  |
 ----------------                      ------------------               ---------
```
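
One way to read the picture in code (a hedged sketch, not the actual pipeline; the option names follow the diagram):

```python
import torch

def run_embedder(model, input_ids, half_model=False, half_precision=False):
    """Sketch of the four paths above."""
    if half_model:
        model = model.half()  # weights and computation in fp16
    with torch.no_grad():
        emb = model(input_ids)[0]
    # Storage precision: "rounding" down to fp16 or "upscaling" back to fp32
    return emb.half() if half_precision else emb.float()
```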
sacdallago commented 3 years ago

I'm thinking of a way we can include this picture somewhere...

sacdallago commented 3 years ago

Solved. Maybe we can add the ASCII figure above in the doc revamp #125.