jinhojsk515 / spmm

Multimodal learning for the chemical domain, with SMILES and properties.
Apache License 2.0

The tokenizer failed to tokenize a SMILES string #2

Open zhaisilong opened 2 months ago

zhaisilong commented 2 months ago

SPMM_models.py:

```python
print(text_input.input_ids[:4], prop[:4], text_input.input_ids.shape)
exit()
```

Output:

```
tensor([[2, 2, 1, 3],
        [2, 2, 1, 3],
        [2, 2, 1, 3],
        [2, 2, 1, 3]], device='cuda:0')
tensor([[ 2.3104, -1.5346, -1.2431, -1.0015, -1.1336, -1.4145, -1.1286, -1.2634,
         -0.8863, -0.7488, -1.5606, -0.3628, -1.3415, -0.0982, -1.4446,  0.8066,
          0.3274, -0.2114,  1.6937,  1.5113, -1.3633, -1.5179, -0.8140, -0.4515,
          1.9547, -1.3326, -1.9357, -1.9357, -0.3264,  0.5493,  0.1936, -1.1924,
         -1.4456, -1.1348, -1.2925, -0.4183, -0.7510, -0.8534, -1.1132, -0.8680,
         -1.4712, -0.9739, -1.1667, -1.5777, -0.0343,  0.1643, -0.3701, -0.6282,
         -0.7153, -1.2057, -1.8758, -1.4532, -1.2087],
        [ 0.3362, -0.9400, -0.9879, -0.9637, -1.0960, -0.9825, -1.0188, -1.1606,...........
```
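One way to read those repeated `[2, 2, 1, 3]` rows is to map the ids back to tokens with the same vocabulary; a minimal sketch (not part of the original report), assuming the `vocab_bpe_300.txt` setup shown in the next comment:

```python
from transformers import BertTokenizer

# Sketch: assumes the same vocab_bpe_300.txt used in the reproduction below.
tokenizer = BertTokenizer(vocab_file="vocab_bpe_300.txt", do_lower_case=False, do_basic_tokenize=False)

# Convert the repeated ids from the debug output back to their tokens.
print(tokenizer.convert_ids_to_tokens([2, 2, 1, 3]))
# If the vocab uses the common [PAD]/[UNK]/[CLS]/[SEP] ordering, id 1 is [UNK],
# i.e. every SMILES in the batch collapsed to a single unknown token.
```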

zhaisilong commented 2 months ago

```python
>>> from transformers import BertTokenizer, WordpieceTokenizer
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> tokenizer = BertTokenizer(vocab_file="vocab_bpe_300.txt", do_lower_case=False, do_basic_tokenize=False)
>>> tokenizer.wordpiece_tokenizer = WordpieceTokenizer(vocab=tokenizer.vocab, unk_token=tokenizer.unk_token, max_input_chars_per_word=250)
>>> tokenizer("CC(NC(=O)C(=O)NCCCCC#N)c1cccc(C(F)(F)F)c1")
{'input_ids': [2, 1, 3], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}
```
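Since the whole SMILES comes back as a single id between what look like the [CLS] and [SEP] ids, one follow-up check (a sketch under the same `vocab_bpe_300.txt` assumption, not part of the original report) is to list which characters of the SMILES have no single-character entry in the vocabulary; HuggingFace's WordPiece emits the unk_token for the whole word as soon as one piece cannot be matched:

```python
from transformers import BertTokenizer, WordpieceTokenizer

tokenizer = BertTokenizer(vocab_file="vocab_bpe_300.txt", do_lower_case=False, do_basic_tokenize=False)
tokenizer.wordpiece_tokenizer = WordpieceTokenizer(vocab=tokenizer.vocab,
                                                   unk_token=tokenizer.unk_token,
                                                   max_input_chars_per_word=250)

smiles = "CC(NC(=O)C(=O)NCCCCC#N)c1cccc(C(F)(F)F)c1"

# WordPiece prefixes non-initial sub-tokens with "##", so check both forms;
# characters with neither entry are a likely cause of the [UNK] collapse.
missing = {c for c in set(smiles) if c not in tokenizer.vocab and f"##{c}" not in tokenizer.vocab}
print("characters with no vocab entry:", missing)

# Raw WordPiece output for the SMILES: expected to be ['[UNK]'] if anything is missing.
print(tokenizer.wordpiece_tokenizer.tokenize(smiles))
```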