megagonlabs / ditto

Code for the paper "Deep Entity Matching with Pre-trained Language Models"
Apache License 2.0

Error when using --summarize with matcher.py #15

Open s-waitz opened 3 years ago

s-waitz commented 3 years ago

Hi,

In your README it says that the --summarize flag needs to be specified for matcher.py if it was also specified at training time. When I do so, I get the following error:

```
Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "matcher.py", line 242, in <module>
    dk_injector=dk_injector)
  File "matcher.py", line 149, in predict
    pairs.append((to_str(row[0], summarizer, max_len, dk_injector),
  File "matcher.py", line 49, in to_str
    content = summarizer.transform(content, max_len=max_len)
  File "/content/drive/My Drive/Master Thesis/ditto/repo/ditto/ditto/summarize.py", line 75, in transform
    sentA, sentB, label = row.strip().split('\t')
ValueError: not enough values to unpack (expected 3, got 1)
```

Without the --summarize flag, matcher.py runs fine.

Is there any workaround to use matcher.py with summarization?
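For context, the failing line in summarize.py (line 75 in the traceback) unpacks three tab-separated fields, while the "got 1" in the error suggests the string handed to transform at prediction time contains no tab at all. A tolerant version of that unpack (my own sketch, not the repo's fix, and not guaranteed to interact correctly with the rest of transform) would at least not crash on such rows:

```python
# My own sketch, not the repo's code: a tolerant replacement for
#   sentA, sentB, label = row.strip().split('\t')
# It accepts 3 fields (entryA \t entryB \t label), 2 fields (no label),
# or a single serialized entry, which is what "got 1" suggests happens here.
def split_row(row):
    fields = row.strip().split('\t')
    if len(fields) == 3:
        sentA, sentB, label = fields
    elif len(fields) == 2:
        sentA, sentB, label = fields[0], fields[1], ''
    else:
        sentA, sentB, label = fields[0], '', ''
    return sentA, sentB, label

# Example: a single serialized entry with no tab no longer raises ValueError.
print(split_row("COL name VAL apple iphone COL price VAL 999"))
```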

Ribo-Py commented 3 years ago

I have encountered a similar issue.

```
Downloading: 100%|███████████████████████████| 28.0/28.0 [00:00<00:00, 21.7kB/s]
Downloading: 100%|██████████████████████████████| 483/483 [00:00<00:00, 401kB/s]
Downloading: 100%|███████████████████████████| 226k/226k [00:00<00:00, 38.4MB/s]
Downloading: 100%|███████████████████████████| 455k/455k [00:00<00:00, 40.8MB/s]
Downloading: 100%|███████████████████████████| 256M/256M [00:05<00:00, 49.4MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight']

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/apex/amp/_initialize.py:25: UserWarning: An input tensor was not cuda.
  warnings.warn("An input tensor was not cuda.")
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
step: 0, loss: 0.7943093180656433
Traceback (most recent call last):
  File "train_ditto.py", line 92, in <module>
    run_tag, hp)
  File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/ditto.py", line 201, in train
    train_step(train_iter, model, optimizer, scheduler, hp)
  File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/ditto.py", line 123, in train_step
    for i, batch in enumerate(train_iter):
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/dataset.py", line 80, in __getitem__
    left, right = combined.split(' [SEP] ')
ValueError: not enough values to unpack (expected 2, got 1)

Traceback (most recent call last):
  File "matcher.py", line 7, in <module>
    import jsonlines
ModuleNotFoundError: No module named 'jsonlines'
```
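In my case the first failure comes from dataset.py splitting on ' [SEP] ', which, if I read the traceback right, never appears because the underlying line does not have the entryA \t entryB \t label shape the data files are supposed to use. A quick standalone check of the data file (my own sketch; data/train.txt is a placeholder path, not from the repo) helps spot malformed lines before training:

```python
# My own sketch for sanity-checking a Ditto data file before train_ditto.py.
# Each line should contain two serialized entries and a label, tab-separated,
# as far as I understand the expected input format.
path = "data/train.txt"  # hypothetical path, replace with your dataset

with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            print(f"line {lineno}: expected 3 tab-separated fields, got {len(fields)}")
```

The last traceback looks unrelated: matcher.py imports the jsonlines package, which simply needs to be installed (pip install jsonlines).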