pengfei-luo / MIMIC

[KDD 2023] Multi-Grained Multimodal Interaction Network for Entity Linking

cannot reproduce the RichpediaMEL results? #2

Closed zhiweihu1103 closed 1 year ago

zhiweihu1103 commented 1 year ago

Hi, Pengfei. Nice work. I find I cannot reproduce the results on the RichpediaMEL dataset. I used the same yaml as you provided; can you help me? The attachment is the training log. richpediamel.txt

pengfei-luo commented 1 year ago

Hi, Zhiwei. I retrained the model on the RichpediaMEL dataset, and everything seems fine. Based on the training logs you provided, I noticed that the loss appears to be much larger than usual. In my training, after the first epoch, the Train/loss_epoch is around 3.19.

Train/loss_step    epoch    step    Train/loss_epoch
2.700              0        29      -
2.965              0        59      -
2.596              0        89      -
-                  0        97      3.187

Is the issue of not being able to reproduce the results limited to RichpediaMEL, or does it apply to all datasets?

zhiweihu1103 commented 1 year ago

Hi, Pengfei. The issue is limited to RichpediaMEL only; for the other two datasets I can get results close to the ones reported in the paper.

zhiweihu1103 commented 1 year ago

In addition, I see that many attr fields in the dataset are empty. Is this field not used in the end?

pengfei-luo commented 1 year ago

Hi, Pengfei. The issue is limited to RichpediaMEL only; for the other two datasets I can get results close to the ones reported in the paper.

That's strange. I've checked the MD5 of the files, and they appear to match the ones on my training server. Can you please check the learning rate during training? It seems that after the second epoch, the loss no longer exhibits significant changes.

In addition, I see that many attr fields in the dataset are empty. Is this field not used in the end?

For some entities, I couldn't retrieve suitable attributes from Wikidata (possibly due to a network issue), so I left them blank. In the implementation, the attributes are concatenated with the entity's name.

https://github.com/pengfei-luo/MIMIC/blob/59ef385c14c5bffd70eaf8012f876850f6b99072/codes/utils/dataset.py#L55-L56

zhiweihu1103 commented 1 year ago

I need to print the learning rate after each epoch, right? I also found that the loss did not change much after the second epoch.

zhiweihu1103 commented 1 year ago

Okay, that means attr is not used in the current dataset, right?

pengfei-luo commented 1 year ago

I need to print the learning rate after each epoch, right? I also found that the loss did not change much after the second epoch.

You can log the learning rate without hassle by using PyTorch Lightning's LearningRateMonitor callback; you simply need to add it to the trainer's callbacks.

import os
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor
from codes.utils.functions import setup_parser
from codes.model.lightning_mimic import LightningForMIMIC
from codes.utils.dataset import DataModuleForMIMIC

if __name__ == '__main__':
    args = setup_parser()
    pl.seed_everything(args.seed, workers=True)
    torch.set_num_threads(1)

    data_module = DataModuleForMIMIC(args)
    lightning_model = LightningForMIMIC(args)

    logger = pl.loggers.CSVLogger("./runs", name=args.run_name, flush_logs_every_n_steps=30)

    ckpt_callbacks = ModelCheckpoint(monitor='Val/mrr', save_weights_only=True, mode='max')
    early_stop_callback = EarlyStopping(monitor="Val/mrr", min_delta=0.00, patience=3, verbose=True, mode="max")
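    # LearningRateMonitor logs the current learning rate to the attached logger at every training step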
    lr_callback = LearningRateMonitor(logging_interval='step')

    trainer = pl.Trainer(**args.trainer,
                         deterministic=True, logger=logger, default_root_dir="./runs",
                         callbacks=[ckpt_callbacks, early_stop_callback, lr_callback])

    trainer.fit(lightning_model, datamodule=data_module)
    trainer.test(lightning_model, datamodule=data_module, ckpt_path='best')
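With the CSV logger configured above, the monitored learning rate should then show up as an additional column in the run's metrics.csv, alongside the loss and validation metrics.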

Okay, that means attr is not used in the current dataset, right?

I'm not sure what you mean by "not used." Our intention is to utilize the attributes to enhance the representation of entities. Therefore, we concatenate the flattened key-value attributes with the entity's name as textual input.
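For illustration, here is a minimal sketch of what this flattening and concatenation could look like, assuming attr is stored as a dict of key-value pairs; the function name and separators below are illustrative, not the exact code in codes/utils/dataset.py:

def build_entity_text(entity_name, attr):
    # Flatten key-value attributes, e.g. {"occupation": "singer", "country": "Japan"}
    # -> "occupation is singer. country is Japan."
    flat_attr = ' '.join(f'{k} is {v}.' for k, v in (attr or {}).items())
    # An empty or missing attr leaves just the entity name as the textual input.
    return f'{entity_name} {flat_attr}'.strip()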

zhiweihu1103 commented 1 year ago

I will give feedback this afternoon or evening.

zhiweihu1103 commented 1 year ago

What I mean is that I saw that the attr field is empty, which suggests attr is not used. In the code, I did see that there is indeed a part where attr is concatenated.

pengfei-luo commented 1 year ago

What I mean is that I saw that the attr field is empty, which suggests attr is not used. In the code, I did see that there is indeed a part where attr is concatenated.

No, I did use attributes. However, due to network issues or the absence of suitable attributes, some entities have an empty or missing attr field.

zhiweihu1103 commented 1 year ago

Ok, I understand.

zhiweihu1103 commented 1 year ago

In addition, would you mind providing the Figure 4 datasets (the 10% and 20% splits for RichpediaMEL and WikiDiverse) and the numerical results? I need to draw my own histogram, but I don't know the specific values behind yours.

zhiweihu1103 commented 1 year ago

Hi, Pengfei. I have uploaded the training logs, now with the learning rate logged. Please also take a look at the question I raised above about the Figure 4 datasets and numerical results; looking forward to the discussion. richpediamel_lr.txt metrics.csv

zhiweihu1103 commented 1 year ago

Hi, Pengfei. Any updates?

pengfei-luo commented 1 year ago

Hi, Pengfei. Any updates?

Hi, sorry for the late response. I have reviewed your log file, and the learning rate appears to be fine. I attempted to retrain the model using the code and original data we uploaded, and the loss and evaluation results match our reported findings. Could you please check the configuration file config/richpediamel.yaml to see if there is anything wrong? Could you also provide details about the environment you used to train the model?

If you want to reproduce the reported results right now, I have uploaded a model checkpoint here (password: KDD2023richpedia).

In the low-resource setting, we only utilized the first 10% and 20% of the training data for each dataset, following the order in the training data file. This means that if you want to access the low-resource training data, you only need to control the amount of training data used.

Please add a new line after https://github.com/pengfei-luo/MIMIC/blob/59ef385c14c5bffd70eaf8012f876850f6b99072/codes/utils/dataset.py#L44

train_data = train_data[:int(len(train_data) * 0.1)]  # or 0.2

Then you can obtain either 10% or 20% of the training data we used.

Regarding the numerical results you've requested, I will update them in the readme file in the next few days. Please stay tuned.

zhiweihu1103 commented 1 year ago

Hi, Pengfei. First, I have uploaded the yaml file I used; I did not modify anything except the paths. Second, regarding the running environment, I created a fresh conda environment, and the package versions are exactly the same as your requirements.txt.

run_name: RichpediaMEL
seed: 43
pretrained_model: '/checkpoint/clip-vit-base-patch32'
lr: 1e-5

data:
  num_entity: 160933
  kb_img_folder: /data/RichpediaMEL/kb_image
  mention_img_folder: /data/RichpediaMEL/mention_image
  qid2id: /data/RichpediaMEL/qid2id.json
  entity: /data/RichpediaMEL/kb_entity.json
  train_file: /data/RichpediaMEL/RichpediaMEL_train.json
  dev_file: /data/RichpediaMEL/RichpediaMEL_dev.json
  test_file: /data/RichpediaMEL/RichpediaMEL_test.json

  batch_size: 128
  num_workers: 8
  text_max_length: 40

  eval_chunk_size: 6000
  eval_batch_size: 20
  embed_update_batch_size: 512

model:
  input_hidden_dim: 512
  input_image_hidden_dim: 768
  hidden_dim: 96
  dv: 96
  dt: 512
  TGLU_hidden_dim: 96
  IDLU_hidden_dim: 96
  CMFU_hidden_dim: 96

trainer:
  accelerator: 'gpu'
  devices: 1
  max_epochs: 20
  num_sanity_val_steps: 0
  check_val_every_n_epoch: 2
  log_every_n_steps: 30

The full environment information is:

absl-py                 1.4.0
aiohttp                 3.8.5
aiosignal               1.3.1
antlr4-python3-runtime  4.9.3
async-timeout           4.0.3
attrs                   23.1.0
cachetools              5.3.1
certifi                 2023.7.22
charset-normalizer      3.2.0
click                   8.1.7
filelock                3.12.3
frozenlist              1.4.0
fsspec                  2023.9.0
google-auth             2.22.0
google-auth-oauthlib    1.0.0
grpcio                  1.57.0
huggingface-hub         0.16.4
idna                    3.4
importlib-metadata      6.8.0
joblib                  1.3.2
Markdown                3.4.4
MarkupSafe              2.1.3
multidict               6.0.4
numpy                   1.24.4
oauthlib                3.2.2
omegaconf               2.2.3
packaging               23.1
Pillow                  9.3.0
pip                     23.2.1
protobuf                4.24.2
pyasn1                  0.5.0
pyasn1-modules          0.3.0
pyDeprecate             0.3.2
pytorch-lightning       1.7.7
PyYAML                  6.0.1
regex                   2023.8.8
requests                2.31.0
requests-oauthlib       1.3.1
rsa                     4.9
sacremoses              0.0.53
setuptools              68.0.0
six                     1.16.0
tensorboard             2.14.0
tensorboard-data-server 0.7.1
tokenizers              0.12.1
torch                   1.11.0
torchmetrics            0.11.0
tqdm                    4.66.1
transformers            4.18.0
typing_extensions       4.7.1
urllib3                 1.26.16
Werkzeug                2.3.7
wheel                   0.38.4
yarl                    1.9.2
zipp                    3.16.2

zhiweihu1103 commented 1 year ago

Thanks for the information about how to run the low-resource experiments. I am very much looking forward to your numerical results; thank you for your efforts. In addition, regarding reproduction on RichpediaMEL, I wonder whether there might be some difference between the code you used to reproduce the results and the code you uploaded, because I ran it twice on this dataset and the results were exactly the same as the logs I uploaded above.

pengfei-luo commented 1 year ago

I can reproduce the results with the code we shared and the data we uploaded to OneDrive. Is there any difference in the pretrained model? I saw you changed the path; I use the one from Hugging Face.

SHA256: a63082132ba4f97a80bea76823f544493bffa8082296d62d71581a4feff1576f MD5: 47767ea81d24718fcc0c8923607792a7
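For reference, here is a minimal sketch (standard library only) for computing both digests of a downloaded file locally; the path in the example is illustrative:

import hashlib

def file_digests(path, chunk_size=1 << 20):
    # Stream the file in chunks so large weight files don't need to fit in memory.
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

# Example (path is illustrative):
# print(file_digests('/checkpoint/clip-vit-base-patch32/pytorch_model.bin'))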

zhiweihu1103 commented 1 year ago

I downloaded the pretrained CLIP from https://huggingface.co/openai/clip-vit-base-patch32/tree/main. I will replace pytorch_model.bin with the one from the link you provided and upload the results tomorrow morning.

zhiweihu1103 commented 1 year ago

But I found that the pytorch_model.bin behind the CLIP link I downloaded is actually exactly the same as the one you provided.

zhiweihu1103 commented 1 year ago

Hi, Pengfei. I may need further help from you, because I still have difficulty reproducing the results on RichpediaMEL, even though I have used the CLIP pretrained model from the URL you gave me (which is actually the same pretrained model I used before). I am uploading my running logs on the three datasets below. wikidiverse_another.txt wikimel_another.txt richpediamel_another.txt

pengfei-luo commented 1 year ago

This is very strange. The other two datasets work fine, only RichpediaMEL has an issue. Maybe you could double-check the RichpediaMEL.tar file you downloaded? I will share an online Wandb report later to show that everything is normal on my end.

RichpediaMEL.tar MD5: 0f499eddde7582428947e45ebb94388f SHA256: 36ac5703e4a9890238daedf039a7b2923a7c4b66c66a6b9cf788db40eabe0447

zhiweihu1103 commented 1 year ago

Here is a screenshot of the contents after decompressing the RichpediaMEL dataset: [screenshot] kb_image has 96073 files and mention_images has 15852 files.

zhiweihu1103 commented 1 year ago

I downloaded the RichpediaMEL dataset from the link you provided: https://mailustceducn-my.sharepoint.com/:u:/g/personal/pfluo_mail_ustc_edu_cn/ERikbOQuoWFHrA_AizcuCbgB8PBOiRqCV4U0lZfxUN-6kg?e=speIdh

pengfei-luo commented 1 year ago

Could you please try upgrading transformers to version 4.27.1? I noticed that the version of transformers might have an impact on the results, although I'm not sure what causes the differences.

pip install transformers==4.27.1 --upgrade

zhiweihu1103 commented 1 year ago

Let me check.

pengfei-luo commented 1 year ago

The Wandb report is here.

zhiweihu1103 commented 1 year ago

You used transformers==4.27.1, right?

pengfei-luo commented 1 year ago

Yes, in the Wandb report run, I used torch==1.11.0 and transformers==4.27.1. Other packages are the same as the requirements. I attempted to downgrade transformers to 4.18.0 and noticed that it did lead to a performance drop. I have no idea why this occurred.

zhiweihu1103 commented 1 year ago

If the performance degradation is due to transformers, then this should not be within the scope of our discussion. As long as the results can be reproduced, everything is good. I'll re-run and give my reproduction results.

zhiweihu1103 commented 1 year ago

Hi, Pengfei. I think it is still difficult for me to reproduce the results on RichpediaMEL. The table below compares the results obtained with different versions of transformers: [screenshot] I have also uploaded the training log for transformers==4.27.1 (richpediamel_new_transformers.txt) and the metrics.csv. I compared my train_loss_epoch on RichpediaMEL with the train_loss_epoch you provided on Wandb and found a huge difference. Was your Wandb training log produced by directly running the code in your open-source repository?

pengfei-luo commented 1 year ago

Just replace the CSV logger with the Wandb logger to enable Wandb logging.

logger = pl.loggers.WandbLogger(project='MIMIC', name=args.run_name)
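Note that pl.loggers.WandbLogger requires the wandb package to be installed and an authenticated account (wandb login), unless you run Wandb in offline mode; the rest of the training script can stay unchanged.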
zhiweihu1103 commented 1 year ago

No, what I mean is: was your Wandb run produced with your current open-source code? Because the problem now is that the RichpediaMEL results cannot be reproduced.

pengfei-luo commented 1 year ago

Yes, I cloned it from GitHub and only modified a few lines related to logging. You can check the information and the code of this run here (the code is available from the left bar).

zhiweihu1103 commented 1 year ago

It's amazing, I can't imagine why it's so hard to reproduce.

zhiweihu1103 commented 1 year ago

Hi, Pengfei. First, I carefully compared the open-source GitHub code with the code you used on Wandb. The only difference is in how the CLIP model's from_pretrained method is called. The open-source code on GitHub is:

self.tokenizer = CLIPProcessor.from_pretrained(self.args.pretrained_model).tokenizer

The code used in the Wandb run is:

self.tokenizer = CLIPProcessor.from_pretrained(self.args.pretrained_model, local_files_only=True).tokenizer

But I think this is not the main problem, because after I added the local_files_only=True parameter, I found the result was the same.

Then I created an environment exactly matching the requirements.txt recorded by Wandb, and the results were exactly the same as my previous ones, indicating that the difference in results is not caused by environment problems.

So I need to confirm now: is the RichpediaMEL dataset you are using the same version you uploaded? Now that all the code and environment information are completely consistent, the performance difference is hard to explain.

zhiweihu1103 commented 1 year ago

Here is a screenshot of the contents after decompressing the RichpediaMEL dataset: [screenshot] kb_image has 96073 files and mention_images has 15852 files.

Here are the statistics for the RichpediaMEL dataset I used.

pengfei-luo commented 1 year ago
[screenshot]

Maybe you can check if the MD5 values of all the files match mine?

ba086b054bf52d549f2a79503c76704a  kb_entity.json
8059b7aa89a9314d5dc38607a8685eeb  qid2id.json
831cdd92d70a93ea8a442798ec2fcde1  RichpediaMEL_dev.json
9e07e5e970e01079d256311e5ac10bd8  RichpediaMEL_test.json
e1d0b2adb2a1114cefa63860ffa23982  RichpediaMEL_train.json
961efc263bc8e2e7b257a28e8e703633  kb_image.zip
474c594ce8a95aa5dc9222365db0044e  mention_images.zip

pengfei-luo commented 1 year ago

The parameter local_files_only=True ensures that local files are used, and we have already confirmed that the model weights are consistent. I think this won't have any impact.

zhiweihu1103 commented 1 year ago

[screenshot]

zhiweihu1103 commented 1 year ago

You can ignore the .pkl files; I found a difference between kb_image and mention_images.

pengfei-luo commented 1 year ago

Can you provide the MD5 values for kb_image.zip and mention_images.zip? I directly extracted these two ZIP files.

zhiweihu1103 commented 1 year ago

Wait a few minutes, I deleted the original file after decompressing it, and I need to download it again.

zhiweihu1103 commented 1 year ago

[screenshot]

zhiweihu1103 commented 1 year ago

I can't think of any other reason why it is difficult to reproduce. The sizes of the .zip files are the same, but the sizes after decompression are different?

zhiweihu1103 commented 1 year ago

I checked your running log on Wandb, and your loss is obviously much lower than what I reproduced.

pengfei-luo commented 1 year ago

It seems all the files are normal. The difference in folder sizes may be due to differences in how the operating system organizes files.

pengfei-luo commented 1 year ago

Perhaps you can try changing some hyperparameters, such as the random seed, learning rate, and batch size, to see if they have an impact on the loss. If you have access to other servers, maybe you can try configuring the environment and running it on other servers. I don't know what's causing the inability to reproduce the results. All the results on my end are normal.

zhiweihu1103 commented 1 year ago

I can try it on other machines, but judging from my experience running your code, as long as the random seed is fixed, the results will be exactly the same every time.

zhiweihu1103 commented 1 year ago

I think I can add some new information. I originally ran the code on a V100 32G GPU. Now I have tried it on an A6000 and found that the final result is almost the same as on the V100. Have you made any other modifications? The hyperparameters I used are completely consistent with the yaml you provided.