Closed: Hatins closed this issue 1 year ago
Hi @Hatins, this happens because the logger uses the online wandb service, which relies on the wandb cloud API reaching their servers. Apparently that connection is not always stable, so these issues occur.
A workaround is to use the default wandb logger from PyTorch Lightning, or any other logger. Or just wait until the issue disappears (which is what I usually do).
Hi @magehrig, thank you for answering so quickly. As you said, it's a really bad problem: I've been hitting this bug for 3 days now, and every time it interrupts the code while it is running. If I use wandb's offline mode and upload the data when the network is good, will the result be the same?
By the way, if you know how to enable offline mode, please tell me, since I don't know which way is the most suitable (there are too many different ways on the Internet to set wandb to offline...).
For now I did it like this:
self._wandb_init = dict(
    name=name,
    project=project,
    group=group,
    id=wandb_id,
    resume="allow",
    save_code=True,
    mode='offline',
)
However, I then get this error:
File "/home/zht/python_project/RVT_OWOD_v1/loggers/wandb_logger.py", line 236, in _num_logged_artifact
public_run = self._get_public_run()
File "/home/zht/python_project/RVT_OWOD_v1/loggers/wandb_logger.py", line 230, in _get_public_run
runpath = experiment._entity + '/' + experiment._project + '/' + experiment._run_id
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
Best! Haitins
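The traceback above fails on the first `+`, which indicates that `experiment._entity` is `None` when the run is offline. A minimal sketch of a defensive variant follows, assuming the custom logger can simply skip the public-run lookup when the run path is incomplete. The helper name `build_run_path` is hypothetical and not part of the repository.

```python
from typing import Optional


def build_run_path(entity: Optional[str],
                   project: Optional[str],
                   run_id: Optional[str]) -> Optional[str]:
    """Return 'entity/project/run_id', or None if any part is missing.

    Hypothetical helper (not in the repository): a None part is what
    triggers the TypeError shown in the traceback above.
    """
    parts = (entity, project, run_id)
    if any(part is None for part in parts):
        return None  # caller should skip the public-API lookup
    return "/".join(parts)
```

A caller such as `_get_public_run` could then return early when the path is `None` instead of raising. Whether the rest of the logger tolerates a missing public run is an assumption that would need checking.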
You can swap the logger I wrote for any of the loggers that PyTorch Lightning provides. You have to replace this line to initialize the new logger.
Hi @magehrig, thanks for your advice! I understand what you mean, and I changed the line you mentioned to
logger = WandbLogger(project="RVT_OWOD", group="version_1")
However, I still get an error. So maybe the code still needs some revisions to work with that logger, but I don't know what I need to modify, so I may need your help. I was also wondering whether, if it's not too much trouble, you could adjust your code so that it keeps its full functionality.
You can remove these asserts (the ones that enforce that type(logger) is WandbLogger) and try again. In general, I did not write the code with other loggers in mind, so you have to improvise slightly here. Let me know how it goes.
Got it. I followed the instructions and removed these asserts, but then I got an error:
File "/home/zht/Python_project/RVT_OWOD_v1/callbacks/detection.py", line 98, in on_validation_epoch_end_custom
logger.log_images(key='val/predictions',
AttributeError: 'WandbLogger' object has no attribute 'log_images'
So I commented out that call as well:
# logger.log_images(key='val/predictions',
# images=merged_img,
# caption=captions)
Now the code seems to be running smoothly, but I will need some time to verify whether any further errors occur. I would also like to know whether removing this piece of code has any negative impact.
Best! Hatins
The obvious consequence is that you are not logging the "merged_img" anymore. That should be fine if you don't want it to be logged. Because you exchanged the logger, you will have to adapt the code slightly to reload from the checkpoint and resume the training.
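Instead of commenting the call out, the callback could skip image logging only when the logger lacks the method. The following is a minimal sketch under that assumption. The helper name `log_images_if_supported` is hypothetical, and the keyword arguments mirror the `logger.log_images(...)` calls quoted in this thread rather than any documented API.

```python
from typing import Any, Optional, Sequence


def log_images_if_supported(logger: Any,
                            key: str,
                            images: Sequence[Any],
                            caption: Optional[Sequence[str]] = None,
                            step: Optional[int] = None) -> bool:
    """Call logger.log_images(...) when available; otherwise do nothing.

    Returns True if the images were logged, False if the call was skipped.
    """
    log_fn = getattr(logger, "log_images", None)
    if not callable(log_fn):
        return False  # the logger in use has no log_images method
    kwargs = {"key": key, "images": images, "caption": caption}
    if step is not None:
        kwargs["step"] = step  # only forwarded when the caller provides it
    log_fn(**kwargs)
    return True
```

With this, the same callback code runs with both the custom logger (images are logged) and a replacement logger (images are silently skipped).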
Hi @magehrig I wanted to let you know that I've understood your guidance. As a result, the code is now running smoothly, and if needed, I am fully prepared to make any necessary revisions myself!
I extend my heartfelt appreciation for your invaluable assistance once more! Wishing you happiness every day!
I'm happy for your success, Hatins. Unfortunately, I have not managed to reproduce your solution. Could you document the changes needed for an offline run here? It is quite time-consuming to work them out. I assume there are not many changes, and it would be very nice of you to do that for the users who follow. This open issue might be the best place for it. To give back what you got.
Thank you, magehrig, for sharing your great work.
@vanAken Hi vanAken, don't worry, I'd like to help! As a first step, import the default WandbLogger from pytorch_lightning:
from pytorch_lightning.loggers import WandbLogger
To make sure wandb can identify your account, set the related environment variables at the beginning:
import os
os.environ["WANDB_API_KEY"] = 'xxxxxxxxxxxxxxxxx'
os.environ["WANDB_MODE"] = "offline"
Then replace the line:
# logger = get_wandb_logger(config)
with
logger = WandbLogger(project=config.wandb.project_name, name='xxx', group=config.wandb.group_name)
Note that these steps should be done in train.py.
After that, comment out the visualization code that @magehrig implemented in callbacks/detection.py (lines 65-68 and lines 98-100):
# logger.log_images(key='train/predictions',
# images=merged_img,
# caption=captions,
# step=global_step)
# logger.log_images(key='val/predictions',
# images=merged_img,
# caption=captions)
After that, you can run the code in offline mode; however, the visualization function will be unavailable.
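The environment setup from the steps above can be gathered into one small function. This is only a sketch: the function name `configure_wandb_offline` is hypothetical, and it assumes the variables are set before the logger is created in train.py.

```python
import os
from typing import Optional


def configure_wandb_offline(api_key: Optional[str] = None) -> None:
    """Set the environment variables described in the steps above.

    Intended to run before the logger (and therefore wandb) is initialized.
    """
    if api_key is not None:
        os.environ["WANDB_API_KEY"] = api_key
    os.environ["WANDB_MODE"] = "offline"


# Sketch of the intended use in train.py:
# configure_wandb_offline(api_key="xxxxxxxxxxxxxxxxx")
# logger = WandbLogger(project=config.wandb.project_name, name='xxx',
#                      group=config.wandb.group_name)
```

Runs recorded in offline mode are stored locally and can be uploaded later with the `wandb sync` command once the network is available. Whether the uploaded run matches an online run in every detail has not been verified in this thread.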
Thanks to Hatins for your help. Now it's training. After 20 h it has done 100,000 iterations and the first epoch isn't finished yet.
As you mentioned above, the asserts need to be removed in two places in callbacks/viz_base.py! Comment out line 91 and line 155:
#assert isinstance(logger, WandbLogger)
Thanks to Hatins, you are doing a great job at the UZH.
Hi @magehrig, I ran into a new problem when using wandb, which may be caused by a network error. The error includes the following information:
I successfully ran your code a long time ago, and this error has appeared only recently. I know the problem comes from network issues, but I was wondering whether you have a solution for dealing with it.
Best! Haitins