mehdidc closed this issue 1 year ago
Thanks for pointing this out. I believe this is due to this line https://github.com/mlfoundations/open-diffusion/blob/a06305ae548e1dab7a30cc11862632b255a06e6d/train.py#L642. In it, we reconstruct the scheduler for the save pipeline using the Hugging Face hub id, which causes the code to look up its config on HF. You might be able to fix it by pointing it to a local copy (e.g. `config.model.pretrained` or `config.model.scheduler.pretrained`, whichever exists). Let me know if this fixes the issue!
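Roughly, the idea is something like this (a sketch; the helper name and the exact config field names are assumptions about the config layout, not the actual `train.py` code):

```python
from types import SimpleNamespace


def resolve_scheduler_source(config):
    # Prefer a local scheduler checkpoint path if the config provides one,
    # falling back to the model's local checkpoint, so that the scheduler
    # config is read from disk instead of from the HF hub.
    scheduler_cfg = getattr(config.model, "scheduler", None)
    local = getattr(scheduler_cfg, "pretrained", None) if scheduler_cfg else None
    return local or config.model.pretrained
```

The returned path would then be passed to the scheduler's `from_pretrained` call in place of the hub id.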
Ah I see, thanks, will give it a try
@vkramanuj I was thinking: can't we just pass the noise_scheduler (https://github.com/mlfoundations/open-diffusion/blob/main/train.py#L237) that is already instantiated to the save function and use it? It worked fine for me, but I want to be sure that it won't cause an issue. If it is fine, I can open a PR.
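Something like this (a sketch; the function and argument names here are hypothetical, not the actual `train.py` signatures):

```python
def save_pipeline(pipeline_cls, components, noise_scheduler, output_dir):
    # Build the save pipeline from the scheduler that train.py already
    # instantiated for training, instead of reconstructing it from the
    # hub id (which is what triggers the network lookup).
    pipeline = pipeline_cls(scheduler=noise_scheduler, **components)
    pipeline.save_pretrained(output_dir)
    return pipeline
```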
@mehdidc One concern here is that people generally use different schedulers at train and inference time. Certain schedulers (e.g. PNDM https://arxiv.org/pdf/2202.09778.pdf and Euler Discrete https://arxiv.org/abs/2206.00364) promise better generations with fewer timesteps. It's unclear whether this matters for basic validation, but this is how people generate in practice.
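The usual pattern is to rebuild the inference scheduler from the training scheduler's config, e.g. `pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)` in diffusers. Sketched here with stand-in classes instead of the real diffusers ones, just to show the shape of the swap:

```python
class TrainScheduler:
    # Stands in for the DDPM-style scheduler used during training.
    def __init__(self, config):
        self.config = config


class InferenceScheduler:
    # Stands in for a fast sampler such as PNDM or Euler Discrete.
    def __init__(self, config):
        self.config = config

    @classmethod
    def from_config(cls, config):
        # Reuse the training scheduler's config (betas, timesteps, ...).
        return cls(config)


class Pipeline:
    def __init__(self, scheduler):
        self.scheduler = scheduler


pipe = Pipeline(TrainScheduler({"num_train_timesteps": 1000}))
# The swap: replace the training scheduler with an inference one
# built from the same config.
pipe.scheduler = InferenceScheduler.from_config(pipe.scheduler.config)
```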
Oh I see, that makes sense, thanks for clarifying. OK, so the inference scheduler could eventually be parametrized in the config as well, and the safety checker too.
So, to avoid confusion in the config file, I would say we should separate the two, e.g. `training_scheduler` and `inference_scheduler`.
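Concretely, the yaml config could look something like this (just a sketch; the key names are hypothetical):

```yaml
model:
  pretrained: /path/to/local/checkpoint
  training_scheduler:
    name: DDPMScheduler
  inference_scheduler:
    name: PNDMScheduler
    num_inference_steps: 50
```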
That sounds good to me. Seems like this solves the issue, I will submit a PR to that effect (although feel free to submit a PR if you've already done this in a local copy).
Cool, I do have a local branch, let me quickly finish it
It would be nice to avoid the need for an internet connection, as it is usually blocked on supercomputers. Two things I observed that need an internet connection:

1. `wandb.init`, so that could be an option in the yaml config file (disabled by default).
2. Downloading model/scheduler configs from the HF hub, which can be avoided with `export TRANSFORMERS_CACHE=<cache_folder>` and `export TRANSFORMERS_OFFLINE=1`, so this just needs to be documented.

In my case, even after using those, I still got a message at each validation/generation step. It did not have any consequence, the job did not hang or anything, but I did not find a way to get rid of it.
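For reference, the environment setup I ended up with (a sketch; `WANDB_MODE=offline` is an alternative to adding a config flag, and the cache path is a placeholder):

```shell
# Run fully offline on a compute node without internet access.
export TRANSFORMERS_CACHE=/path/to/cache   # populate once from a login node with internet
export TRANSFORMERS_OFFLINE=1              # never query the HF hub at runtime
export HF_DATASETS_OFFLINE=1               # same for HF datasets, if used
export WANDB_MODE=offline                  # wandb.init logs locally; sync later with `wandb sync`
```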