mehdidc closed this issue 1 year ago
Thanks for pointing this out. I believe this is due to this line https://github.com/mlfoundations/open-diffusion/blob/a06305ae548e1dab7a30cc11862632b255a06e6d/train.py#L642. In it, we reconstruct the scheduler for the save pipeline using the Hugging Face hub id, which causes the code to look up its config on HF. You might be able to fix it by pointing it to a local copy (e.g. `config.model.pretrained` or `config.model.scheduler.pretrained`, whichever exists). Let me know if this fixes the issue!
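Roughly, the idea is something like this (a sketch; the helper name and the exact config field names are assumptions about the config layout, not the actual `train.py` code):

```python
from types import SimpleNamespace


def resolve_scheduler_source(config):
    # Prefer a local scheduler checkpoint path if the config provides one,
    # falling back to the model's local checkpoint, so that the scheduler
    # config is read from disk instead of from the HF hub.
    scheduler_cfg = getattr(config.model, "scheduler", None)
    local = getattr(scheduler_cfg, "pretrained", None) if scheduler_cfg else None
    return local or config.model.pretrained
```

The returned path would then be passed to the scheduler's `from_pretrained` call in place of the hub id.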
Ah I see, thanks, will give it a try
@vkramanuj I was thinking: can't we just pass the noise_scheduler (https://github.com/mlfoundations/open-diffusion/blob/main/train.py#L237) that is already instantiated to the save function and use it? It worked fine for me, but I want to be sure that it won't cause an issue. If it is fine, I can open a PR.
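Something like this (a sketch; the function and argument names here are hypothetical, not the actual `train.py` signatures):

```python
def save_pipeline(pipeline_cls, components, noise_scheduler, output_dir):
    # Build the save pipeline from the scheduler that train.py already
    # instantiated for training, instead of reconstructing it from the
    # hub id (which is what triggers the network lookup).
    pipeline = pipeline_cls(scheduler=noise_scheduler, **components)
    pipeline.save_pretrained(output_dir)
    return pipeline
```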
@mehdidc One concern here is that people generally use different schedulers at train and inference time. Certain schedulers (e.g. PNDM https://arxiv.org/pdf/2202.09778.pdf and Euler Discrete https://arxiv.org/abs/2206.00364) promise better generations with fewer timesteps. It's unclear whether this matters for basic validation, but this is how people generate in practice.
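The usual pattern is to rebuild the inference scheduler from the training scheduler's config, e.g. `pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)` in diffusers. Sketched here with stand-in classes instead of the real diffusers ones, just to show the shape of the swap:

```python
class TrainScheduler:
    # Stands in for the DDPM-style scheduler used during training.
    def __init__(self, config):
        self.config = config


class InferenceScheduler:
    # Stands in for a fast sampler such as PNDM or Euler Discrete.
    def __init__(self, config):
        self.config = config

    @classmethod
    def from_config(cls, config):
        # Reuse the training scheduler's config (betas, timesteps, ...).
        return cls(config)


class Pipeline:
    def __init__(self, scheduler):
        self.scheduler = scheduler


pipe = Pipeline(TrainScheduler({"num_train_timesteps": 1000}))
# The swap: replace the training scheduler with an inference one
# built from the same config.
pipe.scheduler = InferenceScheduler.from_config(pipe.scheduler.config)
```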
Oh I see, that makes sense, thanks for clarifying. OK, so the inference scheduler could eventually be parametrized in the config as well, and the safety checker too.
So, to avoid confusion in the config file, I would say we should separate the two, e.g. `training_scheduler` and `inference_scheduler`.
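Concretely, the yaml config could look something like this (just a sketch; the key names are hypothetical):

```yaml
model:
  pretrained: /path/to/local/checkpoint
  training_scheduler:
    name: DDPMScheduler
  inference_scheduler:
    name: PNDMScheduler
    num_inference_steps: 50
```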
That sounds good to me. Seems like this solves the issue, I will submit a PR to that effect (although feel free to submit a PR if you've already done this in a local copy).
Cool, I do have a local branch, let me quickly finish it
It would be nice to avoid the need for an internet connection, as it is usually blocked on supercomputers. Two things I observed that need an internet connection:

1. `wandb.init`, so that could be an option in the yaml config file (disabled by default).
2. Downloading model/scheduler configs from the HF hub, which can be avoided with `export TRANSFORMERS_CACHE=<cache_folder>` and `export TRANSFORMERS_OFFLINE=1`, so this just needs to be documented.

In my case, even after using those, I still got a message at each validation/generation step. It did not have any consequence, the job did not hang or anything, but I did not find a way to get rid of it.
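For reference, the environment setup I ended up with (a sketch; `WANDB_MODE=offline` is an alternative to adding a config flag, and the cache path is a placeholder):

```shell
# Run fully offline on a compute node without internet access.
export TRANSFORMERS_CACHE=/path/to/cache   # populate once from a login node with internet
export TRANSFORMERS_OFFLINE=1              # never query the HF hub at runtime
export HF_DATASETS_OFFLINE=1               # same for HF datasets, if used
export WANDB_MODE=offline                  # wandb.init logs locally; sync later with `wandb sync`
```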