Closed NaufalRezkyA closed 1 year ago
Hi @GuillaumeLeclerc @andrewilyas, do you have any suggestions on how we should run FFCV on a CPU-only node?
Thanks! Meng
Hi @mengwanguc! We haven't tried running FFCV on a CPU-only node, and it isn't officially supported. Our team has very low bandwidth at the moment and won't be able to investigate, but if you make any headway, we are happy to add documentation!
Hi @andrewilyas , thanks for the reply!
I'm opening this issue again as I have some follow-up questions:
Is there anything in FFCV that would conceptually prevent us from using it on CPU-only nodes (e.g. some optimization, functionality, or code that is so deeply coupled with CUDA/GPU that it cannot be compiled, installed, or run without it)?
Or do you think it is conceptually doable, but would take time to figure out the correct environments?
It looks like I cannot reopen this issue, so I'm opening a new issue for this question: https://github.com/libffcv/ffcv/issues/359
Previously, I was able to run FFCV on the ImageNet dataset smoothly using a GPU. Now I want to run it using only the CPU, so I changed torch.device to cpu. This causes the program to hang (it loops forever).
Here is my loader:
I traced it and found that the hang occurs in the group_operation() function in graph.py. When the pipeline reaches Normalize, the operation state is replaced and jit_mode is set to True. Because node.is_jitted = True while jitted_stage = False, the jitted condition is never satisfied, which causes an infinite loop.
Here is group_operation() from the Graph class in ffcv/pipeline/graph.py:
Here is the code from ffcv/transforms/normalize.py, where jit_mode is set to True if we are using the CPU:
And this is the output when it got stuck:
Any suggestions so I can use CPU in this case?
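To make the failure mode I described easier to discuss, here is a toy model of a stage-grouping loop. This is not FFCV's code: `group_stages`, the `alternate` flag, and the op names are all made up for illustration. It only assumes what I observed above, namely that a node is scheduled when its `is_jitted` flag matches the current `jitted_stage` flag. If the stage flag never flips, a jitted node can never be scheduled; the sketch raises an error at that point instead of spinning forever.

```python
from typing import List, Tuple


def group_stages(ops: List[Tuple[str, bool]], alternate: bool = True) -> List[List[str]]:
    """Toy model (hypothetical, not FFCV's implementation).

    ops is an ordered list of (name, is_jitted) pairs. Consecutive ops whose
    is_jitted flag equals the current jitted_stage flag form one stage.
    With alternate=False the stage flag never flips, which mimics the state
    reported above (node.is_jitted = True while jitted_stage = False).
    """
    pending = list(ops)
    stages: List[List[str]] = []
    jitted_stage = False

    while pending:
        stage: List[str] = []
        # Schedule every op at the front that matches the current stage flag.
        while pending and pending[0][1] == jitted_stage:
            stage.append(pending.pop(0)[0])

        if stage:
            stages.append(stage)
        elif not alternate:
            # Nothing was scheduled and the flag cannot change:
            # without this guard the outer loop would never end.
            name, flag = pending[0]
            raise RuntimeError(
                f"no progress: {name}.is_jitted={flag}, jitted_stage={jitted_stage}"
            )

        if alternate:
            jitted_stage = not jitted_stage

    return stages


# Hypothetical pipeline: two non-jitted ops followed by one jitted op.
pipeline = [("decode", False), ("to_tensor", False), ("normalize", True)]
print(group_stages(pipeline))
```

Calling `group_stages(pipeline, alternate=False)` raises `RuntimeError` on the `normalize` entry, which is the situation I seem to hit on CPU. A similar no-progress guard in the real loop would at least turn the hang into an error message.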