libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0
2.87k stars 179 forks

stuck in the loader when using only cpu #350

Closed NaufalRezkyA closed 1 year ago

NaufalRezkyA commented 1 year ago

Previously, I was able to run FFCV on the ImageNet dataset smoothly using a GPU, but I want to try running it using only the CPU, by changing torch.device to cpu. However, this causes the loader to get stuck (it loops forever).

Here is my loader:

    import numpy as np
    import torch

    from ffcv.fields.decoders import (CenterCropRGBImageDecoder, IntDecoder,
                                      RandomResizedCropRGBImageDecoder)
    from ffcv.loader import Loader, OrderOption
    from ffcv.transforms import (NormalizeImage, RandomHorizontalFlip, Squeeze,
                                 ToDevice, ToTensor, ToTorchImage)

    IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]) * 255
    IMAGENET_STD = np.array([0.229, 0.224, 0.225]) * 255

    # ToDevice(torch.device('cuda:0'), non_blocking=True),
    train_image_pipeline = [
        RandomResizedCropRGBImageDecoder((224, 224)),
        RandomHorizontalFlip(),
        ToTensor(),
        ToDevice(torch.device('cpu')),
        ToTorchImage(),
        NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)
    ]

    val_image_pipeline = [
        CenterCropRGBImageDecoder((224, 224), ratio=224/256),
        ToTensor(),
        ToDevice(torch.device('cpu')),
        ToTorchImage(),
        NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)
    ]

    label_pipeline = [IntDecoder(), ToTensor(), Squeeze(),
                      ToDevice(torch.device('cpu'))]

    # print("args.gpu ->", args.gpu)
    train_loader = Loader('/home/cc/data/train_500_0.50_90.ffcv', batch_size=args.batch_size, num_workers=args.workers,
                          order=OrderOption.RANDOM,
                          pipelines={'image': train_image_pipeline, 'label': label_pipeline})

    val_loader = Loader('/home/cc/data/val_500_0.50_90.ffcv', batch_size=args.batch_size, num_workers=args.workers,
                        order=OrderOption.SEQUENTIAL,
                        pipelines={'image': val_image_pipeline, 'label': label_pipeline},
                        )

I tried to trace it and found that the hang occurs in the group_operations() function in graph.py. It happens because when the pipeline reaches NormalizeImage, the operation state is replaced and jit_mode is changed to True. Since node.is_jitted is True while jitted_stage is False, the node can never be consumed by the current stage, which causes the loop to run forever.

Here is group_operations from the Graph class in ffcv/pipeline/graph.py (with my debug prints added):

    def group_operations(self):
        current_front: Set[Node] = set()
        next_front: Set[Node] = set()
        stages = []

        print("self.adjacency_list ->", self.adjacency_list)
        print("self.root_nodes ->", self.root_nodes)
        for node in self.root_nodes.keys():
            current_front.add(node)

        while current_front:
            print("current_front ->", current_front)
            current_stage = list()
            print("current_stage", current_stage)
            jitted_stage = len(stages) % 2 == 0

            stopped = 0
            while current_front:
                # if stopped == 25:
                #     quit()
                print("current_front2 ->", current_front)
                node = current_front.pop()
                print("node ->", node)
                print("node.is_jitted ->", node.is_jitted)
                print("jitted_stage ->", jitted_stage)
                if node.is_jitted == jitted_stage or node.is_jitted is None:
                    print("jitted..")
                    current_stage.append(self.node_to_id[node])
                    current_front.update(set(self.adjacency_list[node]))
                else:
                    print("not jitted..")
                    next_front.add(node)

                stopped += 1
                print("current_stage ->", current_stage)

            stages.append(current_stage)
            print("stages-graph ->", stages)
            current_front = next_front

        return stages
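The trace is consistent with an aliasing problem visible in the pasted code: next_front is created once and never reset, so after the first `current_front = next_front` assignment both names refer to the same set, and a node whose is_jitted does not match the current stage's parity is pushed straight back into the front it was just popped from. The following self-contained sketch reproduces that behavior; Node and the toy graph here are hypothetical stand-ins, not FFCV's actual classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    is_jitted: bool


def group_operations(roots, adjacency, max_pops=50):
    current_front = set(roots)
    next_front = set()              # created once, never reset
    stages = []
    pops = 0
    while current_front:
        current_stage = []
        jitted_stage = len(stages) % 2 == 0
        while current_front:
            pops += 1
            if pops > max_pops:     # guard so this sketch terminates;
                return None         # the unguarded loop would spin forever
            node = current_front.pop()
            if node.is_jitted == jitted_stage or node.is_jitted is None:
                current_stage.append(node.name)
                current_front.update(adjacency.get(node, ()))
            else:
                # After the first `current_front = next_front` below,
                # next_front IS current_front, so this re-adds the node
                # we just popped and the inner loop never drains.
                next_front.add(node)
        stages.append(current_stage)
        current_front = next_front  # both names now alias one set
    return stages


# Pipeline shaped like the CPU case in this issue: a jitted decoder,
# a non-jitted ToTorchImage-style op, then a jitted NormalizeImage-style op.
decode = Node("decode", True)
to_image = Node("to_image", False)
normalize = Node("normalize", True)
adj = {decode: [to_image], to_image: [normalize]}

print(group_operations([decode], adj))  # → None: the guard tripped
```

In this sketch, re-creating next_front at the top of the outer loop lets the grouping terminate with alternating stages; whether that is the right fix for FFCV itself would need verification upstream.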

And here is the code from ffcv/transforms/normalize.py where jit_mode is set to True when the device is CPU:

    def declare_state_and_memory(self, previous_state: State) -> Tuple[State, Optional[AllocationQuery]]:

        if previous_state.device == ch.device('cpu'):
            new_state = replace(previous_state, jit_mode=True, dtype=self.dtype)
            return new_state, AllocationQuery(
                shape=previous_state.shape,
                dtype=self.dtype,
                device=previous_state.device
            )

     ...

And this is the output when it gets stuck:

(ffcv) cc@gpufs-naufal-rtx6000-ffcv:~/gpufs/ffcv$ python main-original-ffcv-v2-cuda-emulatorv0-cpu.py -a resnet18 --lr 0.1 ~/data/imagenette2/ --epochs 2
.
.
.
  current_front2 -> {TransformerNode(<ffcv.transforms.normalize.NormalizeImage object at 0x7ff865a52d00>)}
  node -> TransformerNode(<ffcv.transforms.normalize.NormalizeImage object at 0x7ff865a52d00>)
  node.is_jitted -> True
  jitted_stage -> False
  not jitted..
  current_stage -> [2, 7, 8, 3, 4, 9]
  current_front2 -> {TransformerNode(<ffcv.transforms.normalize.NormalizeImage object at 0x7ff865a52d00>)}
  node -> TransformerNode(<ffcv.transforms.normalize.NormalizeImage object at 0x7ff865a52d00>)
  node.is_jitted -> True
  jitted_stage -> False
  not jitted..
  current_stage -> [2, 7, 8, 3, 4, 9]
  current_front2 -> {TransformerNode(<ffcv.transforms.normalize.NormalizeImage object at 0x7ff865a52d00>)}
  node -> TransformerNode(<ffcv.transforms.normalize.NormalizeImage object at 0x7ff865a52d00>)
  node.is_jitted -> True
  jitted_stage -> False
  not jitted..
  current_stage -> [2, 7, 8, 3, 4, 9]
.
.
.
(looping forever)

Any suggestions so I can use the CPU in this case?
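One possible workaround, offered as an untested sketch: drop NormalizeImage from the FFCV pipeline entirely (so no CPU transform flips jit_mode) and apply the same normalization arithmetic to each batch in the training loop. The helper below uses NumPy for illustration; the same arithmetic ports directly to the torch tensors the loader yields (`images.float()` in place of `astype`):

```python
import numpy as np

# Same constants as the pipeline above, reshaped to broadcast
# over (N, C, H, W) batches.
IMAGENET_MEAN = (np.array([0.485, 0.456, 0.406]) * 255).reshape(1, 3, 1, 1)
IMAGENET_STD = (np.array([0.229, 0.224, 0.225]) * 255).reshape(1, 3, 1, 1)


def normalize_batch(images):
    """Normalize an (N, C, H, W) batch with channel values in [0, 255]."""
    return (images.astype(np.float32) - IMAGENET_MEAN) / IMAGENET_STD


# e.g. inside the training loop: images = normalize_batch(images)
```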

mengwanguc commented 1 year ago

Hi @GuillaumeLeclerc @andrewilyas , do you have any suggestions on how we should run FFCV on a CPU-only node?

Thanks! Meng

andrewilyas commented 1 year ago

Hi @mengwanguc ! We haven't tried running FFCV on a CPU-only node and it isn't officially supported - our team has very low bandwidth at the moment and won't be able to investigate, but if you make any headway we are happy to add documentation!

mengwanguc commented 11 months ago

> Hi @mengwanguc ! We haven't tried running FFCV on a CPU-only node and it isn't officially supported - our team has very low bandwidth at the moment and won't be able to investigate, but if you make any headway we are happy to add documentation!

Hi @andrewilyas , thanks for the reply!

I'm opening this issue again as I have some follow-up questions:

- Is there anything in FFCV that would conceptually prevent us from using FFCV on CPU-only nodes (e.g. some optimization/functionality/code that is deeply coupled with CUDA/GPU to compile/install/run)?

- Or do you think it is conceptually doable, but it would take time to figure out the correct environment?

mengwanguc commented 11 months ago

Looks like I cannot reopen this issue, so I'm opening a new issue for this question: https://github.com/libffcv/ffcv/issues/359