topal-team / rockmate

GNU General Public License v3.0

Question in inspection #3

Open vivian96385 opened 7 months ago

vivian96385 commented 7 months ago

Hi,

I'm currently reading through the graph builder code, and I have a few questions.

In rkgb/src/utils/def_inspection.py, the inspector class has two functions, fct_fgt_fwd and fct_del_fwd. What is the difference between them? I've read the appendix of the paper, but I'm still a bit confused.

    def fct_fgt_fwd():
        for tar in self.sn.tensor_targets:
            val = self.tmp_local[tar]
            val.data = torch.zeros(0,device=self.device)
            if val._base is not None:
                val._base.data = torch.empty(0,device=self.device)

    def fct_del_fwd():
        code = ""
        for tar in self.sn.tensor_targets:
            code += f"del {tar};"  # del tensor1;del tensor2;...
        self.code_del_fwd = code
        exec(self.code_del_fwd, self.our_global, self.tmp_local)

Separately, I'm wondering why inspection is not supported on the CPU.

Looking forward to your response! Thanks!

TheotimeLH commented 7 months ago

Hi!

About the memory values: as you saw, mem_run is the difference in memory usage before and after creating the variable, e.g. y = f(x). mem_fgt is then the memory released when we delete y.data, and mem_del the memory released when we do del y. When y is created, it comes with a tensor, stored in y.data, and some other saved values inside y.grad_fn. Thus mem_run = mem_fgt + "saved_tensors". But yes, mem_del is completely useless, since mem_del == mem_run. It was created at a time when we weren't sure about the best way to "forget" or "delete" (for instance, we were wondering whether some things were stored in x after creating y), and it has remained despite its uselessness. It will be removed in future versions.
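To make the three quantities concrete, here is a minimal sketch, not the rkgb implementation. The helper name measure_forward_memory is hypothetical, and it assumes a CUDA device, since it reads torch.cuda.memory_allocated. It also measures "forget" and "delete" one after the other on the same y, which is a simplification.

```python
import torch


def measure_forward_memory(fn, x):
    """Measure (mem_run, mem_fgt, mem_del) in bytes for y = fn(x).

    Hypothetical helper for illustration. Returns None when x is not on a
    CUDA device, because the allocator counter used here is CUDA-only.
    """
    if not x.is_cuda:
        return None
    device = x.device

    def allocated():
        # Wait for pending kernels so the counter reflects the finished work.
        torch.cuda.synchronize(device)
        return torch.cuda.memory_allocated(device)

    before = allocated()
    y = fn(x)
    after_run = allocated()

    # "Forget": swap y.data for an empty tensor; y and y.grad_fn stay alive,
    # so whatever y.grad_fn saved for the backward pass is still allocated.
    y.data = torch.zeros(0, device=device)
    after_fgt = allocated()

    # "Delete": drop the last reference to y, which also releases y.grad_fn.
    del y
    after_del = allocated()

    mem_run = after_run - before      # allocated by creating y
    mem_fgt = after_run - after_fgt   # released by forgetting y.data
    mem_del = after_run - after_del   # released by deleting y entirely
    return mem_run, mem_fgt, mem_del


if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
    # The intermediate t * 2 is saved by the backward node of sin, so it
    # counts towards mem_run but is not released by the "forget" step.
    print(measure_forward_memory(lambda t: (t * 2).sin(), x))
```

In this sketch, mem_run - mem_fgt plays the role of the "saved_tensors" term, and mem_del should match mem_run as long as nothing else holds a reference to y.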

Speaking of which, to save you some time, we'd like to let you know that we have been working on a new version. In particular, on the graph building side, I rewrote all the files to make them more readable, remove useless old pieces of code, make the code more general, fix some bugs, etc. This new version will keep the same general structure, with some additional types of graphs, but the names will change. I'd also like to apologize, as the current code can be hard to understand. We're looking forward to sharing the new version as soon as possible, but unfortunately it will take a little longer. In the meantime, don't hesitate to contact us if you have any further questions.

Concerning CPU inspection, the initial objective was to precisely control GPU memory usage, as rkgb was written for Rockmate. But I agree that it would be nice to have the information even when we're on the CPU, so we'll see what's possible for the new version. What we're missing at the moment is a CPU equivalent of torch.cuda.max_memory_allocated to get the maximum memory usage, but we will look for a solution.
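For illustration only, one coarse CPU-side stand-in for the CUDA peak-memory counter mentioned above is the peak resident set size of the whole process, read through the standard resource module. This is an assumption about what could work, not something rkgb does: it is Unix-only, it covers the entire process rather than a single operation, and unlike the CUDA counters it cannot be reset. The helper names below are hypothetical.

```python
import resource
import sys


def peak_rss_bytes():
    """Peak resident set size of the current process, in bytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def peak_rss_growth(fn, *args):
    """Run fn(*args) and return (result, growth of the process peak in bytes).

    The growth is 0 whenever the call stays below a peak reached earlier in
    the process lifetime, so this only gives a lower bound on what fn used.
    """
    before = peak_rss_bytes()
    result = fn(*args)
    return result, peak_rss_bytes() - before


if __name__ == "__main__":
    _, growth = peak_rss_growth(bytearray, 64 * 1024 * 1024)
    print(f"peak RSS grew by {growth} bytes")
```

Because the peak never decreases, getting a per-operation measurement out of this would likely require running each operation in a fresh subprocess.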

Thank you for your interest!