sun-umn / PyGRANSO

PyGRANSO: A PyTorch-enabled port of GRANSO with auto-differentiation
https://ncvx.org
GNU Affero General Public License v3.0

Extra inputs arguments vs keeping things in options #14

Closed Buyun-Liang closed 2 years ago

Buyun-Liang commented 2 years ago

From Tim:

_Why are there extra inputs arguments (var_dim_map, nn_model, etc) as opposed to adding them to options?_

t-mitchell commented 2 years ago

If an input argument is helping to specify the optimization problem given by the user, then it probably makes the most sense to leave it as an input alongside combinedFunction etc. and not move it to options. It's also less work for you ;-)

So, provided I understand your Python code, that philosophy would mean that var_dim_map and nn_model should stay as inputs, while torch_device would be better placed as an option.

t-mitchell commented 2 years ago

Relatedly, since you mentioned you wanted to make the interface more compact, does it make sense to combine var_dim_map and nn_model into a single variable? Currently both are optional arguments with defaults of None, but the doc isn't clear about what happens if the user doesn't provide either. But it seems the user is supposed to provide one or the other, right? So you could consider combining them into a single required variable, e.g., var_spec, which could take on, say, the following different values:

  1. some positive integer giving the number of variables and then X_struct is just a vector (like GRANSO does)
  2. nn_model variable provided by user for neural-network mode
  3. some dictionary, e.g., {"x": [1,1]} or var_in = {"U": [5,10], "V": [10,20]}, for X_struct

If you prefer, it could also just do 2. or 3. and not have the 1. mode. For better consistency with granso(n, ...), I'd suggest also making var_spec the first input argument.
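A minimal sketch of how the three proposed var_spec forms might be normalized into one internal representation (the function name and dispatch logic here are hypothetical, not PyGRANSO's actual code):

```python
def normalize_var_spec(var_spec):
    """Hypothetical: accept the three proposed var_spec forms."""
    if isinstance(var_spec, int):        # 1. n -> a single vector variable "x"
        return {"x": [var_spec]}
    if isinstance(var_spec, dict):       # 3. {"name": [dims, ...]} used as-is
        return dict(var_spec)
    return var_spec                      # 2. assume a user-supplied nn_model
```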

But if you think this is all a bad idea or will cause issues for you down the road, feel free to pass! It's just a suggestion.

BTW, having var_dim_map and X_struct in order to relieve the user of having to separate variables and reshape them appropriately is a very nice idea.

Buyun-Liang commented 2 years ago

So, provided I understand your Python code, that philosophy would mean that var_dim_map and nn_model should stay as inputs, while torch_device would be better placed as an option.

Yes, torch_device will be added to options. user_data will be removed.

Relatedly, since you mentioned you wanted to make the interface more compact, does it make sense to combine var_dim_map and nn_model into a single variable? Currently both are optional arguments with defaults of none, but the doc isn't clear what happens if the user doesn't provide either. But it seems the user is supposed to provide one or the other, right? So you could consider combining them into a single required variable, e.g., var_spec, which could take on, say, the following different values: 1 ...;2...;3...

Agree. Will combine them into a single variable var_spec.

I'll do 2 and 3. But for 1, currently we can represent a vector as {"x": [n]}. I forgot to mention that the values in the dictionary can be a list of arbitrary integer length, e.g., [n] for vectors, [m,n] for matrices, [c,w,d] for image tensors used in deep learning problems. Thus I think 1 is not necessary and may confuse users.
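As a small illustration of such shape dictionaries (the example shapes here are invented), the total number of scalar optimization variables follows directly from the dimension lists:

```python
import math

# Hypothetical var_dim_map: shapes of any rank are allowed
var_in = {"x": [7], "W": [5, 10], "img": [3, 32, 32]}

# total number of scalar optimization variables: 7 + 50 + 3072
nvar = sum(math.prod(shape) for shape in var_in.values())
```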

I'm even working on removing 2, so that the user is able to represent nn_model with X_struct only. But for now, I'll keep both 2 and 3.

t-mitchell commented 2 years ago

I think it would be good to always keep option 2 for var_spec because it is so general and people may use the code for other things where nn_model might be unfamiliar and/or inconvenient to users.

As for option 1, I don't feel so strongly either way. But I will say that some users do like shortcuts for common cases to save typing, e.g., var_spec=5 could be a shortcut for var_spec={"x":[5]}.

Regarding the arbitrary length thing, I am not sure what you mean. For {"x":[n]}, doesn't n just have an assigned value that gets substituted in?

Buyun-Liang commented 2 years ago

Regarding the arbitrary length thing, I am not sure what you mean. For {"x":[n]}, doesn't n just have an assigned value that gets substituted in?

The tensor that specifies the variable shape can have arbitrary dimension. In deep learning, it's pretty common that the weights (the optimization variables) are a high-dimensional tensor like [n1,n2,n3,n4].

t-mitchell commented 2 years ago

Yes, I think I understood that from our earlier discussions, i.e., that each member of X_struct can be some arbitrarily dimensioned n-D array (or tensor in PyTorch) and you give those dimensions in var_dim_map for each key.

Buyun-Liang commented 2 years ago

Regarding input args of pygranso.py

torch_device was added to options. user_data was removed. var_dim_map and nn_model were combined into the required arg var_spec.

t-mitchell commented 2 years ago

Regarding the input arg to-dos:

Some other comments:

  1. The doc for var_spec is much better! Also, it is now clear that var_dim_map is always needed while nn_model is optional, which wasn't clear before.
  2. BTW, I am still unsure why objEvalFunction is needed in addition to combinedFunction. The docstring doesn't tell me that much. What are you using objEvalFunction for and how is it different from combinedFunction?
  3. Shouldn't pygransoOptionsAdvanced call pygransoOptions so that it actually works?

Buyun-Liang commented 2 years ago

I still suggest making var_spec the very first argument of pygranso since it is more consistent with granso(n,combinedFunction), as var_spec and n are analogues of each other.

You should add doc for the torch_device to pygransoOptions.py

Done

BTW, I am still unsure why objEvalFunction is needed in addition to combinedFunction. The docstring doesn't tell me that much. What are you using objEvalFunction for and how is it different from combinedFunction?

I noticed that GRANSO allows different formats of objective and constraint functions, e.g., in a separate or combined way. However, I kept combinedFunction as the only allowed form of input arg for pygranso, as I think combining both the objective and constraint functions is at least as efficient as the separate way. In certain cases where expensive calculations are shared between the objective and constraints, combined functions will be much more efficient.

As described above, the algorithm is not able to obtain objective and constraint values from combinedFunction without evaluating gradients. Since the backtracking line search doesn't need gradient information, objEvalFunction is used to improve efficiency: it can provide objective and constraint function values without evaluating gradients.
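As a generic illustration of this point (not PyGRANSO's actual code), an Armijo backtracking loop only ever consumes function values, never gradients, which is exactly what a value-only evaluation function can supply cheaply:

```python
def backtracking_armijo(phi, phi0, dphi0, t=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Generic Armijo backtracking sketch.

    phi(t) is the merit/penalty VALUE along the search direction; phi0 and
    dphi0 are its value and directional derivative at t = 0.  Only values
    of phi are needed inside the loop, so no gradient evaluation occurs.
    """
    for _ in range(max_iter):
        if phi(t) <= phi0 + c1 * t * dphi0:   # sufficient-decrease test
            return t
        t *= shrink                           # shrink the step and retry
    return t
```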

But we don't recommend users use it in this version, and currently no examples use this arg. It's just used for some constrained deep learning problems we're currently developing. I'm also thinking about how to remove or merge it.

shouldn't pygransoOptionsAdvanced call pygransoOptions so that it actually works?

From the GRANSO docstring:

    % This "advanced" version exists mostly just to break up the help
    % documentation so the user isn't overwhelmed with details on the more
    % advanced parameters which they may not immediately (or ever) need.

Why do we want to call pygransoOptions when pygransoOptionsAdvanced is not used?

t-mitchell commented 2 years ago

Re: GRANSO's two formats of either the single combined_fn or the separate obj_fn, ci_fn, ce_fn, yes, I too prefer the combined_fn particularly since some computations may be shared between the objective and constraint (like ex4 in GRANSO). But I decided to support both formats in GRANSO anyway because some users have problems that are coded up with separate functions, and so having the separate format makes things a bit easier for some users. But I think it's fine to only support combined_fn in PyGRANSO.

Re: objEvalFunction and other "not recommended" features, one easy option for PyGRANSO v1.0.0 is to just remove them as input arguments from pygranso and options in pygransoOptions and delete the corresponding descriptions in doc and the website. This way users will never know about this in-progress stuff in v1.0.0 that you don't recommend using yet, and users won't be confused by it either. And you can still leave all the internal code alone so it will be easy to add back these features in the new developer branch. But I understand if you and Ju prefer not to do this.

If you want to have a chat about possible ideas for removing objEvalFunction and/or disabling/enabling AD before the v1.0.0 release, I'd also be happy to do another Zoom call. Just let me know.

Why do we want to call pygransoOptions when pygransoOptionsAdvanced is not used?

Only because both option routines work in GRANSO (even though they do the same thing), and there's no real reason to break consistency; it should only be one line of code to make pygransoOptionsAdvanced be a wrapper for pygransoOptions. But I agree this isn't super important.
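The suggested one-liner might look like this (the body of pygransoOptions below is a stand-in for illustration only; the real routine validates the options and fills in defaults):

```python
def pygransoOptions(n, opts):
    # stand-in for the real routine, which validates opts and fills defaults
    opts = dict(opts or {})
    opts.setdefault("torch_device", "cpu")
    return opts

def pygransoOptionsAdvanced(n, opts):
    # the "advanced" routine exists only to split the help text,
    # so it can simply delegate: the one-line wrapper suggested above
    return pygransoOptions(n, opts)
```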

Buyun-Liang commented 2 years ago

Regarding objEvalFunction: can we merge objEvalFunction with combinedFunction so that there are two possible cases (just like what we did for var_spec):

1. fn = combinedFunction 
2. fn = [combinedFunction,objEvalFunction]

Only because both option routines work in GRANSO (even though they do the same thing), and there's no real reason to break consistency; it should only be one line of code to make pygransoOptionsAdvanced be a wrapper for pygransoOptions. But I agree this isn't super important.

Done

Buyun-Liang commented 2 years ago

As discussed with Ju before, we would like to leave disabling/enabling AD for a future release.

Buyun-Liang commented 2 years ago

I would be happy to have a Zoom call to summarize the things to do. Please let me know your available times.

t-mitchell commented 2 years ago

Regarding objEvalFunction: can we merge objEvalFunction with combinedFunction so that there are two possible cases (just like what we did for var_spec):

1. fn = combinedFunction 
2. fn = [combinedFunction,objEvalFunction]

That's also a decent option. But I think you may end up changing this interface a bit for a later release when you add the extra AD and gradient features. So my only real suggestion is to pick something that is both easy and easy to change later.

t-mitchell commented 2 years ago

I would be happy to have a Zoom call to summarize the things to do. Please let me know your available times.

If that would be helpful for you, we can do it, or we can just wait until you think things are done and I should review.

t-mitchell commented 2 years ago

As discussed with Ju before, we would like to leave disabling/enabling AD for a future release.

Yes, I know. My point was that this stuff may affect the interface, so it might be worth thinking about it now, even if you don't implement it. But if you don't mind changing the interface later, then it's not a big deal.

Buyun-Liang commented 2 years ago

Yes, I know. My point was that this stuff may affect the interface, so it might be worth thinking about it now, even if you don't implement it. But if you don't mind changing the interface later, then it's not a big deal.

Just talked with Ju; he agrees with you and thinks changing the interface frequently can be very disruptive for users.

So for the v1.0.0 release, I'll add the disabling/enabling AD part, and try to delete the objEvalFunction.

From Ju:

one possibility is to just follow the Manopt idea: if no analytic grad is specified, AD will be called; if the obj and constraint functions are objects, analytic grads are fields in the object

Buyun-Liang commented 2 years ago

If that would helpful for you, we can do it, or we can just wait until you think things are done and I should review.

OK, let's discuss after I get things done

t-mitchell commented 2 years ago

Yes, I know. My point was that this stuff may affect the interface, so it might be worth thinking about it now, even if you don't implement it. But if you don't mind changing the interface later, then it's not a big deal.

Just talked with Ju; he agrees with you and thinks changing the interface frequently can be very disruptive for users.

So for the v1.0.0 release, I'll add the disabling/enabling AD part, and try to delete the objEvalFunction.

I was wondering if enabling/disabling AD could be done via an internal wrapper function that calls combinedFunction. I don't know if this idea will work with PyTorch but maybe. So you might have something like [f,fg,ci,cig,ce,ceg] = wrapper(X_Struct,compute_grads) and depending on compute_grads, it would either enable or disable AD.
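A rough PyTorch sketch of this wrapper idea (the name ad_wrapper and the simplified return signature are assumptions for illustration; the real interface would also handle gradients of ci and ce separately):

```python
import torch

def ad_wrapper(comb_fn, x, compute_grads):
    # Sketch only: depending on compute_grads, either build a graph and
    # let autograd supply the gradient, or make a cheap value-only call.
    if compute_grads:
        x = x.detach().requires_grad_(True)   # fresh leaf so AD tracks x
        f, ci, ce = comb_fn(x)
        f.backward()                           # AD computes df/dx
        return f.detach(), x.grad, ci, ce
    with torch.no_grad():                      # value-only: no graph is built
        f, ci, ce = comb_fn(x)
    return f, None, ci, ce
```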

If you want to discuss ideas, I'd be happy to talk about this sometime tomorrow or the next day.

Buyun-Liang commented 2 years ago

I was wondering if enabling/disabling AD could be done via an internal wrapper function that calls combinedFunction. I don't know if this idea will work with PyTorch but maybe. So you might have something like [f,fg,ci,cig,ce,ceg] = wrapper(X_Struct,compute_grads) and depending on compute_grads, it would either enable or disable AD.

Yes, good idea. I'll try to implement this tonight and also update the objEvalFunction part. Then we can have a Zoom call tomorrow to summarize the things to do before the initial release. I'm available anytime after 10:00 GMT-6 (17:00 your time) tomorrow (Dec 25, Sat).

Buyun-Liang commented 2 years ago

Enabling/disabling AD is now supported. Please check line 190 in test_cuda.py or test_cpu.py for example usage.

Generally speaking, we have the following two types of user input:

  1. [f,ci,ce] = combinedFunction(X), the same as what we had before. f, ci, ce should be computational graphs on the optimization variables.
  2. [f,f_grad,ci,ci_grad,ce,ce_grad] = combinedFunction(X), GRANSO-style input; the shapes of the user-provided gradients should be the same as what is required in GRANSO.

Buyun-Liang commented 2 years ago

Merged combinedFunction and objEvalFunction. user_fn can be one of the following two values:

  1. combinedFunction
  2. [combinedFunction, objEvalFunction]: if backtracking line search has been enabled

Now, the pygranso.py interface becomes pygranso(var_spec,user_fn,user_opts=None)
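Internally, the two accepted user_fn forms could be unpacked with something like the following (the helper name and logic are hypothetical, not PyGRANSO's actual code):

```python
def unpack_user_fn(user_fn):
    # Hypothetical: accept either combinedFunction alone or a pair
    # [combinedFunction, objEvalFunction] for the value-only line search.
    if isinstance(user_fn, (list, tuple)):
        comb_fn, obj_eval_fn = user_fn
    else:
        comb_fn, obj_eval_fn = user_fn, None
    return comb_fn, obj_eval_fn
```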

t-mitchell commented 2 years ago

This all looks like great progress, but I haven't had a chance to dive into the code yet. It might be best to look at the code changes together when we Zoom.

Buyun-Liang commented 2 years ago

Update:

  1. renamed all dummy classes to pygransoStruct
  2. removed the requires_grad calls from the user-provided function
  3. updated the function used for obtaining the penalty function value without computing gradients (used in the backtracking line search). In addition, at the end of the backtracking line search, the function evaluateAtX (directly translated from GRANSO) will be called to update all information, including gradients, at the current iteration
  4. provided an example for partial AD; please check lines 203-227 in test_cuda.py. But it looks ugly as I haven't designed the interface for getCiGradVec. Also, the user has to detach all return values from the computational graphs

For now, all the features we discussed on Saturday have been implemented. Please let me know if there is anything else we want to add before the initial release.

t-mitchell commented 2 years ago

Some comments and possible todo's:

Regarding the partial AD example:

  1. It seems this example is computing all gradients via AD. Is that correct? If so, maybe we should call it manual AD and point out that this can be useful for mixing AD with explicit gradients. Or we can change it to such a hybrid example. Alternatively, since hybrid AD is for advanced users, we could just eliminate documenting hybrid AD in v1.0.0 (even though v1.0.0 supports it) and push hybrid doc and examples to a later release.
  2. For manual/hybrid AD, I think the user will have to do all the requires_grad and detach calls themselves. Having our code automatically handle a mix of AD and explicit gradients may actually be quite difficult for us to implement.

If it would be helpful to talk about the code, I do have time today to get on Zoom. Also, for the v1.0.0 release, it might be good for us to do one last walk through together, just to make sure there are no big issues we haven't noticed yet.

Buyun-Liang commented 2 years ago

It seems this example is computing all gradients via AD. Is that correct? If so, maybe we should call it manual AD and point out that this can be useful for mixing AD with explicit gradients. Or we can change it to such a hybrid example. Alternatively, since hybrid AD is for advanced users, we could just eliminate documenting hybrid AD in v1.0.0 (even though v1.0.0 supports it) and push hybrid doc and examples to a later release.

No, f_grad is explicitly provided by the user, while ce_grad is manual AD.

    def comb_fn(X_struct):
        q = X_struct.q
        q.requires_grad_(True)

        # objective function
        qtY = q.T @ Y
        f = 1/m * torch.norm(qtY, p = 1).item()
        f_grad = 1/m*Y@torch.sign(Y.T@q) # explicitly provided by user (analytical gradients)

        # inequality constraint, matrix form
        ci = None
        ci_grad = None

        # equality constraint 
        ce = q.T @ q - 1
        # manual AD
        ce_grad = getCiGradVec(nvar=n,nconstr_ci_total=1,var_dim_map=var_in,X=X_struct,ci_vec_torch=ce,torch_device=device,double_precision=torch.double) 

        # return value must be detached from the computational graph
        f_grad = f_grad.detach()
        ce = ce.detach()
        ce_grad = ce_grad.detach()

        return [f,f_grad,ci,ci_grad,ce,ce_grad]

Yes, I agree to leave the documentation for partial AD to a future release.

For manual/hybrid AD, I think the user will have to do all the requires_grad and detach calls themselves. Having our code automatically handle a mix of AD and explicit gradients may actually be quite difficult for us to implement.

Yes, that's what I'm doing in the code above: the user has to perform the requires_grad and detach calls themselves.

Buyun-Liang commented 2 years ago

When global AD is turned on, wouldn't it make the most sense to have the .detach calls also be in the tensor2vec.py wrapper so that overhead is eliminated as soon as possible?

When global AD is turned on (I would set this as the default, as most users want global AD), the user never needs to call detach. Example code for global AD:

    def comb_fn(X_struct):
        q = X_struct.q

        # objective function
        qtY = q.T @ Y
        f = 1/m * torch.norm(qtY, p = 1)

        # inequality constraint, matrix form
        ci = None

        # equality constraint
        ce = pygransoStruct()
        ce.c1 = q.T @ q - 1

        return [f,ci,ce]

t-mitchell commented 2 years ago

Regarding the hybrid example, my point was that by having q.requires_grad_(True) in the very beginning, doesn't that mean that any subsequent computation involving q has the AD overhead done in the background, even if you explicitly compute the gradients? But maybe I am wrong on this.

t-mitchell commented 2 years ago

When global AD is turned on (I would set this as the default, as most users want global AD), the user never needs to call detach. Example code for global AD:

Yes, global AD should be enabled by default.

t-mitchell commented 2 years ago

When global AD is turned on, wouldn't it make the most sense to have the .detach calls also be in the tensor2vec.py wrapper so that overhead is eliminated as soon as possible?

When global AD is turned on (I would set this as the default, as most users want global AD), the user never needs to call detach. Example code for global AD:

    def comb_fn(X_struct):
        q = X_struct.q

        # objective function
        qtY = q.T @ Y
        f = 1/m * torch.norm(qtY, p = 1)

        # inequality constraint, matrix form
        ci = None

        # equality constraint
        ce = pygransoStruct()
        ce.c1 = q.T @ q - 1

        return [f,ci,ce]

This may be something I don't yet understand with your port, but why is pygransoStruct and ce.c1 needed here? Why not just ce = q.T @ q - 1?

Buyun-Liang commented 2 years ago

Regarding the hybrid example, my point was that by having q.requires_grad_(True) in the very beginning, doesn't that mean that any subsequent computation involving q has the AD overhead done in the background, even if you explicitly compute the gradients? But maybe I am wrong on this.

We can do that: before calculating the analytical gradient f_grad = 1/m*Y@torch.sign(Y.T@q), the user can add q_tmp = q.detach().clone() and use q_tmp for the calculation to avoid the overhead.
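A minimal demonstration of that detach().clone() trick (standalone toy values, not the example above):

```python
import torch

q = torch.randn(5, requires_grad=True)   # variable tracked by autograd
q_tmp = q.detach().clone()               # same values, but no graph attached

y = (q_tmp ** 2).sum()                   # computed with zero AD overhead
assert not y.requires_grad               # nothing here is being tracked
```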

Buyun-Liang commented 2 years ago

This may be something I don't yet understand with your port, but why is pygransoStruct and ce.c1 needed here? Why not just ce = q.T @ q - 1?

Some examples may have multiple constraints, as we are not asking the user to provide all constraints in one big matrix:

ce.c1 = ...
ce.c2 = ...
ce.c3 = ...
...

t-mitchell commented 2 years ago

Regarding the hybrid example, my point was that by having q.requires_grad_(True) in the very beginning, doesn't that mean that any subsequent computation involving q has the AD overhead done in the background, even if you explicitly compute the gradients? But maybe I am wrong on this.

We can do that: before calculating the analytical gradient f_grad = 1/m*Y@torch.sign(Y.T@q), the user can add q_tmp = q.detach().clone() and use q_tmp for the calculation to avoid the overhead.

Yes, that is exactly what I had in mind when we were on Zoom and I sketched out the rough idea for fixing the gradient stuff. If you update the example like this, with some comments that explain why this is a key step, then it is fine to leave the hybrid example in the documentation. But I'd leave doing this to the very end because other issues are more important.

t-mitchell commented 2 years ago

This may be something I don't yet understand with your port, but why is pygransoStruct and ce.c1 needed here? Why not just ce = q.T @ q - 1?

Some examples may have multiple constraints, as we are not asking the user to provide all constraints in one big matrix:

ce.c1 = ...
ce.c2 = ...
ce.c3 = ...
...

Ah, I see. So this is a place where you've deviated from the GRANSO interface. Was this change necessary, or was it just something you and Ju preferred? Does each field, say, .c1, have to be a scalar value, or can .c1 still be a vector of individual constraints?

Buyun-Liang commented 2 years ago

Some examples may have multiple constraints, as we are not asking the user to provide all constraints in one big matrix:

ce.c1 = ...
ce.c2 = ...
ce.c3 = ...
...

For this one, I have an example to explain the importance of allowing multiple constraints:

The obj is f(X,Y,Z); X, Y, Z are all matrices.

ce.c1 = c1(X)
ce.c2 = c2(Y)
ce.c3 = c3(Z)

Buyun-Liang commented 2 years ago

Ah, I see. So this is a place where you've deviated from the GRANSO interface. Was this change necessary, or was it just something you and Ju preferred? Does each field, say, .c1, have to be a scalar value, or can .c1 still be a vector of individual constraints?

It's useful in many machine/deep learning problems, where each constraint is only on part of the variables and multiple constraints are involved. See the simple example above.

.c1 can be a scalar, vector, matrix, or even higher-order tensor, as PyGRANSO treats each individual element in .c1 as one constraint.
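A small illustration of this counting convention (the shapes are invented, and this is not PyGRANSO's internal counting code):

```python
import torch

class pygransoStruct:
    pass   # simple attribute container, as used in PyGRANSO

ce = pygransoStruct()
ce.c1 = torch.zeros(4)       # a vector: 4 constraints at once (e.g., Ax - b)
ce.c2 = torch.zeros(2, 3)    # a matrix: each of its 6 entries is a constraint

# every scalar element across all fields counts as one constraint
n_constraints = sum(t.numel() for t in vars(ce).values())
```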

t-mitchell commented 2 years ago

Hehe, you already clarified what I was just about to ask you again. ;-) Re: the ce.c1 stuff, I was just worried that it might only support scalars, which would then make something like ce.c1 = Ax - b not possible. If that were the case, then each inequality constraint would have to be entered manually as .c1, .c2, etc, which would be very tedious for users. Glad to hear that it supports any sized tensor. I agree this seems like a nice interface for users and not just for ML people.

BTW, GRANSO does not use a diagonal matrix for the constraints. It uses a vector for the constraint values and then a matrix for the corresponding gradients, where the kth column is the gradient for the kth constraint.

Buyun-Liang commented 2 years ago

BTW, GRANSO does not use a diagonal matrix for the constraints. It uses a vector for the constraint values and then a matrix for the corresponding gradients, where the kth column is the gradient for the kth constraint.

Yes, I edited the comment above, as I realized a block diagonal matrix is used in ce_grad when we have this kind of constraint:

ce.c1 = c1(X)
ce.c2 = c2(Y)
ce.c3 = c3(Z)
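A sketch of why the stacked gradient matrix becomes block diagonal in this case (sizes invented for illustration):

```python
import torch

# Illustrative sizes: c1 has 3 constraints on a 6-dim vec(X),
# c2 has 2 constraints on an 8-dim vec(Y).  Following GRANSO's
# convention, column k holds the gradient of the kth constraint.
J1 = torch.ones(6, 3)          # gradients of c1 w.r.t. vec(X)
J2 = torch.ones(8, 2)          # gradients of c2 w.r.t. vec(Y)

# Since c1 doesn't touch Y and c2 doesn't touch X, stacking the
# variables makes the full gradient matrix block diagonal:
J = torch.block_diag(J1, J2)   # shape (6+8, 3+2) = (14, 5)
```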

Buyun-Liang commented 2 years ago

Since the user still has to call pygransoStruct.py, I think it should be moved back to the main directory, as otherwise, users likely won't easily see it in the private folder.

Done

Doesn't using a try in tensor2vec.py mean that Python has to execute the entire body of combinedFunction before it figures out whether it returns 3 or 6 output arguments? If so, this means that the code is unnecessarily evaluating all the functions (and ADing the gradients) twice per value of X_struct, I think. Unlike MATLAB, I don't think Python has an nargout equivalent for getting the number of output arguments of a function, so maybe it is better to have a global AD on/off option in pygransoOptions, which will then also determine whether combinedFunction must return 3 or 6 output arguments. Then we can replace the try with an if clause and avoid doubling the computation.

Done. globalAD was added to the options.

When global AD is turned on, wouldn't it make the most sense to have the .detach calls also be in the tensor2vec.py wrapper so that overhead is eliminated as soon as possible?

Done. detach() is now called immediately after the AD.

I guess you can add my name after yours to the copyright for tensor2vec.py since I also had input in its design, but this is not so important.

Done

Buyun-Liang commented 2 years ago

Is there anything else you have in mind that we need to update? If not, we can have a Zoom meeting to go through all the code.

t-mitchell commented 2 years ago

For the code, I don't think so, but it would be good to do a Zoom meeting to walk through the code together.

I guess the website will need a lot of updates with all the license, citation, and code changes. I can also help review that stuff if you let me know where to look (not sure if you have that stuff in a private repo).

The examples in the paper will also need updating but now they'll be simpler. ;-)

Buyun-Liang commented 2 years ago

Yes, my plan is to finalize the source code first and then update the examples, website, and paper, since we want to make sure the interface doesn't change.

I'm available almost anytime today and tomorrow. Please let me know your available time.

t-mitchell commented 2 years ago

How about now?

Buyun-Liang commented 2 years ago

Sure. Invitation sent

Buyun-Liang commented 2 years ago

Updated the source code (except the example folder) as we discussed in the meeting.