Closed Buyun-Liang closed 2 years ago
If an input argument is helping to specify the optimization problem given by the user, then it probably makes the most sense to leave it as an input alongside `combinedFunction` etc. and not move it to options. It's also less work for you ;-)
So, provided I understand your Python code, that philosophy would mean that `var_dim_map` and `nn_model` should stay as inputs, while `torch_device` would be better placed as an option.
Relatedly, since you mentioned you wanted to make the interface more compact, does it make sense to combine `var_dim_map` and `nn_model` into a single variable? Currently both are optional arguments with defaults of `None`, but the doc isn't clear about what happens if the user doesn't provide either. But it seems the user is supposed to provide one or the other, right? So you could consider combining them into a single required variable, e.g., `var_spec`, which could take on, say, the following different values:

1. an integer `n`, so `X_struct` is just a vector (like GRANSO does)
2. an `nn_model` variable provided by the user for neural-network mode
3. a dict of names and dimensions for `X_struct`, e.g., `{"x": [1,1]}` or `var_in = {"U": [5,10], "V": [10,20]}`

If you prefer, it could also just do 2. or 3. and not have the 1. mode. For better consistency with `granso(n, ...)`, I'd suggest also making `var_spec` the first input argument.
But if you think this is all a bad idea or will cause issues for you down the road, feel free to pass! It's just a suggestion.
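To make the proposal concrete, a normalization step could dispatch on the type of `var_spec`. This is a hypothetical sketch; `normalize_var_spec` is an illustrative name, not actual PyGRANSO code:

```python
# Hypothetical sketch of dispatching on the proposed var_spec argument.
# normalize_var_spec is an illustrative name, not part of PyGRANSO.
def normalize_var_spec(var_spec):
    if isinstance(var_spec, int):
        # mode 1: integer shortcut, analogous to granso(n, ...)
        return {"x": [var_spec]}
    if isinstance(var_spec, dict):
        # mode 3: dict mapping variable names to dimension lists
        return var_spec
    # mode 2: assume an nn_model-like object and pass it through
    return var_spec
```

For example, `normalize_var_spec(5)` would yield `{"x": [5]}`, and a dict like `{"U": [5, 10], "V": [10, 20]}` would pass through unchanged.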
(BTW, having `var_dim_map` and `X_struct` in order to relieve the user of having to separate variables and reshape them appropriately is a very nice idea.)
> So, provided I understand your Python code, that philosophy would mean that var_dim_map and nn_model should stay as inputs, while torch_device would be better placed as an option.
Yes, `torch_device` will be added into options. `user_data` will be removed.
> Relatedly, since you mentioned you wanted to make the interface more compact, does it make sense to combine var_dim_map and nn_model into a single variable? Currently both are optional arguments with defaults of none, but the doc isn't clear what happens if the user doesn't provide either. But it seems the user is supposed to provide one or the other, right? So you could consider combining them into a single required variable, e.g., var_spec, which could take on, say, the following different values: 1 ...; 2 ...; 3 ...
Agree. Will combine them into a single variable var_spec.
I'll do 2 and 3. But for 1, currently we can represent a vector as `{"x": [n]}`. I forgot to mention that the values in the dictionary can be a list of arbitrary integer length, e.g., `[n]` for a vector, `[m,n]` for matrices, `[c,w,d]` for an image tensor used in deep learning problems. Thus I think 1 is not necessary and may confuse users.
I'm even working on removing 2, so that the user is able to represent `nn_model` with `X_struct` only. But for now, I'll keep both 2 and 3.
I think it would be good to always keep option 2 for `var_spec` because it is so general and people may use the code for other things where `nn_model` might be unfamiliar and/or inconvenient to users.
As for option 1, I don't feel so strongly either way. But I will say that some users do like shortcuts for common cases to save typing, e.g., `var_spec=5` could be a shortcut for `var_spec={"x":[5]}`.
Regarding the arbitrary length thing, I am not sure what you mean. For `{"x":[n]}`, doesn't `n` just have an assigned value that gets substituted in here?
> Regarding the arbitrary length thing, I am not sure what you mean. For {"x":[n]}, doesn't n just have an assigned value here that just gets substituted in here?
The tensor that specifies the variable shape can have arbitrary dimension. In deep learning, it's pretty common that the weights (the optimization variables) are a high-dimensional tensor like `[n1,n2,n3,n4]`.
Yes, I think I understood that from our earlier discussions, i.e., that each member of `X_struct` can be some arbitrarily dimensioned n-D array (or tensor in PyTorch) and you give those dimensions in `var_dim_map` for each key.
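As a rough illustration of what `var_dim_map` enables, one flat optimization vector can be split into named blocks, one per key, each with an arbitrary dimension list. This is a sketch with a made-up helper name, not PyGRANSO's actual reshaping code; in the real code each block would additionally be reshaped to its n-D tensor shape:

```python
from functools import reduce
from operator import mul

# Illustrative sketch (hypothetical helper, not PyGRANSO code): split one flat
# vector into named blocks according to a var_dim_map-style dict.
def split_by_dim_map(x_flat, var_dim_map):
    blocks, offset = {}, 0
    for name, dims in var_dim_map.items():
        size = reduce(mul, dims, 1)  # e.g. [5, 10] -> 50 entries
        blocks[name] = {"data": x_flat[offset:offset + size], "shape": dims}
        offset += size
    assert offset == len(x_flat), "flat vector length must match total size"
    return blocks

var_in = {"U": [2, 3], "V": [3]}
blocks = split_by_dim_map(list(range(9)), var_in)  # "U" gets 6 entries, "V" gets 3
```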
Regarding input args of `pygranso.py`:

- `torch_device` was added into options.
- `user_data` was removed.
- `var_dim_map` and `nn_model` were combined into the required arg `var_spec`.
Regarding the input-arg to-dos:

- I still suggest making `var_spec` the very first argument of `pygranso`, since it is more consistent with `granso(n, combinedFunction)`: `var_spec` and `n` are analogues of each other.
- You should add doc for `torch_device` to `pygransoOptions.py`.
Some other comments:

- `var_spec` is much better! Also, it is now clear that `var_dim_map` is always needed while `nn_model` is optional, which wasn't clear before.
- I am still unsure why `objEvalFunction` is needed in addition to `combinedFunction`. The docstring doesn't tell me that much. What are you using `objEvalFunction` for, and how is it different from `combinedFunction`?
- Shouldn't `pygransoOptionsAdvanced` call `pygransoOptions` so that it actually works?

> I still suggest making var_spec the very first argument of pygranso since it is more consistent with granso(n,combinedFunction), since var_spec and n are analogues of each other.
> You should add doc for the torch_device to pygransoOptions.py
Done
> BTW, I am still unsure why objEvalFunction is needed in addition to combinedFunction. The docstring doesn't tell me that much. What are you using objEvalFunction for and how is it different from combinedFunction?
I noticed that GRANSO allows different formats of the objective and constraint functions, e.g., separate or combined. However, I keep `combinedFunction` as the only allowed form of input arg for pygranso, as I think combining both the objective and constraint functions is at least as efficient as the separate way. In certain cases where expensive calculations are shared by both obj and const, combined functions will be much more efficient.
As described above, the algorithm is not able to obtain objective and constraint values from `combinedFunction` without evaluating gradients. Since backtracking line search doesn't need the gradient information, `objEvalFunction` is used to improve efficiency: it can provide objective and constraint function values without evaluating gradients.

But we don't recommend users use it in this version, and currently no examples use this arg. It's just used for some constrained deep learning problems we're currently developing. I'm also thinking about how to remove or merge it.
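The line-search point can be seen in a toy backtracking loop: only function values are compared inside the loop, so a value-only evaluator avoids gradient work at every trial step. This is a generic sketch, not PyGRANSO's actual line search:

```python
# Toy backtracking (Armijo) line search: note that only function VALUES are
# needed inside the loop, which is why a value-only evaluator like
# objEvalFunction can save gradient computations. Generic sketch, not PyGRANSO.
def backtracking(f, x, d, g_dot_d, t=1.0, c=1e-4, shrink=0.5):
    f0 = f(x)
    while f([xi + t * di for xi, di in zip(x, d)]) > f0 + c * t * g_dot_d:
        t *= shrink  # shrink the step; no gradients evaluated here
    return t

f = lambda x: sum(xi * xi for xi in x)       # smooth toy objective
step = backtracking(f, [1.0], [-2.0], -4.0)  # d = -grad f([1.0]), so g.d = -4
```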
> Shouldn't pygransoOptionsAdvanced call pygransoOptions so that it actually works?
From the GRANSO docstring:

```matlab
% This "advanced" version exists mostly just to break up the help
% documentation so the user isn't overwhelmed with details on the more
% advanced parameters which they may not immediately (or ever) need.
```
Why do we want to call `pygransoOptions` when `pygransoOptionsAdvanced` is not used?
Re: GRANSO's two formats of either the single `combined_fn` or the separate `obj_fn`, `ci_fn`, `ce_fn`: yes, I too prefer the `combined_fn`, particularly since some computations may be shared between the objective and constraints (like ex4 in GRANSO). But I decided to support both formats in GRANSO anyway because some users have problems that are coded up with separate functions, and so having the separate format makes things a bit easier for some users. But I think it's fine to only support `combined_fn` in PyGRANSO.
Re: `objEvalFunction` and other "not recommended" features, one easy option for PyGRANSO v1.0.0 is to just remove them as input arguments from `pygranso` and options in `pygransoOptions` and delete the corresponding descriptions in the doc and the website. This way users will never know about this in-progress stuff in v1.0.0 that you don't recommend using yet, and users won't be confused by it either. And you can still leave all the internal code alone so it will be easy to add back these features in the new developer branch. But I understand if you and Ju prefer not to do this.
If you want to have a chat about possible ideas for removing `objEvalFunction` and/or disabling/enabling AD before the v1.0.0 release, I'd also be happy to do another Zoom call. Just let me know.
> Why do we want to call pygransoOptions when pygransoOptionsAdvanced is not used?
Only because both option routines work in GRANSO (even though they do the same thing), and there's no real reason to break consistency; it should only be one line of code to make `pygransoOptionsAdvanced` a wrapper for `pygransoOptions`. But I agree this isn't super important.
Regarding `objEvalFunction`: can we merge `objEvalFunction` with `combinedFunction` so that there are two possible cases (just like what we did for `var_spec`):

1. `fn = combinedFunction`
2. `fn = [combinedFunction, objEvalFunction]`
> Only because both option routines work in GRANSO (even though they do the same thing), and there's no real reason to break consistency; it should only be one line of code to make pygransoOptionsAdvanced be a wrapper for pygransoOptions. But I agree this isn't super important.
Done
As discussed with Ju before, we would like to leave disabling/enabling AD to a future release.
I would be happy to have a Zoom call to summarize things to do. Please let me know your available times.
> Regarding objEvalFunction: can we merge objEvalFunction with combinedFunction so that there are two possible cases (just like what we did for var_spec): 1. fn = combinedFunction 2. fn = [combinedFunction, objEvalFunction]
That's also a decent option. But I think you may end up changing this interface a bit for a later release when you add the extra AD and gradient features. So my only real suggestion is to pick something that is both easy and easy to change later.
> I would be happy to have a Zoom call to summarize things to do. Please let me know your available time
If that would be helpful for you, we can do it, or we can just wait until you think things are done and I should review.
> As talked with Ju before, we would like to leave disabling/enabling AD in future release.
Yes, I know. My point was that this stuff may affect the interface, so it might be worth thinking about it now, even if you don't implement it. But if you don't mind changing the interface later, then it's not a big deal.
> Yes, I know. My point was that this stuff may affect the interface, so it might be worth thinking about it now, even if you don't implement it. But if you don't mind changing the interface later, then it's not a big deal.
I just talked with Ju; he agrees with you and thinks changing the interface frequently can be very disruptive for users.
So for the v1.0.0 release, I'll add the disabling/enabling AD part, and try to delete `objEvalFunction`.
From Ju: one possibility is to just follow the Manopt idea: if no analytic gradient is specified, AD will be called; if the obj and constraint functions are objects, the analytic grads are fields in the object.
> If that would helpful for you, we can do it, or we can just wait until you think things are done and I should review.
OK, let's discuss after I get things done
> Yes, I know. My point was that this stuff may affect the interface, so it might be worth thinking about it now, even if you don't implement it. But if you don't mind changing the interface later, then it's not a big deal.

> Just talked with Ju, he agrees with you and thinks changing interface frequently can be very disturbing for users. So for the v1.0.0 release, I'll add the disabling/enabling AD part, and try to delete the objEvalFunction.
I was wondering if enabling/disabling AD could be done via an internal wrapper function that calls `combinedFunction`. I don't know if this idea will work with PyTorch, but maybe. So you might have something like `[f,fg,ci,cig,ce,ceg] = wrapper(X_struct,compute_grads)` and, depending on `compute_grads`, it would either enable or disable AD.
If you want to discuss ideas, I'd be happy to talk about this sometime tomorrow or the next day.
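One way the wrapper idea might be shaped is sketched below in plain Python; with PyTorch, the two branches would typically be wrapped in `torch.enable_grad()` and `torch.no_grad()` contexts. All names here are hypothetical, and a flag stands in for the AD machinery:

```python
# Plain-Python sketch of the proposed internal wrapper (hypothetical names).
# With PyTorch, compute_grads would decide between torch.enable_grad() and
# torch.no_grad(); here it just decides whether gradient slots are returned.
def make_wrapper(combined_fn):
    def wrapper(x_struct, compute_grads):
        f, ci, ce = combined_fn(x_struct)
        if not compute_grads:
            return f, ci, ce  # cheap value-only path, e.g. for line search
        # with AD enabled, autograd would populate these gradient slots
        grads = {"f_grad": None, "ci_grad": None, "ce_grad": None}
        return f, ci, ce, grads
    return wrapper

w = make_wrapper(lambda x: (x * x, None, None))  # toy unconstrained problem
```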
> I was wondering if enabling/disabling AD could be done via an internal wrapper function that calls combinedFunction. I don't know if this idea will work with PyTorch but maybe. So you might have something like [f,fg,ci,cig,ce,ceg] = wrapper(X_Struct,compute_grads) and depending on compute_grads, it would either enable or disable AD.
Yes, good idea. I'll try to implement this tonight, and also update the `objEvalFunction` part. Then we can have a Zoom call tomorrow to summarize things to do before the initial release. I'm available anytime after 10:00 GMT-6 (17:00 your time) tomorrow (Dec 25, Sat).
Enabling/disabling AD is now allowed. Please check line 190 in test_cuda.py or test_cpu.py for example usage.

Generally speaking, we have the following two types of user input:

1. `[f,ci,ce] = combinedFunction(X)`, same as what we had before. `f`, `ci`, `ce` should be computational graphs on the optimization variables.
2. `[f,f_grad,ci,ci_grad,ce,ce_grad] = combinedFunction(X)`, GRANSO-style input; the shapes of the user-provided gradients should be the same as required in GRANSO.

Merged `combinedFunction` and `objEvalFunction`. `user_fn` can be one of the following two values:

1. `combinedFunction`
2. `[combinedFunction, objEvalFunction]`: if backtracking line search has been enabled

Now, the `pygranso.py` interface becomes `pygranso(var_spec, user_fn, user_opts=None)`.
This all looks like great progress, but I haven't had a chance to dive into the code yet. It might be best to look at the code changes together when we Zoom.
Update:

- `pygransoStruct`
- `require_grad` from the user-provided function
- `evaluateAtX` (directly translated from GRANSO) will be called to update all information, including gradients, at the current iteration
- `test_cuda.py`. But it looks ugly, as I haven't designed the interface for `getCiGradVec`. Also, the user has to detach all return values from computational graphs.

For now, all features we discussed on Saturday have been implemented. Please let me know if there is anything else we want to add before the initial release.
Some comments and possible to-dos:

- Since the user still has to call `pygransoStruct.py`, I think it should be moved back to the main directory, as otherwise users likely won't easily see it in the private folder.
- Doesn't using a `try` in `tensor2vec.py` mean that Python has to execute the entire body of `combinedFunction` before it figures out whether it returns 3 or 6 output arguments? If so, this means that the code is unnecessarily evaluating all the functions (and ADing the gradients) twice per each value of `X_struct`, I think. Unlike MATLAB, I don't think Python has an `nargout` equivalent for getting the number of output arguments of a function, so maybe it is better to have a global AD on/off option in `pygransoOptions`, which will then also determine whether `combinedFunction` must return 3 or 6 output arguments. Then we can replace the `try` with an `if` clause and avoid doubling the computation.
- When global AD is turned on, wouldn't it make the most sense to have the `.detach` calls also be in the `tensor2vec.py` wrapper so that overhead is eliminated as soon as possible?
- I guess you can add my name after yours to the copyright for `tensor2vec.py` since I also had input in its design, but this is not so important.

Regarding the partial AD example:

- For manual/hybrid AD, I think the user will have to do all the `requires_grad` and `detach` calls themselves. Having our code automatically handle a mix of AD and explicit gradients may actually be quite difficult for us to implement.
- It seems this example is computing all gradients via AD. Is that correct? If so, maybe we should call it manual AD and point out that this can be useful for mixing AD with explicit gradients. Or we can change it to such a hybrid example. Alternatively, since hybrid AD is for advanced users, we could just eliminate documenting hybrid AD in v1.0.0 (even though v1.0.0 supports it) and push the hybrid doc and examples to a later release.

If it would be helpful to talk about the code, I do have time today to get on Zoom. Also, for the v1.0.0 release, it might be good for us to do one last walk-through together, just to make sure there are no big issues we haven't noticed yet.
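The try-vs-if point above can be sketched as follows. The `evaluate` helper and option name are illustrative, not PyGRANSO's actual code; the idea is only that an up-front option check avoids probing the function by running it:

```python
# Sketch of replacing the try/except probe with an explicit option check:
# a global AD option decides up front whether combinedFunction returns 3
# outputs (values only; gradients via AD) or 6 (user-supplied gradients).
# Illustrative names, not PyGRANSO's actual code.
def evaluate(combined_fn, x_struct, global_ad):
    if global_ad:
        f, ci, ce = combined_fn(x_struct)  # 3 outputs; AD would supply gradients
        return f, ci, ce
    f, f_grad, ci, ci_grad, ce, ce_grad = combined_fn(x_struct)  # 6 outputs
    return f, ci, ce, (f_grad, ci_grad, ce_grad)
```

With this dispatch, `combined_fn` is called exactly once per evaluation point, instead of once inside a failed `try` and again in the `except` branch.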
No, `f_grad` is explicitly provided by the user, while `ce_grad` is manual AD.
```python
def comb_fn(X_struct):
    q = X_struct.q
    q.requires_grad_(True)
    # objective function
    qtY = q.T @ Y
    f = 1/m * torch.norm(qtY, p = 1).item()
    f_grad = 1/m*Y@torch.sign(Y.T@q)  # explicitly provided by user (analytical gradient)
    # inequality constraint, matrix form
    ci = None
    ci_grad = None
    # equality constraint
    ce = q.T @ q - 1
    # manual AD
    ce_grad = getCiGradVec(nvar=n,nconstr_ci_total=1,var_dim_map=var_in,X=X_struct,ci_vec_torch=ce,torch_device=device,double_precision=torch.double)
    # return values must be detached from the computational graph
    f_grad = f_grad.detach()
    ce = ce.detach()
    ce_grad = ce_grad.detach()
    return [f,f_grad,ci,ci_grad,ce,ce_grad]
```
Yes, I agree to leave the documentation for partial AD to a future release.
> For manual/hybrid AD, I think the user will have to do all the requires_grad and detach calls themselves. Having our code automatically handle a mix of AD and explicit gradients may actually be quite difficult for us to implement.
Yes, that's what I'm doing in the code above: the user has to perform the `requires_grad` and `detach` calls themselves.
> When global AD is turned on, wouldn't it make the most sense to have the .detach calls also be in the tensor2vec.py wrapper so that overhead is eliminated as soon as possible?
When global AD is turned on (I would set this as the default, as most users want global AD), the user never needs to call `detach`. Example code for global AD:
```python
def comb_fn(X_struct):
    q = X_struct.q
    # objective function
    qtY = q.T @ Y
    f = 1/m * torch.norm(qtY, p = 1)
    # inequality constraint, matrix form
    ci = None
    # equality constraint
    ce = pygransoStruct()
    ce.c1 = q.T @ q - 1
    return [f,ci,ce]
```
Regarding the hybrid example, my point was that by having `q.requires_grad_(True)` at the very beginning, doesn't that mean that any subsequent computation involving `q` has the AD overhead done in the background, even if you explicitly compute the gradients? But maybe I am wrong on this.
> When the global AD is turned on (I would set this as default as most user want global AD), the user never need to call detach. Example code for global AD:
Yes, global AD should be enabled by default.
> When global AD is turned on, wouldn't it make the most sense to have the .detach calls also be in the tensor2vec.py wrapper so that overhead is eliminated as soon as possible?

> When the global AD is turned on (I would set this as default as most user want global AD), the user never need to call detach. Example code for global AD:
> `def comb_fn(X_struct): q = X_struct.q # objective function qtY = q.T @ Y f = 1/m * torch.norm(qtY, p = 1) # inequality constraint, matrix form ci = None # equality constraint ce = pygransoStruct() ce.c1 = q.T @ q - 1 return [f,ci,ce]`
This may be something I don't yet understand with your port, but why are `pygransoStruct` and `ce.c1` needed here? Why not just `ce = q.T @ q - 1`?
> Regarding the hybrid example, my point was that by having q.requires_grad_(True) in the very beginning, doesn't that mean that any subsequent computation involving q has the AD overhead done in the background, even if you explicitly compute the gradients? But maybe I am wrong on this.
We can do that: before calculating the analytical gradient `f_grad = 1/m*Y@torch.sign(Y.T@q)`, the user can add `q_tmp = q.detach().clone()` and use `q_tmp` for the calculation to avoid the overhead.
> This may be something I don't yet understand with your port, but why is pygransoStruct and ce.c1 needed here? Why not just ce = q.T @ q - 1?
Some examples may have multiple constraints, as we are not asking the user to provide all constraints in one big matrix:

```python
ce.c1 = ...
ce.c2 = ...
ce.c3 = ...
...
```
> Regarding the hybrid example, my point was that by having q.requires_grad_(True) in the very beginning, doesn't that mean that any subsequent computation involving q has the AD overhead done in the background, even if you explicitly compute the gradients? But maybe I am wrong on this.

> We can do that: before calculating the analytical gradient f_grad = 1/m*Y@torch.sign(Y.T@q), the user can add q_tmp = q.detach().clone() and use q_tmp for the calculation to avoid overhead
Yes, that is exactly what I had in mind when we were on Zoom and I sketched out the rough idea for fixing the gradient stuff. If you update the example like this, with some comments that explain why this is a key step, then it is fine to leave the hybrid example in the documentation. But I'd leave doing this to the very end because other issues are more important.
> This may be something I don't yet understand with your port, but why is pygransoStruct and ce.c1 needed here? Why not just ce = q.T @ q - 1?

> Some examples may have multiple constraints as we are not asking user to provide all constraints in a big matrix: ce.c1 = ... ce.c2 = ... ce.c3 = ... ...
Ah, I see. So this is a place where you've deviated from the GRANSO interface. Was this change necessary, or was it just something you and Ju preferred? Does each field, say, `.c1`, have to be a scalar value, or can `.c1` still be a vector of individual constraints?
> Some examples may have multiple constraints as we are not asking user to provide all constraints in a big matrix: ce.c1 = ... ce.c2 = ... ce.c3 = ... ...
For this one, I have an example to explain the importance of allowing multiple constraints: the objective is `f(X,Y,Z)`, where `X`, `Y`, `Z` are all matrices:

```python
ce.c1 = c1(X)
ce.c2 = c2(Y)
ce.c3 = c3(Z)
```
> Ah, I see. So this is a place where you've deviated from the GRANSO interface. Was this change necessary or it was just something you and Ju preferred? Does each field, say, .c1, have to be a scalar value or can .c1 still be a vector of individual constraints?
It's useful in many machine/deep learning problems, where each constraint is only on part of the variables and multiple constraints are involved. See the simple example above.

`.c1` can be a scalar, vector, matrix, or even a higher-order tensor, as PyGRANSO treats each individual element in `.c1` as one constraint.
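That element-counting convention can be sketched with a small hypothetical helper: all fields of the struct are flattened and concatenated into one constraint vector. PyGRANSO's real code works on tensors; plain lists stand in for them here:

```python
# Sketch of the convention that every element of each field counts as one
# constraint: fields like c1, c2, ... are flattened and concatenated into a
# single constraint vector. Hypothetical helper; real code uses tensors.
def flatten_constraints(ce_fields):
    vec = []
    for name in sorted(ce_fields):  # fields "c1", "c2", ...
        vals = ce_fields[name]
        vec.extend(vals if isinstance(vals, list) else [vals])
    return vec

flatten_constraints({"c1": [0.1, -0.2], "c2": 0.3})  # 3 constraints total
```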
Hehe, you already clarified what I was just about to ask you again. ;-) Re: the `ce.c1` stuff, I was just worried that it might only support scalars, which would then make something like `ce.c1 = Ax - b` not possible. If that were the case, then each inequality constraint would have to be entered manually as `.c1`, `.c2`, etc., which would be very tedious for users. Glad to hear that it supports any sized tensor. I agree this seems like a nice interface for users and not just for ML people.
BTW, GRANSO does not use a diagonal matrix for the constraints. It uses a vector for the constraint values and then a matrix for the corresponding gradients, where the kth column is the gradient for the kth constraint.
> BTW, GRANSO does not use a diagonal matrix for the constraints. It uses a vector for the constraint values and then a matrix for the corresponding gradients, where the kth column is the gradient for the kth constraint.
Yes, I edited the comment above, as I realized a block diagonal matrix is used for `ce_grad` when we have this kind of constraint:

```python
ce.c1 = c1(X)
ce.c2 = c2(Y)
ce.c3 = c3(Z)
```
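The block diagonal structure can be seen in a tiny plain-list sketch: when `c1` depends only on `X` and `c2` only on `Y`, the gradient matrix (rows indexed by all variables, one column per constraint) has exact zeros off the blocks. Names here are illustrative only:

```python
# Toy sketch: gradient-matrix columns for two constraints on disjoint
# variable blocks. Rows cover X's entries then Y's entries; entries coupling
# c1 to Y, or c2 to X, are zero, giving a block diagonal structure.
def block_diag_grad(gX_c1, gY_c2):
    nX, nY = len(gX_c1), len(gY_c2)
    col_c1 = gX_c1 + [0.0] * nY  # c1 does not depend on Y
    col_c2 = [0.0] * nX + gY_c2  # c2 does not depend on X
    return [col_c1, col_c2]      # columns of the (nX+nY) x 2 gradient matrix
```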
> Since the user still has to call pygransoStruct.py, I think it should be moved back to the main directory, as otherwise, users likely won't easily see it in the private folder.
Done
> Doesn't using a try in tensor2vec.py mean that Python has to execute the entire body of combinedFunction before it figures out whether it returns 3 or 6 output arguments? If so, this means that the code is unnecessarily evaluating all the functions (and ADing the gradients) twice per each value of X_struct, I think. Unlike MATLAB, I don't think Python has an nargout equivalent for getting the number of output arguments of a function, so maybe it is better to have a global AD on/off option in pygransoOptions, which will then also determine whether combinedFunction must return 3 or 6 output arguments. Then we can replace the try with an if clause and avoid doubling the computation.
Done. `globalAD` was added to the options.
> When global AD is turned on, wouldn't it make the most sense to have the .detach calls also be in the tensor2vec.py wrapper so that overhead is eliminated as soon as possible?
Done. `detach()` is now called immediately after the AD.
> I guess you can add my name after yours to the copyright for tensor2vec.py since I also had input in its design, but this is not so important.
Done
Is there anything else you have in mind that we need to update? If not, we can have a Zoom meeting to go through all the code.
For the code, I don't think so, but it would be good to do a Zoom meeting to walk through the code together.
I guess the website will need a lot of updates with all the license, citation, and code changes. I can also help review that stuff if you let me know where to look (not sure if you have that stuff in a private repo).
The examples in the paper will also need updating but now they'll be simpler. ;-)
Yes, my plan is to finalize the source code first, then update the examples, website, and paper, since we want to make sure the interface doesn't change afterwards.
I'm available almost anytime today and tomorrow. Please let me know your available time.
How about now?
Sure. Invitation sent
Updated the source code (except the example folder) as we discussed in the meeting.
From Tim:

> *Why are there extra input arguments (`var_dim_map`, `nn_model`, etc.) as opposed to adding them to options?*