juanmed opened this issue 4 years ago
Hi Juan,
Thanks for the interest, please find the answers to your questions below:
On Fri, Mar 20, 2020 at 1:14 PM Juan notifications@github.com wrote:
Hi,
Thanks for sharing your reference implementation. I was studying your code and paper and came up with several questions I could not answer when referring back to the paper. I was wondering if I could get your feedback on them. The questions are as follows:
- What is the purpose of concatenating the index of every event in the loader?
This is done to more easily parallelize the kernel computation, which can then be done in the same way for events from different samples in a batch.
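For illustration, here is a minimal sketch of that idea (the function name `collate_events` and the exact array layout are assumptions, not necessarily the repository's code): appending the sample index lets all events in a batch be flattened into one array and pushed through the kernel in a single call.

```python
import numpy as np

def collate_events(batch):
    """Sketch of a collate function in the spirit of utils/loader.py:
    append the sample index to each event so that events from all samples
    can be stacked and processed by the kernel in one pass."""
    events, labels = [], []
    for i, (ev, label) in enumerate(batch):
        # ev: (N_i, 4) array with columns x, y, t, p
        idx = np.full((ev.shape[0], 1), i, dtype=ev.dtype)
        events.append(np.concatenate([ev, idx], axis=1))  # -> (N_i, 5)
        labels.append(label)
    # one flat (sum N_i, 5) array; the last column tells the voxelization
    # step which sample (and hence which voxel grid) each event belongs to
    return np.concatenate(events, axis=0), np.array(labels)
```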
What is the trilinear kernel initialization, and how should one modify it (retrain it) when using another dataset?
In the paper we describe that we initialize the learnt kernel to a trilinear kernel, which basically means that we do trilinear voting for each event in the voxel grid. If you want to use a different one, you can have a look at the training loop at https://github.com/uzh-rpg/rpg_event_representation_learning/blob/0f1ba48542872f712564c554efef8e3fd64a7f76/utils/models.py#L46 where the kernel weights are initially trained to generate a trilinear profile.
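For reference, the trilinear (triangular along time) profile the kernel is initialized to can be written as `max(0, 1 - |t|)`, with `t` measured in bin widths; a small sketch with hypothetical names:

```python
import torch

def trilinear_kernel(delta_t_norm):
    """Target profile the kernel MLP is initially trained to reproduce:
    a triangular weighting that votes an event into the two nearest
    temporal bins. delta_t_norm is the signed distance to a bin centre,
    in units of the bin width."""
    return torch.clamp(1.0 - delta_t_norm.abs(), min=0.0)

# an event exactly on a bin centre contributes weight 1 to that bin,
# an event halfway between two bins contributes 0.5 to each
print(trilinear_kernel(torch.tensor([0.0, 0.5, 1.2])))  # -> [1.0, 0.5, 0.0]
```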
What is C in the voxel dimensions and why is its value set to 9?
This value is the discretization of the time dimension. For a set of events, we divide the total time in C bins. In our work 9 seemed to give the best tradeoff between performance and accuracy.
It seems, based on line 114 in models.py that this is the number of bins, which is referred to as B in the paper. However, line 94 of the same file has the variable B, but it is not obvious to me how the calculations performed on it represent the bin size.
In the code C refers to the number of bins and B refers to the batch size.
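For intuition, here is a minimal, non-learnt sketch of what "C bins along the time axis" means; `simple_voxel_grid` and its arguments are hypothetical names, and the learnt kernel in the repository replaces the hard binning below with trainable voting weights.

```python
import numpy as np

def simple_voxel_grid(x, y, t, p, H, W, C=9):
    """Illustrative (non-learnt) voxel grid: divide the event stream's
    duration into C temporal bins and accumulate polarity per (bin, y, x)."""
    vox = np.zeros((C, H, W), dtype=np.float32)
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)  # map times to [0, 1]
    bins = np.clip((t_norm * C).astype(int), 0, C - 1)     # bin index per event
    np.add.at(vox, (bins, y.astype(int), x.astype(int)), p)
    return vox
```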
How should the C value be changed for event record lengths greater than 100 ms? I have a dataset of about 1 s record lengths. If C is the number of bins, it might make sense that a greater C is necessary for longer record lengths, right?
yes I think this makes sense.
What is the purpose of the crop_and_resize_to_resolution method? Given that it is the input to the classifier, it seems that its only purpose is to satisfy the requirement that the classifier's input should at least be 224x224. Is this correct?
Yes, this is correct. Since ResNet34 applies a fully connected layer at the end, you cannot feed it arbitrary input sizes. For this reason the input needs to be resized and then cropped.
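As an illustration of the crop/resize idea (a hedged sketch only; the exact crop vs. resize order and interpolation mode in `crop_and_resize_to_resolution` may differ):

```python
import torch
import torch.nn.functional as F

def crop_and_resize(vox, output_size=224):
    """Sketch: centre-crop the larger spatial dimension so the voxel grid
    (B, C, H, W) becomes square, then resize it to the resolution the
    ResNet34 classifier expects."""
    B, C, H, W = vox.shape
    if H > W:
        top = (H - W) // 2
        vox = vox[:, :, top:top + W, :]
    elif W > H:
        left = (W - H) // 2
        vox = vox[:, :, :, left:left + H]
    return F.interpolate(vox, size=(output_size, output_size),
                         mode="bilinear", align_corners=False)
```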
How would you suggest to process an RGB event camera input with the learnable representation network you propose? A way that occurs to me would be to create 3 networks, one for each channel, and combine their outputs in another hidden layer whose weights should be learnt. This is of course not efficient, and does not leverage the trilinear kernel and information structure.
this is an interesting idea. I think that your suggestion makes sense. Another way to do it would be to generate three learnt voxel grids by outputting not 1 but 3 values from the learnt kernel, corresponding to the three voxel grids. Finally, these grids could be stacked together. Let me know what works best for you.
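A minimal sketch of the "output 3 values from the learnt kernel" suggestion; the class name, hidden width, and activations below are assumptions rather than the repository's architecture:

```python
import torch
import torch.nn as nn

class MultiChannelKernel(nn.Module):
    """Hypothetical variant of the learnt kernel: instead of one value per
    temporal offset, the MLP outputs 3 values, one per colour channel, so
    each event votes into three voxel grids that are stacked afterwards."""
    def __init__(self, hidden=30, out_channels=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.LeakyReLU(0.1),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.1),
            nn.Linear(hidden, out_channels),
        )

    def forward(self, delta_t_norm):
        # delta_t_norm: (N, 1) temporal offsets -> (N, 3) per-channel weights
        return self.mlp(delta_t_norm)
```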
I know these are several questions and hope this is not a burden; I will appreciate your hints and time.
Thanks!
Hi,
Thanks for your kind reply.
> In the paper we describe that we initialize the learnt kernel to a trilinear kernel, which basically means that we do trilinear voting for each event in the voxel grid. If you want to use a different one, you can have a look at the training loop at ... where the kernel weights are initially trained to generate a trilinear profile.
Hmm, I see, this is interesting. It clarifies the suggestion made at some point about using a lookup table to increase speed.
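For what it's worth, my understanding of the lookup-table idea is to sample the trained kernel MLP once on a dense grid of temporal offsets and index into that table at inference instead of evaluating the MLP per event; a hedged sketch with hypothetical names:

```python
import torch

def build_kernel_lut(kernel_mlp, num_entries=1001):
    """Sample the trained kernel MLP (1 input -> 1 output) on a dense grid
    of normalized temporal offsets and store the results as a table."""
    offsets = torch.linspace(-1.0, 1.0, num_entries).unsqueeze(1)  # (N, 1)
    with torch.no_grad():
        table = kernel_mlp(offsets).squeeze(1)                     # (N,)
    return offsets.squeeze(1), table

def lut_lookup(offsets, table, delta_t_norm):
    # coarse lookup via bucketize; interpolating between neighbouring
    # entries would reproduce the MLP more faithfully
    idx = torch.bucketize(delta_t_norm, offsets).clamp(0, len(table) - 1)
    return table[idx]
```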
> yes I think this makes sense.
Then, if I increase the number of bins `C`, would I need to retrain a new kernel for the new size?
I will try to make some tests and let you know if I get any interesting results.
Thanks again!
Hey,
Hello again. I continued to study your paper and was unable to answer some questions on the code and the paper by myself, and hope you can help me out again.
0) I would like to ask for your advice on what would be the best way to load my own data. I do not start from `.npy` files. Instead, I have `x, y, time, polarity` tensors, with `np.nan` values for the `x, y` coordinates where there are no events. In your code, in line 96 of models.py, you create a `vox` tensor, which apparently will contain the final voxel grid and is filled with zeros. I have the feeling that I can exploit my tensors using this `vox`, but I am not sure how.
I did try creating my own `.npy` files. This works, but since event data is rather sparse, the process of eliminating all `np.nan` values and creating the `.npy` files is very slow and not memory efficient.
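For concreteness, this is a minimal sketch of what I mean by converting the NaN-padded tensors into an event list (a hypothetical helper, not part of the repository):

```python
import numpy as np

def dense_to_events(x, y, t, p):
    """Keep only the entries where x and y are valid and stack them into
    the (N, 4) x, y, t, p event array that the loader expects."""
    valid = ~(np.isnan(x) | np.isnan(y))
    events = np.stack([x[valid], y[valid], t[valid], p[valid]], axis=1)
    # sort by timestamp so time normalization and binning behave as expected
    return events[np.argsort(events[:, 2])]
```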
1) In your code, are you assuming that all event files (`.npy`) start at time `t = 0`? Your paper expresses the normalized time stamp as `f = (t - t0) / delta_t`, but in your code, when normalizing time, the time offset `t0` is not subtracted. If you assumed this, should one subtract `t0` before passing any group of events to your network?
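To make the question concrete, this is the normalization I would expect from the paper's formula, with `t0` subtracted explicitly (a sketch, not the repository's code):

```python
import numpy as np

def normalize_timestamps(t):
    """Timestamp normalization as written in the paper, f = (t - t0) / delta_t:
    subtract the first timestamp before dividing by the duration, so the result
    lies in [0, 1] even if the recording does not start at t = 0."""
    t0 = t.min()
    return (t - t0) / max(t.max() - t0, 1e-9)
```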
2) Could you please explain what the following lines of code do? Also, why the `+0` in line 110? I could not wrap my head around these lines.
3) Also, if I increase the number of bins `C`, would I need to retrain a new trilinear kernel for the new size? The signature of `init_kernel` makes me think that the answer is yes, but I tried the existing kernel (which was trained for `C = 9`) with a size `C = 18` and there were no errors.
Thank you for your answer!
Hello, where is the implementation of the lookup table? I could not find it in the code. Does it mean that the inputs of the MLP are enumerated?
Hi,
Thanks for sharing your reference implementation. I was studying your code and paper and came up with several questions I could not answer when referring back to the paper. I was wondering if I could get your feedback on them. The questions are as follows:
1) What is the purpose of concatenating the index of every event in the loader?
https://github.com/uzh-rpg/rpg_event_representation_learning/blob/0f1ba48542872f712564c554efef8e3fd64a7f76/utils/loader.py#L30
2) What is the trilinear kernel initialization, and how should one modify it (retrain it) when using another dataset?
3) What is `C` in the voxel dimensions and why is its value set to 9?
https://github.com/uzh-rpg/rpg_event_representation_learning/blob/0f1ba48542872f712564c554efef8e3fd64a7f76/utils/models.py#L129
It seems, based on line 114 in models.py, that this is the number of bins, which is referred to as `B` in the paper. However, line 94 of the same file has the variable `B`, but it is not obvious to me how the calculations performed on it represent the bin size.
https://github.com/uzh-rpg/rpg_event_representation_learning/blob/0f1ba48542872f712564c554efef8e3fd64a7f76/utils/models.py#L94
https://github.com/uzh-rpg/rpg_event_representation_learning/blob/0f1ba48542872f712564c554efef8e3fd64a7f76/utils/models.py#L114
4) How should the `C` value be changed for event record lengths greater than 100 ms? I have a dataset of about 1 s record lengths. If `C` is the number of bins, it might make sense that a greater `C` is necessary for longer record lengths, right?
5) What is the purpose of the `crop_and_resize_to_resolution` method? Given that it produces the input to the classifier, it seems that its only purpose is to satisfy the requirement that the classifier's input should be at least 224x224. Is this correct?
6) How would you suggest processing an RGB event camera input with the learnable representation network you propose? A way that occurs to me would be to create 3 networks, one for each channel, and combine their outputs in another hidden layer whose weights should be learnt. This is of course not efficient, and does not leverage the trilinear kernel and information structure.
I know these are several questions and hope this is not a burden; I will appreciate your hints and time.
Thanks!