tanglef opened this pull request 4 years ago
Hi @tanglef,
Thanks a lot for your help, and all my apologies for the very late answer. I've been mostly busy with non-OT-related work in 2020 and am only able to get back to serious development of the GeomLoss package now that KeOps v1.5 is out and our paper on unbalanced OT has finally been submitted.
As part of my PostDoc at Imperial College, I have become familiar with the PyTorch_geometric toolbox, which handles heterogeneous batches using an elegant "batch vector". The system is very much compatible with KeOps (as detailed in e.g. this diagonal_ranges function) and lighter than padding: I intend to implement it as a standard option for both KeOps and GeomLoss.
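For concreteness, here is a minimal sketch of the batch-vector idea in plain PyTorch (illustrative only: the names and sizes are made up, and this is not the actual GeomLoss or PyTorch_geometric API):

```python
import torch

# Heterogeneous batch: three point clouds with different numbers of points.
clouds = [torch.randn(n, 3) for n in (2000, 3500, 5000)]

# Concatenate everything into one contiguous tensor...
x = torch.cat(clouds, dim=0)  # (10500, 3)

# ...and keep a "batch vector" that maps every point back to its sample,
# as PyTorch_geometric does: [0, 0, ..., 1, 1, ..., 2, 2, ...].
batch = torch.cat([
    torch.full((c.shape[0],), i, dtype=torch.long)
    for i, c in enumerate(clouds)
])  # (10500,)

# Since samples are stored contiguously, the batch vector is easily turned
# into [start, end) slices per sample, roughly in the spirit of the ranges
# that KeOps block-sparse reductions consume (see diagonal_ranges above).
counts = torch.bincount(batch)
ends = counts.cumsum(0)
starts = ends - counts
ranges = torch.stack([starts, ends], dim=1).int()  # (3, 2)
```

No padding is involved: memory and compute scale with the true number of points in each sample.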
In this context, what would be your opinion on the files that we could keep from this PR? As far as I can tell, the best option would be to adapt and keep the plot_batch_ot.py tutorial while discarding the padding code in the samples_loss.py file.
What do you think?
In any case, thanks again for everything, and good luck with your Master's degree!
Jean
Hi @jeanfeydy!
No worries, and congrats on your work. I didn't know this existed, so thanks! Using that, I wouldn't keep much of this PR, indeed. I reread the files and saw that padding became very time-consuming for what I now know is actually quite a small gap (a difference of about 3000 in the sizes) :sweat_smile:
The plot_batch_ot.py tutorial might be used as a base for the new method, because it's essentially a demo + benchmark.
But (imo) the best option would simply be to close the PR (as its main goal will take an entirely new direction) and (maybe) reuse plot_batch_ot.py if needed. Since the keyword "padding" and the issue number are in the title, it won't be hard to find anyway.
Thanks a lot, and good luck to you with this feature!
Hi @jeanfeydy and @tanglef. As I understand it, this PR makes things much slower, is that correct? What would you recommend for dealing with a batch of input sequences of different lengths? This use case is ubiquitous in NLP and speech processing applications, so I believe support for it could be highly valuable.
Makes it easier to work with batches of different sizes (using lists), and adds an example of how to use it in the docs (with a benchmark).
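For reference, a generic sketch of the padding strategy discussed in this thread (illustrative only, not the PR's actual code):

```python
import torch

# Pad a list of clouds to the size of the largest one, and give the dummy
# points zero weight so that they do not contribute to the OT loss.
clouds = [torch.randn(n, 3) for n in (2000, 3500, 5000)]
n_max = max(c.shape[0] for c in clouds)

x = torch.zeros(len(clouds), n_max, 3)
w = torch.zeros(len(clouds), n_max)
for i, c in enumerate(clouds):
    x[i, : c.shape[0]] = c
    w[i, : c.shape[0]] = 1.0 / c.shape[0]  # uniform weights on the real points

# Downside: every sample now costs O(n_max) memory and compute, which is
# exactly the overhead that the batch-vector approach avoids.
```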