qdrant / quaterion

Blazing fast framework for fine-tuning similarity learning models
https://quaterion.qdrant.tech/
Apache License 2.0
637 stars 45 forks

Separate cache dataloader #17

Closed joein closed 2 years ago

joein commented 2 years ago

Create internal dataloader for caching to reduce repeated calculations.

E.g. dataloader provides batches of the following form:

[
    [anchor, positive],
    [anchor, negative_0],
    [anchor, negative_1],
    ...,
    [anchor, negative_n]
]

Assume these negative samples are chosen at random: the same negative can then appear in batches with different anchors, and each time its embedding has to be recalculated, which may be really costly.
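As a toy illustration (plain Python, not the Quaterion API), counting how many forward passes a repeated random negative would trigger:

```python
# Hypothetical sketch: count how often each sample would be embedded if
# every batch entry triggered a fresh forward pass (no cache).
from collections import Counter

def embedding_calls(batches):
    counts = Counter()
    for batch in batches:
        for pair in batch:
            for sample in pair:
                counts[sample] += 1
    return counts

# The random negative "n0" lands in batches with two different anchors:
batches = [
    [("a", "p"), ("a", "n0")],
    [("b", "p2"), ("b", "n0")],
]
calls = embedding_calls(batches)
# Without caching, "n0" is embedded twice even though it is one sample.
```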

The other example is hard-negative mining, for which you want to search through the whole dataset. If the dataset is small enough, you can load it into a single batch. It will be a batch of the same form as in the first example, but the number of negatives will be N - 1, where N is the size of the dataset. You would then need to calculate embeddings N^2 times, which is really expensive.
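A quick back-of-the-envelope check of that cost, with N as a hypothetical dataset size:

```python
# Each of the N anchors forms N - 1 [anchor, negative] pairs, and naively
# embedding both sides of every pair costs on the order of N^2 forward
# passes, even though the dataset only contains N unique samples.
N = 1_000
naive_passes = 2 * N * (N - 1)   # re-embed both elements of every pair
cached_passes = N                # each unique sample embedded once
print(naive_passes, cached_passes)
```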

To avoid these extra calculations, a separate internal dataloader should be implemented; through it, the cache can be filled in linear time.
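A minimal sketch of that idea in plain Python (names here are hypothetical, not Quaterion's internals): walk the samples once and embed each distinct one a single time.

```python
def fill_cache(batches, encode):
    """Fill the embedding cache in linear time: one forward pass per unique sample."""
    cache = {}
    for batch in batches:
        for pair in batch:
            for sample in pair:
                if sample not in cache:
                    cache[sample] = encode(sample)
    return cache

encode_calls = []
def encode(sample):
    # Stand-in for the model's forward pass.
    encode_calls.append(sample)
    return [float(len(sample))]  # placeholder "embedding"

batches = [[("a", "p"), ("a", "n0")], [("b", "n0")]]
cache = fill_cache(batches, encode)
# encode ran once per unique sample: "a", "p", "n0", "b"
```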

generall commented 2 years ago

Hm, we can do something like a flatten dataloader, which converts

[
    [anchor, positive],
    [anchor, negative_0],
    [anchor, negative_1],
    ...,
    [anchor, negative_n]
]

into

[
    anchor,
    positive,
    negative_0,
    negative_1,
    ...
]

And performs internal de-duplication with a local per-thread cache.
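A rough sketch of that flattening step (illustrative only, not the library implementation): de-duplicate the samples locally and keep index pairs so the original structure can be restored after encoding.

```python
def flatten_pairs(pairs):
    """Flatten [anchor, x] pairs into a de-duplicated sample list plus index pairs."""
    flat, index, pair_ids = [], {}, []
    for a, b in pairs:
        ids = []
        for sample in (a, b):
            if sample not in index:       # local de-duplication
                index[sample] = len(flat)
                flat.append(sample)
            ids.append(index[sample])
        pair_ids.append(tuple(ids))
    return flat, pair_ids

pairs = [
    ("anchor", "positive"),
    ("anchor", "negative_0"),
    ("anchor", "negative_1"),
]
flat, pair_ids = flatten_pairs(pairs)
# flat holds each sample once; pair_ids maps back to the original pairs.
```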

monatis commented 2 years ago

In fact, we usually perform batch-hard mining just after the forward pass and just before the backward pass. I think the issue of miners requires extra attention because it has the potential to affect many things in the library.