teddykoker / torchsort

Fast, differentiable sorting and ranking in PyTorch
https://pypi.org/project/torchsort/
Apache License 2.0

Memory issue with Spearman for large matrix #20

Closed: analyticray closed this issue 3 years ago

analyticray commented 3 years ago

I have a 24K x 24K matrix for which I need to compute the Spearman correlation: pred -> torch.Size([24000, 24000]), target -> torch.Size([24000, 24000])

However, it raises a memory warning and the notebook halts when I use your spearman function example:

s = spearman(pred, target)

Can you please advise on how I can compute Spearman correlation for a matrix of this size using your library? I would appreciate it if you could share a code example. Thanks.

The Spearman function is as follows:

import torchsort

def corrcoef(target, pred):
    # Pearson correlation: mean-center and L2-normalize both inputs,
    # then take the dot product
    pred_n = pred - pred.mean()
    target_n = target - target.mean()
    pred_n = pred_n / pred_n.norm()
    target_n = target_n / target_n.norm()
    return (pred_n * target_n).sum()

def spearman(
    target,
    pred,
    regularization="l2",
    regularization_strength=0.1,
):
    # differentiable (soft) ranks of the predictions
    pred = torchsort.soft_rank(
        pred,
        regularization=regularization,
        regularization_strength=regularization_strength,
    )
    return corrcoef(target, pred / pred.shape[-1])
teddykoker commented 3 years ago

What exactly are you trying to do? Each matrix will use about 24,000 × 24,000 × 4 bytes (32 bits per float) ≈ 2.3 gigabytes of space, and much more is needed as overhead for sorting and intermediate storage.
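
For reference, the back-of-the-envelope arithmetic behind that figure (plain Python for illustration, not torchsort code):

# rough size of one 24,000 x 24,000 float32 tensor
bytes_per_matrix = 24_000 * 24_000 * 4  # 4 bytes per float32 element
print(bytes_per_matrix / 1e9)           # ~2.3 GB, before any sorting overhead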

Torchsort requires more memory in order to be differentiable; if you do not need differentiability, you can just use this ranking function:

import torch
import torchsort
from torchsort.ops import _inv_permutation, _arange_like

def rank(x):
    # hard (non-differentiable) rank along the last dimension
    _, permutation = torch.sort(x)
    # add 1 to be consistent with soft_rank
    ret = _arange_like(x) + 1.0
    return ret.gather(1, _inv_permutation(permutation))

def spearmanr(pred, target, **kw):
    # Spearman correlation = Pearson correlation of the ranks
    pred = rank(pred)
    target = rank(target)
    pred = pred - pred.mean()
    pred = pred / pred.norm()
    target = target - target.mean()
    target = target / target.norm()
    return (pred * target).sum()

size = 24000
pred = torch.randn(size, size)
target = torch.randn(size, size)

print(spearmanr(pred, target))

This ran out of memory on my 8 GB machine, but ran successfully on my 16 GB machine. I think the differentiable version would need even more memory.

analyticray commented 3 years ago

Thank you for sharing the sample code. Yes, I am very frustrated with this memory issue. I'll try it out on the 16 GB machine as you suggested, but I wish we could compute the Spearman rank by loading data incrementally from disk so that it scales.

What I am trying to do is basically compute global structure preservation (i.e., residual variance) for evaluating a dimension reduction algorithm. For example, given X1 (no. of instances, no. of features) from the original dataset and X2 (no. of instances, no. of features) from the transformed data, compute the Spearman rank correlation between X1 and X2.

teddykoker commented 3 years ago

I see; you could certainly load the instances incrementally using a smaller batch size, which should let you use less memory.
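
Following up on that suggestion, below is a minimal sketch of how the instances could be processed in chunks (this is not torchsort code; rank_rows, spearmanr_chunked, and chunk_size are hypothetical names chosen for illustration). It accumulates the running sums needed for a Pearson correlation of the per-row ranks, which is what the non-differentiable spearmanr above computes, so only one chunk of ranks is materialized at a time:

import torch

def rank_rows(x):
    # 1-based rank of each element within its row (assumes no ties);
    # argsort of argsort gives the 0-based per-row ranks
    return torch.argsort(torch.argsort(x, dim=-1), dim=-1).double() + 1.0

def spearmanr_chunked(pred, target, chunk_size=1000):
    # running sums for a streaming Pearson correlation of the ranks
    n = 0.0
    sx = sy = sxx = syy = sxy = 0.0
    for start in range(0, pred.shape[0], chunk_size):
        p = rank_rows(pred[start:start + chunk_size])
        t = rank_rows(target[start:start + chunk_size])
        n += p.numel()
        sx += p.sum().item()
        sy += t.sum().item()
        sxx += (p * p).sum().item()
        syy += (t * t).sum().item()
        sxy += (p * t).sum().item()
    num = n * sxy - sx * sy
    den = ((n * sxx - sx ** 2) ** 0.5) * ((n * syy - sy ** 2) ** 0.5)
    return num / den

On matrices that fit in memory, spearmanr_chunked(pred, target) should agree with spearmanr(pred, target) up to floating-point error, and the pred/target chunks could also be memory-mapped or loaded from disk one batch at a time.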