jameschapman19 / cca_zoo

Canonical Correlation Analysis Zoo: A collection of Regularized, Deep Learning based, Kernel, and Probabilistic methods in a scikit-learn style framework
https://cca-zoo.readthedocs.io/en/latest/
MIT License

DCCA transform method requires dataloader #109

Closed AdirRahamim closed 2 years ago

AdirRahamim commented 2 years ago

Hi,

After the recent update that uses pytorch lightning instead of the deep wrapper, there is no longer an option to pass a tuple to the transform function; it accepts only a pytorch dataloader. For example, this simple code worked before:

import numpy as np
# cca_zoo imports (data, Encoder, DCCA, objectives, DeepWrapper) omitted here

a = np.random.randn(2000, 50)   # view 1: 2000 samples, 50 features
b = np.random.randn(2000, 100)  # view 2: 2000 samples, 100 features
m1 = min(a.shape[1], b.shape[1])

train_dataset = data.CCA_Dataset([a, b])
encoder_a = Encoder(latent_dims=m1, feature_size=50, layer_sizes=[128, 256])
encoder_b = Encoder(latent_dims=m1, feature_size=100, layer_sizes=[128, 256])
dcca = DCCA(latent_dims=m1, objective=objectives.CCA, encoders=[encoder_a, encoder_b])
dcca = DeepWrapper(dcca, device='cpu').fit(train_dataset, epochs=10)
U, V = dcca.transform((a, b))  # transform accepted a tuple of numpy arrays

Here is how I rewrote it using pytorch lightning:

import numpy as np
import torch
import torch.optim as optim
import pytorch_lightning as pl
# cca_zoo imports (data, Encoder, DCCA, objectives, CCALightning, get_dataloaders) omitted here

a = np.random.randn(2000, 50)   # train views
b = np.random.randn(2000, 100)
c = np.random.randn(2000, 50)   # validation views
d = np.random.randn(2000, 100)
m1 = min(a.shape[1], b.shape[1])

train_dataset = data.CCA_Dataset([a, b])
val_dataset = data.CCA_Dataset([c, d])
train_loader, val_loader = get_dataloaders(train_dataset, val_dataset)

# feature_size = input dimension, latent_dims = output dimension
encoder_a = Encoder(latent_dims=m1, feature_size=50, layer_sizes=[128, 256])
encoder_b = Encoder(latent_dims=m1, feature_size=100, layer_sizes=[128, 256])

dcca = DCCA(latent_dims=m1, objective=objectives.CCA, encoders=[encoder_a, encoder_b])
optimizer = optim.Adam(dcca.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, 1)
dcca = CCALightning(dcca, optimizer=optimizer, lr_scheduler=scheduler)
trainer = pl.Trainer(max_epochs=2, enable_checkpointing=False, gpus=1 if torch.cuda.is_available() else 0)
trainer.fit(dcca, train_loader, val_loader)
U, V = dcca.transform((a, b))  # this line now raises an error

This now raises an error. Would it be possible to bring back the option to pass a tuple?

jameschapman19 commented 2 years ago

In this particular example you could just do:

U, V = dcca.transform(train_loader)

In general the imagined flow would be:

new_dataset = data.CCA_Dataset([e, f])     # or your own dataset object
new_loader = get_dataloaders(new_dataset)  # or your own dataloader object
U, V = dcca.transform(new_loader)

The first two lines are what was previously being called behind the scenes. I was thinking along the lines of explicit over implicit, and of simplifying the code base so that errors in data handling surface on the user's side rather than in my code, particularly because pytorch-lightning requires you to have built the train/val loaders anyway.

I suppose one useful change your example suggests would be to extend the utility function get_dataloaders() so that it can optionally build test loaders too. Your example would then look something like:

a = np.random.randn(2000, 50)
b = np.random.randn(2000, 100)
c = np.random.randn(2000, 50)
d = np.random.randn(2000, 100)
# Assuming there is some new data, unlike your example
e = np.random.randn(2000, 50)
f = np.random.randn(2000, 100)
m1 = min(a.shape[1], b.shape[1])

train_dataset = data.CCA_Dataset([a, b])
val_dataset = data.CCA_Dataset([c, d])
test_dataset = data.CCA_Dataset([e, f])
# get_dataloaders() doesn't currently have this test_dataset argument,
# but it feels like a potentially good addition
train_loader, val_loader, test_loader = get_dataloaders(train_dataset, val_dataset, test_dataset=test_dataset)

# feature_size = input dimension, latent_dims = output dimension
encoder_a = Encoder(latent_dims=m1, feature_size=50, layer_sizes=[128, 256])
encoder_b = Encoder(latent_dims=m1, feature_size=100, layer_sizes=[128, 256])

dcca = DCCA(latent_dims=m1, objective=objectives.CCA, encoders=[encoder_a, encoder_b])
optimizer = optim.Adam(dcca.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, 1)
dcca = CCALightning(dcca, optimizer=optimizer, lr_scheduler=scheduler)
trainer = pl.Trainer(max_epochs=2, enable_checkpointing=False, gpus=1 if torch.cuda.is_available() else 0)
trainer.fit(dcca, train_loader, val_loader)
U, V = dcca.transform(test_loader)

Any thoughts much appreciated!

AdirRahamim commented 2 years ago

In my opinion, the main problem is the inconsistency between DCCA and the other methods (for example the base CCA): there you pass a tuple, while here you pass a dataloader, which is confusing.

jameschapman19 commented 2 years ago

Yeah, I agree. I initially aimed for consistency, but I think the overall benefits of pytorch-lightning (which requires loaders) in terms of flexibility outweighed consistency for the fit() method. Once I'd changed the fit method I'd already lost that consistency, so I didn't worry as much about transform() being different. It would be a simple change on my end to wrap numpy arrays into dataloaders as above, so perhaps I'll do that.
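
Roughly, the kind of change I have in mind would be a small helper along these lines (just a sketch, not the library's actual transform code; CCA_Dataset and get_dataloaders are the helpers from the snippets above):

from torch.utils.data import DataLoader

def as_dataloader(views):
    """Hypothetical helper: pass an existing dataloader through unchanged,
    otherwise wrap a tuple/list of numpy views in a CCA_Dataset and build a
    loader for it."""
    if isinstance(views, DataLoader):
        return views
    dataset = data.CCA_Dataset(list(views))
    return get_dataloaders(dataset)

# transform() could then simply begin with: loader = as_dataloader(views)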

AdirRahamim commented 2 years ago

I understand, thank you for your help! Just one more thing I couldn't understand regarding DCCA: in my simple example code, does calling the transform method also run CCA on the new latent vectors (the outputs of encoder_a and encoder_b), or after obtaining U and V do I need to apply CCA on them manually (by calling CCA.fit and then transform once again)?

jameschapman19 commented 2 years ago

Good question - and one I need to address in docs!

In the transform method:

def transform(self, loader: torch.utils.data.DataLoader, train: bool = False):

If you pass train=True then it will fit a linear CCA (or linear MCCA/GCCA/TCCA where relevant), which can then be used for out-of-sample data.

i.e. in your example:

U, V = dcca.transform(train_loader, train=True)
U_newdata, V_newdata = dcca.transform(new_loader)  # train=False by default
AdirRahamim commented 2 years ago

That explains the poor results I've been getting so far, thanks :)

jameschapman19 commented 2 years ago

That last one is partly an oversight on my part, as in my training loop I fit the final CCA by default at the end.

I've actually just realised pytorch-lightning has a:

def on_train_end(self, trainer, pl_module):
        print("do something when training ends")

So I will make it fit the final CCA by default in the next version! But what I've described here will be functionally the same.
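
Just as a sketch of what I mean (hypothetical, not the actual implementation), wired up as a pytorch-lightning callback it could look like:

import pytorch_lightning as pl

class FitFinalCCA(pl.Callback):
    """Hypothetical callback: fit the final linear CCA when training ends."""

    def __init__(self, train_loader):
        self.train_loader = train_loader

    def on_train_end(self, trainer, pl_module):
        # pl_module is the CCALightning-wrapped model from the snippets above;
        # train=True fits the linear CCA so later transform() calls work on
        # out-of-sample loaders without an explicit extra step.
        pl_module.transform(self.train_loader, train=True)

# usage: trainer = pl.Trainer(max_epochs=2, callbacks=[FitFinalCCA(train_loader)])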

nienkevanunen commented 2 years ago

Hello, I would like to add onto this. I was getting a different length than expected when transforming my data, and found out it's because the get_dataloaders() function drops the last incomplete batch. Unfortunately, there is currently no option to turn that off, so to transform my data I am building another dataloader, which feels a bit unintuitive.
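
Concretely, the extra dataloader I build for transforming looks roughly like this (the batch size is just an example, and it assumes CCA_Dataset behaves like a standard torch Dataset):

from torch.utils.data import DataLoader

# shuffle=False keeps the outputs aligned with the input order;
# drop_last=False keeps the final incomplete batch.
transform_loader = DataLoader(train_dataset, batch_size=128, shuffle=False, drop_last=False)
U, V = dcca.transform(transform_loader)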

jameschapman19 commented 2 years ago

Hi @poofcakes - I agree that I should pass through more of the DataLoader arguments as options in that function, and I will change that.
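
For example (just a sketch of a possible shape, not the current signature), get_dataloaders() could forward DataLoader keyword arguments such as drop_last; the batch size and shuffle defaults below are assumptions:

from torch.utils.data import DataLoader

def get_dataloaders(train_dataset, val_dataset=None, batch_size=128, drop_last=True, **loader_kwargs):
    """Hypothetical variant that forwards DataLoader options such as drop_last."""
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                              drop_last=drop_last, **loader_kwargs)
    if val_dataset is None:
        return train_loader
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False,
                            drop_last=drop_last, **loader_kwargs)
    return train_loader, val_loader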

Just to add context in case it's helpful: the reason I think it's best to default to dropping the last batch is that some of the CCA objectives are unstable at small batch sizes. I would sometimes run into the problem where all my gradients silently became nan, just because the last batch was much smaller.

Some of the methods are robust to small batch sizes (non-linear orthogonal iterations and Barlow twins).