tristandeleu / pytorch-meta

A collection of extensions and data-loaders for few-shot learning & meta-learning in PyTorch
https://tristandeleu.github.io/pytorch-meta/
MIT License

A minimal example with toy data set #74

Closed renesax14 closed 4 years ago

renesax14 commented 4 years ago

I was trying to use the toy data sets, but I got errors like train not existing when looping through the batches. Could we have a tiny minimal example of looping through the data for the toy data sets?

My attempt

from torchmeta.toy import Sinusoid
#from torchmeta.datasets.helpers import omniglot
from torchmeta.utils.data import BatchMetaDataLoader

from tqdm import tqdm

num_samples_per_task = 10
dataset = Sinusoid(num_samples_per_task, num_tasks=10, noise_std=None,
    transform=None, target_transform=None, dataset_transform=None)
#dataset = omniglot("data", ways=5, shots=5, test_shots=15, meta_train=True, download=True)
dataloader = BatchMetaDataLoader(dataset, batch_size=5, num_workers=4)
print(f'len(dataset) = {len(dataset)}')
print(f'len(dataloader) = {len(dataloader)}')
for batch in dataloader:
    train_inputs, train_targets = batch["train"]

Another weird thing was that the tensors were of size 16 even though my meta-batch size was 5...

tristandeleu commented 4 years ago

You should add a dataset_transform (e.g. ClassSplitter) to get a train and test dataset in batch. You can use torchmeta.toy.helpers.sinusoid, which comes with a default dataset_transform.

I cannot reproduce the tensor having size 16, I get tensors of size (5, 5, 1) as expected. Here is the modified script

from torchmeta.toy.helpers import sinusoid
from torchmeta.utils.data import BatchMetaDataLoader

dataset = sinusoid(shots=5, test_shots=5)
dataloader = BatchMetaDataLoader(dataset, batch_size=5, num_workers=4)

print(f'len(dataset) = {len(dataset)}')  # len(dataset) = 1000000
print(f'len(dataloader) = {len(dataloader)}')  # len(dataloader) = 200000

for batch in dataloader:
    train_inputs, train_targets = batch["train"]
    print(f'train_inputs.shape = {train_inputs.shape}')  # train_inputs.shape = torch.Size([5, 5, 1])
    print(f'train_targets.shape = {train_targets.shape}')  # train_targets.shape = torch.Size([5, 5, 1])
    break
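
The same batch also contains the corresponding test split produced by the ClassSplitter; a small addition to the loop above (variable names are mine):

    test_inputs, test_targets = batch["test"]
    print(f'test_inputs.shape = {test_inputs.shape}')  # test_inputs.shape = torch.Size([5, 5, 1])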
renesax14 commented 4 years ago

You should add a dataset_transform (e.g. ClassSplitter) to get a train and test dataset in batch. You can use torchmeta.toy.helpers.sinusoid, which comes with a default dataset_transform.

I cannot reproduce the tensor having size 16, I get tensors of size (5, 5, 1) as expected. Here is the modified script

from torchmeta.toy.helpers import sinusoid
from torchmeta.utils.data import BatchMetaDataLoader

dataset = sinusoid(shots=5, test_shots=5)
dataloader = BatchMetaDataLoader(dataset, batch_size=5, num_workers=4)

print(f'len(dataset) = {len(dataset)}')  # len(dataset) = 1000000
print(f'len(dataloader) = {len(dataloader)}')  # len(dataloader) = 200000

for batch in dataloader:
    train_inputs, train_targets = batch["train"]
    print(f'train_inputs.shape = {train_inputs.shape}')  # train_inputs.shape = torch.Size([5, 5, 1])
    print(f'train_targets.shape = {train_targets.shape}')  # train_targets.shape = torch.Size([5, 5, 1])
    break

Thanks! :D

How does the ClassSplitter know not to form data sets/tasks that are N-way, K-shot in the regression case? i.e. how does it guarantee that each data set/task D_i only gets 1 function?

(btw, the size 16 was a bug on my end, with Jupyter remembering stale state)

renesax14 commented 4 years ago

You should add a dataset_transform (e.g. ClassSplitter) to get a train and test dataset in batch. You can use torchmeta.toy.helpers.sinusoid, which comes with a default dataset_transform.

I find this comment confusing. In the helper I see that the Sinusoid dataset is passed to ClassSplitter directly, not via the dataset_transform argument (and none of the other options are used: transform=None, target_transform=None, dataset_transform=None).

Is that what you meant, or was that a typo, since the helper never passes a dataset_transform to the Sinusoid task?

I think I made progress understanding your code. ClassSplitter is there to produce the train and test splits, while the actual data set/task (whether per function or per N-way, K-shot combination) has already been created by the meta-set class.

My question about the dataloader remains, though.

tristandeleu commented 4 years ago

Data transforms (like ClassSplitter) can either be used as a dataset_transform, or as a wrapper (the wrapper is here just syntactic sugar). The following two are equivalent

  • ClassSplitter as a dataset_transform argument

from torchmeta.toy import Sinusoid
from torchmeta.transforms import ClassSplitter

dataset = Sinusoid(num_samples_per_task=15,
    dataset_transform=ClassSplitter(num_train_per_class=5, num_test_per_class=10))

task = dataset.sample_task()
print(task)  # OrderedDict([('train', <torchmeta.utils.data.task.SubsetTask object at 0x11ba07dd8>), ('test', <torchmeta.utils.data.task.SubsetTask object at 0x11ba10240>)])

  • ClassSplitter as a wrapper

from torchmeta.toy import Sinusoid
from torchmeta.transforms import ClassSplitter

dataset = Sinusoid(num_samples_per_task=15)
dataset = ClassSplitter(dataset, num_train_per_class=5, num_test_per_class=10)

task = dataset.sample_task()
print(task)  # OrderedDict([('train', <torchmeta.utils.data.task.SubsetTask object at 0x12078eda0>), ('test', <torchmeta.utils.data.task.SubsetTask object at 0x120797208>)])
renesax14 commented 4 years ago

Data transforms (like ClassSplitter) can either be used as a dataset_transform, or as a wrapper (the wrapper is here just syntactic sugar). The following two are equivalent

  • ClassSplitter as a dataset_transform argument
from torchmeta.toy import Sinusoid
from torchmeta.transforms import ClassSplitter

dataset = Sinusoid(num_samples_per_task=15,
    dataset_transform=ClassSplitter(num_train_per_class=5, num_test_per_class=10))

task = dataset.sample_task()
print(task)  # OrderedDict([('train', <torchmeta.utils.data.task.SubsetTask object at 0x11ba07dd8>), ('test', <torchmeta.utils.data.task.SubsetTask object at 0x11ba10240>)])
  • ClassSplitter as a wrapper
from torchmeta.toy import Sinusoid
from torchmeta.transforms import ClassSplitter

dataset = Sinusoid(num_samples_per_task=15)
dataset = ClassSplitter(dataset, num_train_per_class=5, num_test_per_class=10)

task = dataset.sample_task()
print(task)  # OrderedDict([('train', <torchmeta.utils.data.task.SubsetTask object at 0x12078eda0>), ('test', <torchmeta.utils.data.task.SubsetTask object at 0x120797208>)])

Quick clarification: is the num_samples_per_task input to sinusoid analogous to the 600 images per class label used in mini-imagenet? e.g. does num_samples_per_task get split by the class splitter into the usual 5 + 15 support/query set sizes?

tristandeleu commented 4 years ago

MiniImagenet does not have a num_samples_per_task argument (this is specific to toy regression datasets). But you can indeed see this as being similar to the 600 images per class: it corresponds to the number of possible examples to sample from for this task. In the case of toy regression tasks, this is simply the number of support + number of query examples (5 + 10 here).
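
For instance, a minimal sketch with the toy helper (assuming len() on the task splits reports their sample counts):

from torchmeta.toy.helpers import sinusoid

# Each task is generated with shots + test_shots = 5 + 10 = 15 samples in
# total, split by the default ClassSplitter into 'train' and 'test'.
dataset = sinusoid(shots=5, test_shots=10)
task = dataset.sample_task()
print(len(task['train']), len(task['test']))  # should print: 5 10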

renesax14 commented 4 years ago

MiniImagenet does not have a num_samples_per_task argument (this is specific to toy regression datasets). But you can indeed see this as being similar to the 600 images per class: it corresponds to the number of possible examples to sample from for this task. In the case of toy regression tasks, this is simply the number of support + number of query examples (5 + 10 here).

If I set num_samples_per_task to 600 and the class splitter to 5 + 15, I can sample more than 20 points, I hope?

tristandeleu commented 4 years ago

I don't understand what you mean. Sinusoid generates samples (there is not a pool of samples/images to sample from, as opposed to datasets like MiniImagenet), so num_samples_per_task specifies the number of samples to generate per task. If you have 5 samples in your training set and 15 in the test set of your task, then you only need to generate 5 + 15 samples for this task. If you need more samples for the training/test set of the task (e.g. you have a larger number of shots), then you can specify a larger num_samples_per_task.
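
Conceptually, something like this (a rough sketch, not the actual torchmeta code, and assuming the usual input range of [-5, 5]):

import numpy as np

rng = np.random.RandomState(0)
amplitude = rng.uniform(0.1, 5.0)   # task parameters define the sinusoid
phase = rng.uniform(0.0, 2 * np.pi)

num_samples_per_task = 5 + 15       # support + query samples for one task
x = rng.uniform(-5.0, 5.0, size=(num_samples_per_task, 1))
y = amplitude * np.sin(x - phase)   # samples are generated on the fly, not drawn from a pool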

brando90 commented 3 years ago
dataset = sinusoid(shots=5, test_shots=5)
dataloader = BatchMetaDataLoader(dataset, batch_size=5, num_workers=4)

Just to make this example complete: I believe you need to create other data loaders from scratch for the meta-val and meta-test sets. Since the task parameters are generated from scratch for each one, I believe they'd be disjoint, and then you'd have proper evaluation sets for your meta-learning algorithm.

see: https://github.com/tristandeleu/pytorch-meta/blob/master/torchmeta/toy/sinusoid.py

There are no explicit instructions on how to do that in the docs: https://tristandeleu.github.io/pytorch-meta/api_reference/toy/ so I assume what I said above is correct, based on the code I read.

Is this correct, Tristan? @tristandeleu

brando90 commented 3 years ago

You should add a dataset_transform (e.g. ClassSplitter) to get a train and test dataset in batch. You can use torchmeta.toy.helpers.sinusoid, which comes with a default dataset_transform.

I cannot reproduce the tensor having size 16, I get tensors of size (5, 5, 1) as expected. Here is the modified script

from torchmeta.toy.helpers import sinusoid
from torchmeta.utils.data import BatchMetaDataLoader

dataset = sinusoid(shots=5, test_shots=5)
dataloader = BatchMetaDataLoader(dataset, batch_size=5, num_workers=4)

print(f'len(dataset) = {len(dataset)}')  # len(dataset) = 1000000
print(f'len(dataloader) = {len(dataloader)}')  # len(dataloader) = 200000

for batch in dataloader:
    train_inputs, train_targets = batch["train"]
    print(f'train_inputs.shape = {train_inputs.shape}')  # train_inputs.shape = torch.Size([5, 5, 1])
    print(f'train_targets.shape = {train_targets.shape}')  # train_targets.shape = torch.Size([5, 5, 1])
    break

FYI it seems you need this:

        spt_x, spt_y, qry_x, qry_y = spt_x.float(), spt_y.float(), qry_x.float(), qry_y.float()

I tried putting it in the dataloader but couldn't do it cleanly without getting a lambda-function pickle error or other errors:

        args.criterion = nn.MSELoss()
        # tran = transforms.Compose([torch.tensor])
        # dataset = sinusoid(shots=args.k_eval, test_shots=args.k_shots, transform=tran)
        dataset = sinusoid(shots=args.k_eval, test_shots=args.k_shots)
        meta_train_dataloader = BatchMetaDataLoader(dataset, batch_size=args.meta_batch_size_train, num_workers=args.num_workers)
        meta_val_dataloader = BatchMetaDataLoader(dataset, batch_size=args.meta_batch_size_eval, num_workers=args.num_workers)
        meta_test_dataloader = BatchMetaDataLoader(dataset, batch_size=args.meta_batch_size_eval, num_workers=args.num_workers)
tristandeleu commented 3 years ago

just to make this example complete, I believe you need to create another data loader from scratch to create meta-test and meta-val data loader. Since params are generated from scratch for each I believe they'd be disjoint and then you'd have a proper evaluations sets to evaluate your meta-learning algorithm.

The distribution the task parameters are sampled from is fixed (amplitude sampled uniformly from U(0.1, 5), phase sampled uniformly from U(0, 2π)), so these are proper sets for evaluation. The meta-validation/meta-test sets will contain tasks which come from this same distribution over tasks.
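
For example, a minimal sketch of one way to set this up (reusing the helper from above); whether you reuse one dataset object or create several, the tasks come from this same distribution:

from torchmeta.toy.helpers import sinusoid
from torchmeta.utils.data import BatchMetaDataLoader

# Each dataset samples its own tasks, all from the same fixed
# distribution (amplitude ~ U(0.1, 5), phase ~ U(0, 2*pi)).
meta_val_dataset = sinusoid(shots=5, test_shots=15)
meta_test_dataset = sinusoid(shots=5, test_shots=15)

meta_val_dataloader = BatchMetaDataLoader(meta_val_dataset, batch_size=5, num_workers=4)
meta_test_dataloader = BatchMetaDataLoader(meta_test_dataset, batch_size=5, num_workers=4)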

FYI it seems you need this:

        spt_x, spt_y, qry_x, qry_y = spt_x.float(), spt_y.float(), qry_x.float(), qry_y.float()

If I understand correctly, and based on the snippet in https://github.com/tristandeleu/pytorch-meta/issues/74#issuecomment-656905769, this is

spt_x, spt_y = batch['train']
qry_x, qry_y = batch['test']
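
One option is to do the conversion inside the loop rather than via a transform (a sketch, assuming the sinusoid loader from earlier in the thread):

from torchmeta.toy.helpers import sinusoid
from torchmeta.utils.data import BatchMetaDataLoader

dataset = sinusoid(shots=5, test_shots=15)
dataloader = BatchMetaDataLoader(dataset, batch_size=5, num_workers=4)

for batch in dataloader:
    spt_x, spt_y = batch['train']
    qry_x, qry_y = batch['test']
    # casting here avoids pickling a lambda transform in the workers
    spt_x, spt_y = spt_x.float(), spt_y.float()
    qry_x, qry_y = qry_x.float(), qry_y.float()
    break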