google-research / long-range-arena

Long Range Arena for Benchmarking Efficient Transformers

bug in Pathfinder-128 dataset #38

Closed albertfgu closed 2 years ago

albertfgu commented 3 years ago

Hi,

I trained a baseline CNN model on the 4 provided pathfinder datasets (32x32, 64x64, 128x128, 256x256) and it achieved good results on 32x32, 64x64, and 256x256, but only random guessing on 128x128. This seems to indicate that the 128x128 dataset has a bug. In more detail:

I extracted the provided .gz which is organized like this:

lra_release/
  listops-1000/
  pathfinder128/
  lra_release/
    listops-1000/
    tsv_data/
    pathfinder32/
    pathfinder64/
    pathfinder128/
    pathfinder256/

I created a dataset for each of the 4 pathfinderXX folders using the same processing. More specifically, for each dataset I used the 200k "curve length 14" examples with a 160k/20k/20k split. I trained a basic ResNet18 model on each of these datasets; the pre-processing transforms and model were exactly the same in all cases. This ResNet achieved 80+% validation/test accuracy on each of the 32x32, 64x64, and 256x256 datasets. On the 128x128 dataset, after training for many epochs it achieved over 95% train accuracy, but never achieved more than random guessing on validation/test.
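For concreteness, here is a minimal sketch of this kind of setup (PyTorch). The metadata field indices are an assumption about the release format rather than anything official, so double-check them against your copy of the data:

# Hedged sketch: build a 160k/20k/20k split from one pathfinderXX folder and
# train the same ResNet18 on it. The metadata layout (image subdir, filename,
# label in the fourth field) is assumed, not part of the official pipeline.
from pathlib import Path

import torch
import torchvision
from PIL import Image
from torch.utils.data import Dataset, random_split


class PathfinderDataset(Dataset):
    def __init__(self, root):
        self.root = Path(root) / "curv_contour_length_14"
        self.samples = []
        for meta_file in sorted((self.root / "metadata").glob("*")):
            for line in meta_file.read_text().splitlines():
                fields = line.split()
                # fields[0]/fields[1]: relative image path; fields[3]: label (assumed)
                self.samples.append((self.root / fields[0] / fields[1], int(fields[3])))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = torchvision.transforms.functional.to_tensor(Image.open(path).convert("L"))
        return img, label


full = PathfinderDataset("lra_release/lra_release/pathfinder128")
n = len(full)
train, val, test = random_split(
    full,
    [int(0.8 * n), int(0.1 * n), n - int(0.8 * n) - int(0.1 * n)],
    generator=torch.Generator().manual_seed(0),
)  # 160k/20k/20k when n == 200k

# Identical model for every resolution: ResNet18 with a single-channel stem.
model = torchvision.models.resnet18(num_classes=2)
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)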

I have no idea what is causing the issue, but this seems like the data for Pathfinder128 must be different somehow. Any leads on the issue will be greatly appreciated.

albertfgu commented 3 years ago

One question in particular is why there are 2 provided pathfinder128/ data folders. Did the authors already find this bug, and release an updated version of the dataset in the higher level folder?

However, the top-level pathfinder128/ folder is organized differently from the inner 4. In particular it has far fewer images, so it doesn't seem correct either.

alexmathfb commented 3 years ago

Thanks for sharing. I'm currently looking into Pathfinder and will keep you updated if we figure out why this happens.

albertfgu commented 3 years ago

If it helps, some more observations about performance:

These observations seem to indicate that Pathfinder-128 is processed differently in a way that slows learning and prevents generalization entirely. One guess I had was that the labels were random; however, I manually looked at several images/labels in the data files and they seemed correct. I also can't see any difference in the image files between this dataset and the others.

Splend1d commented 2 years ago

@albertfgu Thank you for your experiments using ResNet on PathFinder-128, I learned a lot from them. Although I might be late to the party, I think the main reason for this might simply be that PathFinder-128 is significantly harder. Judging from the pictures, PathFinder-256 has sparse space between lines, and the padding of the image is also generous. I found a critical argument in the PathFinder generator, args.num_distractor_snakes in data/pathfinder.py, that highlights the difference. This argument takes a different value for each of the map sizes:

PathFinder32  -- 20/(14/3) = 4.29
PathFinder64  -- 22/(14/3) = 4.71
PathFinder128 -- 35/(14/3) = 7.50
PathFinder256 -- 30/(14/3) = 6.43

Therefore PathFinder128 has more distractor snakes than PathFinder256 while having less space (the line sizes are not scaled).
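As a quick sanity check on the arithmetic, these effective counts can be reproduced directly. The numerators are the values in data/pathfinder.py; dividing by 14/3 is my reading of the generator's scaling, not something the authors have confirmed:

# Effective distractor-snake counts per resolution, as quoted above.
# The division by 14/3 is an interpretation of the generator's scaling.
numerators = {32: 20, 64: 22, 128: 35, 256: 30}
for size, n in numerators.items():
    print(f"PathFinder{size}: {n} / (14/3) = {n / (14 / 3):.2f}")
# PathFinder32: 20 / (14/3) = 4.29
# PathFinder64: 22 / (14/3) = 4.71
# PathFinder128: 35 / (14/3) = 7.50
# PathFinder256: 30 / (14/3) = 6.43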

albertfgu commented 2 years ago

@Splend1d thanks for pointing this out. I went through and looked at pictures of examples from Pathfinder64, Pathfinder128, and Pathfinder256 and I agree that Pathfinder128 is much harder than 256 (visually - ignoring challenges of scale).

However, I am still not sure the data is correct, given the ResNet's gap of 98% train accuracy versus 50% test accuracy. The Pathfinder128 data is harder but not completely different from the other versions, and I don't know how to explain this complete lack of generalization.

With that said, assuming the data is correct, the dataset can still be argued to be "buggy" for several reasons.

albertfgu commented 2 years ago

Aside from the issues with the args.num_distractor_snakes argument for Pathfinder128 that @Splend1d pointed out, there seems to be another data generation oversight for Pathfinder256: the margins are bigger than in the other variants. I'm guessing this stems from the args.padding=1 flag, which is passed to Pathfinder32/64/128 but not to 256. https://github.com/google-research/long-range-arena/blob/09c2916c3f33a07347dcc70c8839957d3c9d4062/lra_benchmarks/data/pathfinder.py#L205
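As a quick way to check the margin difference empirically, here is a hedged sketch (numpy + PIL) that counts the blank border rows/columns of a single image; the sample path is a placeholder, so point it at any image from the respective folder:

# Count fully-blank rows/columns at each edge of a pathfinder image.
# The path below is a placeholder; substitute any image from
# pathfinder128/ or pathfinder256/ and compare the reported margins.
import numpy as np
from PIL import Image


def border_margins(img_path, threshold=0):
    arr = np.array(Image.open(img_path).convert("L"))
    rows = (arr > threshold).any(axis=1)   # True where a row contains any ink
    cols = (arr > threshold).any(axis=0)   # True where a column contains any ink
    return (int(rows.argmax()), int(rows[::-1].argmax()),
            int(cols.argmax()), int(cols[::-1].argmax()))  # top, bottom, left, right


print(border_margins("lra_release/lra_release/pathfinder256/curv_contour_length_14/imgs/0/sample_0.png"))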

albertfgu commented 2 years ago

To get around these issues, what I did in my experiments was take the Pathfinder256 data and apply mean pooling over 2x2 squares to bring it down to resolution 128. I originally thought this was more or less equivalent to Pathfinder128. More importantly, I felt it was still in the spirit of the task, and I checked that the original Transformer variants do not make progress on it.
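For concreteness, the pooling step is just this (PyTorch; imgs is assumed to be a batch of single-channel 256x256 Pathfinder256 images):

# Non-overlapping 2x2 mean pooling: 256x256 -> 128x128,
# i.e. a flattened sequence length of 128*128 = 16384.
import torch
import torch.nn.functional as F


def downsample_256_to_128(imgs: torch.Tensor) -> torch.Tensor:
    # imgs: (batch, 1, 256, 256) float tensor
    return F.avg_pool2d(imgs, kernel_size=2, stride=2)


x = torch.rand(8, 1, 256, 256)          # stand-in batch
print(downsample_256_to_128(x).shape)   # torch.Size([8, 1, 128, 128])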

In light of the above data generation issues found in Pathfinder128 and Pathfinder256, I feel less comfortable with this argument. I think this issue seriously needs the authors @vanzytay @MostafaDehghani to step in. We're at a point where people are using the LRA benchmark more extensively and where models are beginning to be able to handle Path-X, so a discussion about this dataset is necessary.

MostafaDehghani commented 2 years ago

Thanks for opening the issue and the discussion. First of all, we are aware of the difficulty of PathFinder-128 and we know that it's much more challenging than PathFinder-256. This is the reason that we use it as Path-X (instead of using PathFinder-256) in LRA.

We generated many, many variants of PathFinder and ran a lot of experiments with different model classes besides Transformers (including ResNet), and we decided to include two of the setups we had for PathFinder in LRA: one that is not hard to make progress on, and one that almost all of our models struggle to generalize on. We had a lot of internal discussion and decided to add Path-X as an official LRA task to motivate a jump beyond the usual paradigms we were seeing in ideas for making Transformers more efficient. Also, I would like to say that I totally disagree with @albertfgu on:

[...] ultimately, I would argue that a sequential image classification dataset that a 2D ResNet cannot solve does not seem like a reasonable dataset.

As a matter of fact, PathFinder only becomes interesting when a 2D CNN-based model fails on it, simply because CNNs struggle to model transitivity and lack a direct global receptive field, which are probably key abilities for a model that can solve PathFinder. We wanted to see a new model with inductive biases that help it pick up a solution in such a setup. So the config for generating PathFinder128 is designed in a way that a ResNet fails.

In the end, I want to add that although Path-X is extremely difficult, it seems there are new methods that are able to find a generalizable solution for it, and such a development is really exciting for us to see. Given that we know how hard this task is, we are impressed to see any progress on it.

albertfgu commented 2 years ago

Thanks for the response! The clarification around some of the design decisions is very helpful. This still leaves me with several questions:

  1. The fact that the number of snakes is 20/22/35/30 instead of 20/22/25/30 still seems odd. Also, the 35 number still doesn't match the actual number of snakes in Pathfinder128. Could you confirm that the actual number of snakes (which seems to be 70) was intentional?
  2. Overall it seems that you're saying you purposely made Pathfinder128 much harder than Pathfinder256. Could you clarify why you made this design choice instead of the more straightforward one of having Pathfinder 32/64/128/256 increase in difficulty and choosing Pathfinder256 as Path-X?
  3. Above you said that the choice of including this particular dataset is to test generalization, implying that the baseline models do achieve above random train accuracy, but random test accuracy.

I would like to clarify whether this is the case: did any of your xformer variants achieve above-random train accuracy? I thought the answer was "no", based on my own experiments as well as indicators in the paper.

If it is the case that xformer baselines do learn on the train split, but not test, then I feel that adding this discussion about generalization to the LRA paper would substantially clarify the design choice for future researchers, and obviate the confusion raised in this thread.

If it is the case that xformer baselines do not learn on the train split, then I admit I don't quite understand why the benchmark needs such a large jump in difficulty. Towards understanding long-range dependencies, in my opinion the first question should be whether or not methods can model anything at all on sequences of length 16k (or 64k), and then the follow-up question is the one of generalization and inductive bias.

Towards this goal, a reasonable first step would be including a simpler Path-128 task of the same length, where all xformer baselines still fail to learn during training, but ResNets do solve it. Then a harder version can be included where ResNets train but do not generalize.

Ultimately, if it is true that the current version of Path-X was chosen to be so hard that even 2D ResNets cannot solve it, I think that's worth highlighting in the paper. The current language simply poses it as a longer sequence task:

This is an interesting litmus test to see if the same algorithmic challenges bear a different extent of difficulty when sequence lengths are much longer.

Given this language, it is reasonable to expect that this is the exact same setup as the other PathFinders but with longer sequences, not a drastically harder version that even ResNets can't solve, which conflates generalization challenges with the stated algorithmic challenges.

Thanks again for continuing the discussion.