snel-repo / neural-data-transformers

The Unlicense

Index out of range when calling the positional encoder #1

Closed mazabou closed 3 years ago

mazabou commented 3 years ago

Thank you for the awesome paper and for sharing the code.

I tried running ./scripts/train.sh on both the Lorenz dataset (./data/lfads_lorenz.h5) and the autonomous chaotic RNN dataset (generated via the script) but I get this error for both:

    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

which is called by this line:

neural-data-transformers/src/model.py", line 294, in forward
    src = self.pos_encoder(src)

I used Python 3.6.10 and PyTorch 1.5.0. Do you happen to know what is causing this error? Thank you.

joel99 commented 3 years ago

It seems like the spike count you are embedding is too large for the embedding layer. I'm not sure why it's failing on these datasets, but you should check whether src.max() > max_spikes + 2 (set around line 174). Could you let me know if this is the case, and make sure what's being fed into the model is in range(0, max_spikes)?
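A quick sanity check along these lines (a sketch only; the helper name and the max_spikes + 2 table size come from the comment above, not from the repository code):

```python
import numpy as np

def spikes_in_range(spikes, max_spikes):
    # Assumption from the discussion above: the count-embedding table has
    # max_spikes + 2 rows, so valid indices are 0 .. max_spikes + 1.
    return int(spikes.min()) >= 0 and int(spikes.max()) <= max_spikes + 1

# Counts clipped to max_spikes can never overrun the table.
counts = np.clip(np.random.poisson(3.0, size=(50, 100)), 0, 23)
assert spikes_in_range(counts, max_spikes=23)
```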

cpandar commented 3 years ago

Just wanted to mention that I was not able to replicate this error. Steps I took:

- Fresh conda environment via conda env create -f environment.yml
- Activated the new environment via conda activate pytorch
- Fresh clone of the repository
- Made a copy of configs/arxiv/lorenz.yaml to configs/my_config.yaml
- Edited the CHECKPOINT_DIR specification (first line); I didn't need to do this, but one would also want to edit the DATAPATH specification
- ./scripts/train.sh my_config

Then trained for a few hundred epochs with no issues.

So more details would probably be required to address this.

mazabou commented 3 years ago

Thank you both! I reran with a fresh environment and a fresh clone of the repository just in case, but still got the same error. I think the issue might be in the data file I am using. I followed the same steps, except that I changed DATAPATH, TRAIN_FILENAME, and VAL_FILENAME to use the provided lfads_lorenz.h5 file. This is how the config file was changed:

CHECKPOINT_DIR: "runs/"
DATA:
  DATAPATH: "/data"
  TRAIN_FILENAME: 'lfads_lorenz.h5'
  VAL_FILENAME: 'lfads_lorenz.h5'

max_spikes is equal to 23. I added these print statements in src/model.py, around line 292:

    def forward(self, src, mask_labels, **kwargs):
        print('input src.max', src.max())
        src = src.permute(1, 0, 2) # t x b x n
        src = self.embedder(src) * self.scale
        print('embedder output src.max', src.max())
        src = self.pos_encoder(src)

And this is what I get:

input src.max 17
embedder output src.max tensor(0.7336, grad_fn=<MaxBackward1>)

mazabou commented 3 years ago

I used a docker image on a different machine, so this should be independent of any of my setups:

docker run --rm -ti continuumio/anaconda3 /bin/bash

Here I install some requirements, clone the repo and create the conda env

apt-get update
apt-get install gcc python3-dev vim

# install git lfs
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
git lfs install

cd home/
git lfs clone https://github.com/snel-repo/neural-data-transformers.git
cd neural-data-transformers/
conda env create -f environment.yml
export PYTHONPATH=.

Then I modify the lorenz configuration file as follows:

CHECKPOINT_DIR: "runs/"
DATA:
  DATAPATH: "/data"
  TRAIN_FILENAME: 'lfads_lorenz.h5'
  VAL_FILENAME: 'lfads_lorenz.h5'

and run the training

vim configs/arxiv/lorenz.yaml
./scripts/train.sh arxiv/lorenz

I also tried generating the autonomous chaotic RNN data. The generate_chaotic_rnn_data.py file required some changes, because the tensorflow version in environment.yml does not work: I changed import tensorflow as tf to import tensorflow.compat.v1 as tf, and converted ndatasets to an int on line 133. After that I am able to generate the data:

cd data/
vim chaotic_rnn/gen_synth_data_no_inputs.sh
vim chaotic_rnn/generate_chaotic_rnn_data.py 
./chaotic_rnn/gen_synth_data_no_inputs.sh 
cd ..

Again, I modify the configuration file to this:

CHECKPOINT_DIR: "runs/"
DATA:
  DATAPATH: "data/chaotic_rnn/data_no_inputs/"
  TRAIN_FILENAME: 'chaotic_rnn_no_inputs_dataset_N50_S50'
  VAL_FILENAME: 'chaotic_rnn_no_inputs_dataset_N50_S50'

And run

vim configs/arxiv/chaotic.yaml  
./scripts/train.sh arxiv/chaotic

And I get the same error:

(pytorch) root@2f8792cf3485:/home/neural-data-transformers# ./scripts/train.sh arxiv/chaotic
logs/chaotic exists
removing logs/chaotic
2021-04-21 17:12:16,250 Using cpu
2021-04-21 17:12:16,250 Loading chaotic_rnn_no_inputs_dataset_N50_S50 in train
/home/neural-data-transformers/src/dataset.py:47: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  rates = torch.tensor(rates)
2021-04-21 17:12:16,380 Clipping all spikes to 9.
2021-04-21 17:12:16,380 Training on 1040 samples.
2021-04-21 17:12:16,380 Loading chaotic_rnn_no_inputs_dataset_N50_S50 in val
2021-04-21 17:12:16,496 number of trainable parameters: 415040
Traceback (most recent call last):
  File "src/run.py", line 144, in <module>
    main()
  File "src/run.py", line 58, in main
    run_exp(**vars(args))
  File "src/run.py", line 137, in run_exp
    runner.train()
  File "/home/neural-data-transformers/src/runner.py", line 335, in train
    metrics = self.train_epoch()
  File "/home/neural-data-transformers/src/runner.py", line 390, in train_epoch
    return_outputs=False,
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/neural-data-transformers/src/model.py", line 294, in forward
    src = self.pos_encoder(src)
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/neural-data-transformers/src/model.py", line 355, in forward
    x = x + self.pos_embedding(self.pe) # t x 1 x d
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Is there maybe a mismatch between the data files I am using and the configuration files?

cpandar commented 3 years ago

Thank you for all this helpful info, @mazabou. Could you check whether your data/lfads_lorenz.h5 file is binary or plain text?

If the latter, this is likely an issue with git lfs not being installed. We can add the relevant instructions now, but at a high level:

- the conda environment needs git-lfs
- inside the git repo, you need to run git lfs install and git lfs pull

Once I did that I was able to train a network based on the data/lfads_lorenz.h5 file, otherwise following the same steps I mentioned above. Could you check this out and get back to us?
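One way to make this check concrete (a hypothetical helper, not part of the repository): a real HDF5 file begins with an 8-byte magic number, while a git-lfs pointer is a small text file.

```python
import os
import tempfile

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

def is_real_hdf5(path):
    # A git-lfs pointer file is plain text starting with
    # "version https://git-lfs.github.com/spec/v1" instead of the HDF5 magic.
    with open(path, "rb") as f:
        return f.read(8) == HDF5_MAGIC

# Demo with a fake pointer file: is_real_hdf5 returns False,
# which would mean `git lfs pull` is still needed.
with tempfile.NamedTemporaryFile("wb", suffix=".h5", delete=False) as f:
    f.write(b"version https://git-lfs.github.com/spec/v1\n")
assert not is_real_hdf5(f.name)
os.remove(f.name)
```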

cpandar commented 3 years ago

Actually, I apologize. I see in your more detailed second set of steps that you did install git lfs.

joel99 commented 3 years ago

Hm, I apologize; I misread the initial issue. Could you set _C.MODEL.POSITION.OFFSET = False? I think that, as is, the model embeds positions [1:T], but we might want [0:T-1].
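The off-by-one can be sketched with a plain lookup table standing in for the positional nn.Embedding (an illustration, not the repository's code): a table with T rows only accepts indices 0 through T-1, so embedding positions 1 through T overruns by one, matching the IndexError in the traceback.

```python
T = 5
pos_table = [f"row{i}" for i in range(T)]   # stands in for nn.Embedding(T, d)

def embed(indices):
    return [pos_table[i] for i in indices]

embed(list(range(0, T)))          # OFFSET = False: indices 0..T-1, all in range
try:
    embed(list(range(1, T + 1)))  # OFFSET = True: indices 1..T, last one overruns
except IndexError:
    print("index out of range")   # the same failure mode as in the traceback
```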

mazabou commented 3 years ago

Yes this was it, it works after setting _C.MODEL.POSITION.OFFSET = False. Thank you both for your help!!