Closed mazabou closed 3 years ago
It seems like the spike count you are embedding is too large for the embedding layer. I'm not sure why it's failing on these datasets, but you should see that `src.max() > max_spikes + 2` (set around line 174). Could you let me know if this is the case, and make sure what's being fed into the model is in `range(0, max_spikes)`?
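A quick way to run this check is a hypothetical sketch like the one below; `max_spikes` and the tensor shape here are assumptions for illustration, not the repo's actual code:

```python
import torch

# Hypothetical sanity check: the embedding table described above has
# max_spikes + 2 rows, so every spike count must be strictly below that.
max_spikes = 23                                  # assumed value for this sketch
src = torch.randint(0, max_spikes, (2, 50, 29))  # toy batch of spike counts

num_embeddings = max_spikes + 2
assert src.max().item() < num_embeddings, (
    f"src.max()={src.max().item()} is out of range for an "
    f"embedding with {num_embeddings} rows"
)
print("in range:", src.max().item(), "<", num_embeddings)
```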
Just wanted to mention that I was not able to replicate this error. Steps I took:

- Fresh conda environment via `conda env create -f environment.yml`
- Activated the new environment via `conda activate pytorch`
- Fresh clone of the repository
- Made a copy of `configs/arxiv/lorenz.yaml` to `configs/my_config.yaml`
- Edited the `CHECKPOINT_DIR` specification (first line)
- (I didn't need to do this, but one would also want to edit the `DATAPATH` specification)
- `./scripts/train.sh my_config`

Then trained for a few hundred epochs with no issues. So more details would probably be required to address this.
Thank you both!
I reran with a fresh environment and a fresh clone of the repository just in case, but still got the same error. I think the issue might be in the data file I am using?

I followed the same steps except that I changed `DATAPATH`, `TRAIN_FILENAME`, and `TEST_FILENAME` to use the provided `lfads_lorenz.h5` file. This is how the config file was changed:

```yaml
CHECKPOINT_DIR: "runs/"
DATA:
  DATAPATH: "/data"
  TRAIN_FILENAME: 'lfads_lorenz.h5'
  VAL_FILENAME: 'lfads_lorenz.h5'
```
`max_spikes` is equal to 23.

I added these lines in `src/model.py` after line 292:

```python
def forward(self, src, mask_labels, **kwargs):
    print('input src.max', src.max())
    src = src.permute(1, 0, 2)  # t x b x n
    src = self.embedder(src) * self.scale
    print('embedder output src.max', src.max())
    src = self.pos_encoder(src)
```

And this is what I get:

```
input src.max 17
embedder output src.max tensor(0.7336, grad_fn=<MaxBackward1>)
```
I used a docker image on a different machine, so this should be independent from any of my setups:

```bash
docker run --rm -ti continuumio/anaconda3 /bin/bash
```

Here I install some requirements, clone the repo, and create the conda env:

```bash
apt-get update
apt-get install gcc python3-dev vim
# install git lfs
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
git lfs install
cd home/
git lfs clone https://github.com/snel-repo/neural-data-transformers.git
cd neural-data-transformers/
conda env create -f environment.yml
export PYTHONPATH=.
```
Then I modify the Lorenz configuration file as follows:

```yaml
CHECKPOINT_DIR: "runs/"
DATA:
  DATAPATH: "/data"
  TRAIN_FILENAME: 'lfads_lorenz.h5'
  VAL_FILENAME: 'lfads_lorenz.h5'
```

and run the training:

```bash
vim configs/arxiv/lorenz.yaml
./scripts/train.sh arxiv/lorenz
```
I also tried generating the autonomous chaotic RNN data (the `generate_chaotic_rnn_data.py` file required some changes because the tensorflow version in `environment.yml` does not work, so I changed `import tensorflow as tf` to `import tensorflow.compat.v1 as tf`, and also converted `ndatasets` to an int on line 133). Then I am able to generate the data:

```bash
cd data/
vim chaotic_rnn/gen_synth_data_no_inputs.sh
vim chaotic_rnn/generate_chaotic_rnn_data.py
./chaotic_rnn/gen_synth_data_no_inputs.sh
cd ..
```
Again, I modify the configuration file to this:

```yaml
CHECKPOINT_DIR: "runs/"
DATA:
  DATAPATH: "data/chaotic_rnn/data_no_inputs/"
  TRAIN_FILENAME: 'chaotic_rnn_no_inputs_dataset_N50_S50'
  VAL_FILENAME: 'chaotic_rnn_no_inputs_dataset_N50_S50'
```

And run:

```bash
vim configs/arxiv/chaotic.yaml
./scripts/train.sh arxiv/chaotic
```
And I get the same error:

```
(pytorch) root@2f8792cf3485:/home/neural-data-transformers# ./scripts/train.sh arxiv/chaotic
logs/chaotic exists
removing logs/chaotic
2021-04-21 17:12:16,250 Using cpu
2021-04-21 17:12:16,250 Loading chaotic_rnn_no_inputs_dataset_N50_S50 in train
/home/neural-data-transformers/src/dataset.py:47: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  rates = torch.tensor(rates)
2021-04-21 17:12:16,380 Clipping all spikes to 9.
2021-04-21 17:12:16,380 Training on 1040 samples.
2021-04-21 17:12:16,380 Loading chaotic_rnn_no_inputs_dataset_N50_S50 in val
2021-04-21 17:12:16,496 number of trainable parameters: 415040
Traceback (most recent call last):
  File "src/run.py", line 144, in <module>
    main()
  File "src/run.py", line 58, in main
    run_exp(**vars(args))
  File "src/run.py", line 137, in run_exp
    runner.train()
  File "/home/neural-data-transformers/src/runner.py", line 335, in train
    metrics = self.train_epoch()
  File "/home/neural-data-transformers/src/runner.py", line 390, in train_epoch
    return_outputs=False,
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/neural-data-transformers/src/model.py", line 294, in forward
    src = self.pos_encoder(src)
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/neural-data-transformers/src/model.py", line 355, in forward
    x = x + self.pos_embedding(self.pe) # t x 1 x d
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/opt/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
```
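For context, this `IndexError` is the generic symptom of an out-of-range index in `nn.Embedding`. A minimal standalone repro (not the repo's code) looks like this:

```python
import torch
import torch.nn as nn

# An embedding with N rows only accepts indices 0..N-1;
# looking up index N (or higher) raises IndexError.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

ok = emb(torch.tensor([0, 9]))   # in range: returns a (2, 4) tensor
try:
    emb(torch.tensor([10]))      # out of range
except IndexError as e:
    print("IndexError:", e)
```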
Is there maybe a mismatch between the data files I am using and the configuration files?
Thank you for all this helpful info @mazabou. Could you check if your `data/lfads_lorenz.h5` file is a binary file, or text? If the latter, this is likely an issue with `git lfs` not being installed. We can add the relevant instructions now, but at a high level:

- the conda environment needs `git-lfs`
- inside the git repo you need to do `git lfs install` and `git lfs pull`

Once I did that I was able to train a network based on the `data/lfads_lorenz.h5` file, otherwise following the same steps I mentioned above. Could you check this out and get back to us?
Actually, I apologize. I see in your more detailed second set of steps that you did install git lfs.

Hm, I apologize as I misread the initial issue. Could you set `_C.MODEL.POSITION.OFFSET = False`? I think as is, the model embeds [1:T], but we might want [0:T-1].
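To illustrate the off-by-one: if the positional table has T rows, indexing positions 1..T overflows while 0..T-1 stays in range. A sketch under assumptions from the traceback above (the repo's actual table size and variable names may differ):

```python
import torch
import torch.nn as nn

T = 50                              # assumed sequence length
pos_embedding = nn.Embedding(T, 8)  # T rows: valid indices are 0..T-1

pe_no_offset = torch.arange(T)      # [0:T-1] -> max index T-1, in range
x = pos_embedding(pe_no_offset)     # works

pe_offset = torch.arange(1, T + 1)  # [1:T] -> max index T, out of range
try:
    pos_embedding(pe_offset)
except IndexError:
    print("offset positions overflow the embedding table")
```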
Yes, this was it; it works after setting `_C.MODEL.POSITION.OFFSET = False`.

Thank you both for your help!!
Thank you for the awesome paper and for sharing the code. I tried running `./scripts/train.sh` on both the Lorenz dataset (`./data/lfads_lorenz.h5`) and the autonomous chaotic RNN dataset (generated via the script), but I get an `IndexError: index out of range in self` for both, raised by the embedding lookup in `x = x + self.pos_embedding(self.pe)`. I used Python 3.6.10 and Pytorch 1.5.0. Do you happen to know what is causing this error? Thank you.