snel-repo / neural-data-transformers

How to prepare NLB data for NDT? #3

Closed HilbertHuangHitomi closed 3 years ago

HilbertHuangHitomi commented 3 years ago

I have successfully followed nlb_tools to read the NLB datasets, but I noticed that NDT needs h5 files that are not the same as the h5 files I save. How should I prepare the datasets I downloaded from DANDI for running NDT?

joel99 commented 3 years ago

cc: @felixp8 -- what's the difference between the files that you handed off and the ones that the latest nlb_tools produces?

felixp8 commented 3 years ago

I don't believe I've changed the file format since then. It looks like this function expects the keys to be 'train_data_heldin', etc., though nlb_tools uses 'train_spikes_heldin' and so on. Did you change those manually at some point @joel99?
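
For illustration, one way to bridge the naming mismatch would be a small rename pass over the h5 keys when converting the file; a rough sketch only (the helper name and flat key layout are assumptions, not from the repo):

    import h5py

    # Rough sketch: copy an nlb_tools-style h5 file, renaming '*_spikes_*' keys
    # to the '*_data_*' names that the loader expects.
    def rename_nlb_keys(path_in, path_out):
        with h5py.File(path_in, 'r') as fin, h5py.File(path_out, 'w') as fout:
            for key in fin.keys():
                fout.create_dataset(key.replace('spikes', 'data'), data=fin[key][()])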

HilbertHuangHitomi commented 3 years ago

Here's my working procedure.

  1. modify the key names from XXX_data_XXXX to XXX_spikes_XXXX in src/dataset.py:

    if 'eval_spikes_heldin' in h5dict: # NLB data
        get_key = lambda key: h5dict[key].astype(np.float32)
        train_data = get_key('train_spikes_heldin')
        train_data_fp = get_key('train_spikes_heldin_forward')
        train_data_heldout_fp = get_key('train_spikes_heldout_forward')
        train_data_all_fp = np.concatenate([train_data_fp, train_data_heldout_fp], -1)
        valid_data = get_key('eval_spikes_heldin')
        train_data_heldout = get_key('train_spikes_heldout')
        if 'eval_spikes_heldout' in h5dict:
            valid_data_heldout = get_key('eval_spikes_heldout')
        else:
            valid_data_heldout = np.zeros((valid_data.shape[0], valid_data.shape[1], train_data_heldout.shape[2]), dtype=np.float32)
        if 'eval_spikes_heldin_forward' in h5dict:
            valid_data_fp = get_key('eval_spikes_heldin_forward')
            valid_data_heldout_fp = get_key('eval_spikes_heldout_forward')
            valid_data_all_fp = np.concatenate([valid_data_fp, valid_data_heldout_fp], -1)
        else:
            valid_data_all_fp = np.zeros(
                (valid_data.shape[0], train_data_fp.shape[1], valid_data.shape[2] + valid_data_heldout.shape[2]), dtype=np.float32
            )

        # NLB data does not have ground truth rates
        if mode == DATASET_MODES.train:
            return train_data, None, train_data_heldout, train_data_all_fp
        elif mode == DATASET_MODES.val:
            return valid_data, None, valid_data_heldout, valid_data_all_fp
  2. use nlb_tools to read the NWB data and save it as h5 with something like the following (the `dataset` object is an nlb_tools NWBDataset; a loading sketch is given at the end of this comment):
    train_dict = make_train_input_tensors(
        dataset,
        dataset_name='mc_maze_small',
        trial_split='train',
        include_behavior=True,
        include_forward_pred=True,
    )
    eval_dict = make_eval_input_tensors(
        dataset,
        dataset_name='mc_maze_small',
        trial_split='val',
    )
  3. merge them with:
    data_dict = {
        'eval_spikes_heldin'           : eval_dict['eval_spikes_heldin'],
        'eval_spikes_heldout'          : eval_dict['eval_spikes_heldout'],
        'train_spikes_heldin'          : train_dict['train_spikes_heldin'],
        'train_spikes_heldout'         : train_dict['train_spikes_heldout'],
        'train_behavior'               : train_dict['train_behavior'],
        'train_spikes_heldin_forward'  : train_dict['train_spikes_heldin_forward'],
        'train_spikes_heldout_forward' : train_dict['train_spikes_heldout_forward'],
    }
    save_to_h5(data_dict, os.path.join('./data', 'mc_maze_small.h5'))
  4. specify the data path in ./configs/mc_maze_small.yaml:
    DATA:
      DATAPATH: "./data"
      TRAIN_FILENAME: "mc_maze_small.h5"
      VAL_FILENAME: "mc_maze_small.h5"

    However, I got the following issue:

    removing ./Results/logs/mc_maze_small
    2021-10-14 09:18:01,907 Using 1 GPUs
    2021-10-14 09:18:01,946 Using cuda:1
    2021-10-14 09:18:01,946 Loading mc_maze_small.h5 in train
    2021-10-14 09:18:02,155 Clipping all spikes to 7.
    2021-10-14 09:18:02,155 Training on 75 samples.
    2021-10-14 09:18:02,156 Loading mc_maze_small.h5 in val
    2021-10-14 09:18:10,835 number of trainable parameters: 682538
    0%|          | 0/50501 [00:00<?, ?it/s]
    /opt/conda/conda-bld/pytorch_1587428091666/work/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add_ is deprecated:
        add_(Number alpha, Tensor other)
    Consider using one of the following signatures instead:
        add_(Tensor other, *, Number alpha)
    0%|          | 0/50501 [00:01<?, ?it/s]
    Traceback (most recent call last):
      File "src/run.py", line 144, in <module>
        main()
      File "src/run.py", line 58, in main
        run_exp(**vars(args))
      File "src/run.py", line 137, in run_exp
        runner.train()
      File "/home/username/Projects/neural-data-transformers/src/runner.py", line 341, in train
        metrics = self.train_epoch()
      File "/home/username/Projects/neural-data-transformers/src/runner.py", line 482, in train_epoch
        eval_r2 = self.neuron_r2(rates, pred_rates)
      File "/home/username/Projects/neural-data-transformers/src/runner.py", line 749, in neuron_r2
        gt, pred = self._clean_rates(gt, pred, **kwargs)
      File "/home/username/Projects/neural-data-transformers/src/runner.py", line 737, in _clean_rates
        raise Exception(f"Incompatible r2 sizes, GT: {gt.size()}, Pred: {pred.size()}")
    Exception: Incompatible r2 sizes, GT: torch.Size([25, 35, 107]), Pred: torch.Size([25, 45, 142])
  5. since NLB datasets have no ground-truth rates, I commented out the following in src/runner.py:
    #eval_r2 = self.neuron_r2(rates, pred_rates)
    #metrics_dict['eval_r2'] = eval_r2

    Now it seems to run smoothly.
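
For completeness, here is a rough sketch of how the `dataset` object used in step 2 can be prepared with nlb_tools (the local DANDI path below is a placeholder for mc_maze_small; adjust it to your own download):

    from nlb_tools.nwb_interface import NWBDataset

    # Rough sketch (path is a placeholder): load the DANDI download for
    # mc_maze_small and rebin spikes to 5 ms, the bin width used by the NLB benchmark.
    datapath = './data/000140/'  # hypothetical local path to the mc_maze_small dandiset
    dataset = NWBDataset(datapath)
    dataset.resample(5)

The make_train_input_tensors and make_eval_input_tensors calls in step 2 then operate on this dataset object.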