leelabcnbc / thesis-yimeng-v2

good parts of thesis-yimeng-v1, better refactoring.

thesis-yimeng-v2

This file is a revised version of the original documentation.

The current version only covers reproducing the results in Chapter 4 of the thesis.

$ROOT refers to the repository root.

for CMUers

Everything can be found in the following places on the mind cluster.

data

The raw 8K data can be found in yuanyuan_8k_neural.hdf5 and yuanyuan_8k_images.hdf5 under /user_data/yimengzh/thesis-yimeng-v2/results/datasets/raw. These two files contain recordings from six days; we used data from three of them. They were generated from source MATLAB files, which were in turn generated from raw recording data. This repo contains scripts to convert the MATLAB files into the above HDF5 format, as described below; the raw MATLAB files were produced by spike sorting plus format conversion, and Summer/Yuanyuan should know more about that process.
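As a quick sanity check, HDF5 files like these can be inspected with h5py. The snippet below is a minimal sketch; the internal dataset layout of the real files is an assumption here, so the demo runs on a small synthetic file instead.

```python
import os
import tempfile

import h5py
import numpy as np

def list_hdf5_datasets(path):
    """Map every dataset in an HDF5 file to its shape."""
    contents = {}

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            contents[name] = obj.shape

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return contents

# Demo on a small synthetic file; for the real data, point the function
# at yuanyuan_8k_neural.hdf5 / yuanyuan_8k_images.hdf5 instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("day1/neural", data=np.zeros((100, 50), dtype=np.float32))

print(list_hdf5_datasets(demo_path))  # {'day1/neural': (100, 50)}
```

Listing dataset names and shapes this way is a cheap check that a download or conversion produced what you expect before starting any training run.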

The raw NS 2250 data can be found in /user_data/yimengzh/gaya-data/data/tang/batch/final/tang_neural.npy and /user_data/yimengzh/gaya-data/data/tang/images/all_imags.npy. Hal knows more about how these NumPy files were generated from the raw recording data.
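For large .npy arrays like these, np.load with mmap_mode avoids pulling the whole file into memory. A minimal sketch on a synthetic array (the real files' shapes and dtypes are not documented here, so the ones below are placeholders):

```python
import os
import tempfile

import numpy as np

# Stand-in for the real cluster files (tang_neural.npy / all_imags.npy).
path = os.path.join(tempfile.mkdtemp(), "toy_neural.npy")
np.save(path, np.zeros((10, 5), dtype=np.float32))

# mmap_mode="r" memory-maps the array instead of loading it eagerly,
# which matters for multi-gigabyte recordings.
arr = np.load(path, mmap_mode="r")
print(arr.shape, arr.dtype)  # (10, 5) float32
```

Slicing a memory-mapped array (e.g. `arr[:100]`) reads only the touched pages from disk, so quick inspections stay fast even on the full-size files.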

dependencies

reproduce results

The steps should work on the CNBC cluster (mind) and will also work on a single machine with some small adaptations.

All the actual computation is done inside the Singularity container.

preprocess neural data

ImageNet 8K

  1. First, download the ImageNet 8K data by running the following command OUTSIDE the container.
    $ROOT/setup_private_data.sh
  2. Then run the following inside the container.
    python $ROOT/scripts/preprocessing/raw_data.py
    python $ROOT/scripts/preprocessing/prepared_data.py

NS 2250

Ask Hal about it. This code repo uses Hal's code under the hood to obtain the data.

model training

All commands should be run outside the container, in a basic Python 3.6+ environment with no additional dependencies. On the CNBC cluster, such an environment can be set up with scl enable rh-python36 bash.
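A quick way to confirm the environment meets the 3.6+ requirement before launching jobs (the scl line applies only on the CNBC cluster):

```shell
# On the CNBC cluster, first enter the Python 3.6 environment:
#   scl enable rh-python36 bash
# Then verify the interpreter version:
python3 -c 'import sys; assert sys.version_info >= (3, 6), sys.version'
echo "python version ok"
```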

main models (recurrent and feed-forward, no ablation)

ImageNet 8K

Run the following files under $ROOT/scripts/training/yuanyuan_8k_a_3day/maskcnn_polished_with_rcnn_k_bl. Taken together, these files may train some extra models, but they form the minimal set required to cover all models used in the paper.

NS 2250

Run the following files under $ROOT/scripts/training/gaya/maskcnn_polished_with_rcnn_k_bl. Taken together, these files may train some extra models, but they form the minimal set required to cover all models used in the paper.

multi-path models that correspond to recurrent models

Only 8/16/32 ch models were considered; higher channel counts run out of memory (OOM) more often, making the results less useful.

ImageNet 8K

Run the following files under $ROOT/scripts/training/yuanyuan_8k_a_3day/maskcnn_polished_with_rcnn_k_bl. Taken together, these files may train some extra models, but they form the minimal set required to cover all models used in the paper.

NS 2250

Run the following files under $ROOT/scripts/training/gaya/maskcnn_polished_with_rcnn_k_bl. Taken together, these files may train some extra models, but they form the minimal set required to cover all models used in the paper.

ablated multi-path models

Only 16/32 ch, 2 L models trained on all data were considered, as these models had the lowest memory requirements and matched the recurrent models best.

ImageNet 8K

Run the following files under $ROOT/scripts/training/yuanyuan_8k_a_3day/maskcnn_polished_with_rcnn_k_bl. Taken together, these files may train some extra models, but they form the minimal set required to cover all models used in the paper.

NS 2250

Run the following files under $ROOT/scripts/training/gaya/maskcnn_polished_with_rcnn_k_bl. Taken together, these files may train some extra models, but they form the minimal set required to cover all models used in the paper.

plots

Check the files under results_thesis.