ma-compbio / Higashi

single-cell Hi-C, scHi-C, Hi-C, 3D genome, nuclear organization, hypergraph
MIT License

About parameter neighbor_num #8

Closed zengguangjie closed 3 years ago

zengguangjie commented 3 years ago

Hi, I'm a little confused about the neighbor_num parameter in the JSON configuration file. As I understand it, neighbor_num controls the number of neighboring cells incorporated during imputation. The wiki page Higashi-Usage, Step 3, says that

  1. Train Higashi with cell-dependent GNN, but with k=0

but the wiki in Output-of-Higashi-main says that

The {k} can be either 1 or the {neighbor_num} parameter specified in the configuration file. When {k}=1, it represents the imputation results without using any neighboring cell information.

which does not agree with the Higashi-Usage wiki page or the paper. Is the wording in the Output-of-Higashi-main section a mistake? Can I treat the file

{chromname}{embedding_name}_nbr_1_impute.hdf5

that I get as the 0-neighbor (k=0) result described in the paper?

Another question: when nbr=0, does the hypergraph built for the GNN use only the information from a single cell when imputing that cell?

Thanks!

ruochiz commented 3 years ago

Thanks for your interest in our work and for bringing this problem to my attention. The description of the parameter k is inconsistent. Currently, neighbor_num=1 is equivalent to the k=0 described in the paper, and "{chromname}{embedding_name}_nbr_1_impute.hdf5" holds the 0-neighbor results.

And yes, when $k=0$, the information comes from only one cell.

I'll update the code and the wiki to fix this very soon.

We will also release a major update of Higashi very soon, which includes improved embedding and imputation results, runtime optimization, etc. Feel free to try the new version when it's released.
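The mapping described in this reply can be written down as a small helper. This is only a sketch: the function names `paper_k` and `impute_filename` are hypothetical, and the general rule k = neighbor_num - 1 is an assumption extrapolated from the single case confirmed above (neighbor_num=1 corresponds to k=0). The file name pattern is the one quoted from the wiki.

```python
def paper_k(neighbor_num: int) -> int:
    """Translate the config's neighbor_num into the paper's k.

    Assumption: the offset of one confirmed for neighbor_num=1 (k=0)
    holds for other values as well.
    """
    if neighbor_num < 1:
        raise ValueError("neighbor_num must be at least 1")
    return neighbor_num - 1


def impute_filename(chromname: str, embedding_name: str, neighbor_num: int) -> str:
    """Build the output file name using the pattern quoted from the wiki."""
    return f"{chromname}{embedding_name}_nbr_{neighbor_num}_impute.hdf5"


# Placeholder values for illustration only.
name = impute_filename("chr1", "_exp", 1)
print(name, "-> k =", paper_k(1))
```

Under this reading, the `_nbr_1_` file is the one to compare against the paper's 0-neighbor results.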

zengguangjie commented 3 years ago

Thank you for your reply! I've run imputation on a dataset containing multiple cells, and the nbr=0 imputation result turns out great. Since nbr=0 means imputation needs no information from other cells, I wanted to find out what happens if I impute one cell at a time. I tried this, but I got an exception:

Traceback (most recent call last):
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 867, in <module>
    embeddings_initial, attribute_dict, targets_initial = generate_attributes()
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 496, in generate_attributes
    a = np.load(os.path.join(temp_dir, "%s_cell_PCA.npy" % c))
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/gs/home/zengguangjie/Higashi/TOKI_data/tempdir/chr1_cell_PCA.npy'

when imputing this cell with main_cell.py. Must I provide data from multiple cells to get an imputation result? Is there any way to impute only one cell at a time? Thanks!

ruochiz commented 3 years ago

Ah.. I can see why that happens.

When calculating node features for cell nodes, we use PCA or SVD on the contact maps of the cells. When you input only one cell, PCA or SVD fails (as you can imagine, there is no covariance matrix to begin with). Hence the error about the missing cell_PCA file.

I'll adapt the code a little to support inputting one cell at a time in our upcoming release (probably this weekend to early next week). Generally speaking, though, Higashi performs better on larger datasets (like all deep-learning-based methods).
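The point about the covariance matrix can be illustrated with plain NumPy. This is a sketch of the general effect, not Higashi's actual feature code: the helper `pca_explained_variance` and the random matrices are hypothetical stand-ins for per-cell contact-map features. With a single cell, centering across cells zeroes the data, so no principal axis carries any variance.

```python
import numpy as np


def pca_explained_variance(features: np.ndarray) -> np.ndarray:
    """Variance along each principal axis of a (cells x features) matrix."""
    # PCA centers every feature across cells before decomposing.
    centered = features - features.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Sample variance divides by (n_cells - 1); guard the one-cell case.
    denom = max(features.shape[0] - 1, 1)
    return singular_values ** 2 / denom


rng = np.random.default_rng(0)
one_cell = rng.random((1, 50))     # a single cell: nothing to compare against
many_cells = rng.random((20, 50))  # several cells: variance across cells exists

var_one = pca_explained_variance(one_cell)
var_many = pca_explained_variance(many_cells)
print("one cell:", var_one)
print("many cells, top component:", var_many.max())
```

With one row, the centered matrix is all zeros, so every component has zero variance and there is nothing meaningful to use as a node feature.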

zengguangjie commented 3 years ago

Thank you for your help! I'm looking forward to your update.

zengguangjie commented 3 years ago

Hi, I'm glad to see your update for inputting one cell at a time. I tried your updated code but ran into some exceptions. The first one came up when running Process.py:

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/gs/home/zengguangjie/Higashi/Code/Process.py", line 602, in generate_feats_one
    temp1 = np.eye(temp1.shape[0])
UnboundLocalError: local variable 'temp1' referenced before assignment
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gs/home/zengguangjie/Higashi/Code/Process.py", line 829, in <module>
    create_matrix()
  File "/gs/home/zengguangjie/Higashi/Code/Process.py", line 403, in create_matrix
    temp1, c = p.result()
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
UnboundLocalError: local variable 'temp1' referenced before assignment

The code at line 602 of Process.py is temp1 = np.eye(temp1.shape[0]). I guess the second "temp1" should be "temp", so I changed it to temp1 = np.eye(temp.shape[0]) and tried again, but I encountered another exception:

  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 1172, in <module>
    load_first=False, save_embed=True, save_name="_stage1")
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 552, in train
    model, loss, training_data_generator, optimizer)
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 137, in train_epoch
    batch_edge_weight, batch_chrom, batch_to_neighs, y=batch_y)
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 47, in forward_batch_hyperedge
    pred, pred_var, pred_proba = model(x, (batch_chrom, batch_to_neighs))
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/gs/home/zengguangjie/Higashi/Code/Higashi_backend/Modules.py", line 737, in forward
    dynamic, static, attn = self.get_embedding(x, x_chrom)
  File "/gs/home/zengguangjie/Higashi/Code/Higashi_backend/Modules.py", line 708, in get_embedding
    raise EOFError

thanks!
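The first exception in this comment, the UnboundLocalError, can be reproduced in isolation. The sketch below is not the code from Process.py: the function names and the 4x4 array are hypothetical, and the "fixed" variant only mirrors the change guessed at in the comment (reading the shape from `temp` instead of `temp1`), which the thread does not confirm as the official fix.

```python
import numpy as np


def identity_feats_buggy(temp: np.ndarray) -> np.ndarray:
    # temp1 is a local name because it is assigned in this function, so
    # reading temp1.shape on the right-hand side happens before any
    # value has been bound to it.
    temp1 = np.eye(temp1.shape[0])
    return temp1


def identity_feats_fixed(temp: np.ndarray) -> np.ndarray:
    # Take the size from the input array, which is already bound.
    temp1 = np.eye(temp.shape[0])
    return temp1


contact_map = np.zeros((4, 4))  # placeholder for a per-cell matrix

try:
    identity_feats_buggy(contact_map)
    raised = False
except UnboundLocalError as err:
    raised = True
    print("buggy version:", err)

fixed = identity_feats_fixed(contact_map)
print("fixed version shape:", fixed.shape)
```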

ruochiz commented 3 years ago

I see. The second one is the "nan error". I'll test the code on a dataset with only one cell to see if that triggers the bug.

ruochiz commented 3 years ago

Hi, could you try the latest version again? I can finish training on a dataset with only one cell without any error. If the error occurs again, could you share the dataset and the configuration file that reproduce it?

Also, when inputting one cell at a time, it would help to reduce the number of epochs and the embedding size to prevent potential overfitting.

zengguangjie commented 3 years ago

Hi. I tried the latest version and it works fine. Can you give some specific recommendations on the number of epochs and the embedding size when inputting one cell at a time? We want to see how much improvement Higashi can bring to our downstream task tool.
Thank you for your help!

ruochiz commented 3 years ago

The choice of epochs is flexible. I would use 1 for "embedding_epoch", since you are not using the embeddings of that one cell. For "no_nbr_epoch", you can start with something like 10 to 20; the program terminates if the loss does not improve for 6 epochs, and you can use the point where it stops as guidance.

For embedding size, 32-128 should be good.
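These suggestions could be collected into a set of configuration overrides, sketched below. The keys "embedding_epoch" and "no_nbr_epoch" come from the reply above; the key name "dimensions" for the embedding size, the base config contents, and the helper `apply_overrides` are assumptions made for illustration and should be checked against the configuration file documentation of the installed Higashi version.

```python
import json

# Values taken from the suggested ranges above.
single_cell_overrides = {
    "embedding_epoch": 1,  # the embeddings of the single cell are not used
    "no_nbr_epoch": 20,    # starting point within the suggested 10 to 20
    "dimensions": 64,      # ASSUMED key name; value within the suggested 32-128
}


def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of config with the override values applied."""
    merged = dict(config)
    merged.update(overrides)
    return merged


# Hypothetical base config; a real one would be loaded from the JSON file.
base_config = {"config_name": "single_cell_run"}
config = apply_overrides(base_config, single_cell_overrides)
print(json.dumps(config, indent=2))
```

Since training stops early once the loss has not improved for 6 epochs, the epoch at which a first run stops gives a reasonable value for "no_nbr_epoch" in later runs.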

zengguangjie commented 3 years ago

Thank you for your help!