Closed zengguangjie closed 3 years ago
Thanks for your interest in our work and for bringing this problem to my attention. There is an inconsistency in the description of the parameter k. The current neighbor_num=1 is equivalent to k=0 as described in the paper, so "{chromname}{embedding_name}_nbr_1_impute.hdf5" contains the 0 nbr results.
And yes, when $k=0$, the information comes from only one cell.
I'll update the code and the wiki to fix this very soon.
We will also release a major update of Higashi very soon, which includes improved embedding and imputation results, runtime optimization, etc. Feel free to try the new version when it's released.
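The off-by-one described in this reply can be written down as a small helper. This is only a sketch of the mapping as stated above (neighbor_num=1 corresponds to k=0 in the paper); the function name `paper_k` is hypothetical and not part of Higashi, and the mapping may no longer hold after the promised fix.

```python
def paper_k(neighbor_num: int) -> int:
    """Convert the config value neighbor_num to the k used in the paper.

    Assumes the convention described in this thread: neighbor_num counts
    the cell itself, so neighbor_num=1 means k=0 (no neighboring cells).
    """
    if neighbor_num < 1:
        raise ValueError("neighbor_num must be at least 1")
    return neighbor_num - 1


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(f"neighbor_num={n} -> k={paper_k(n)}")
```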
Thank you for your reply! I've done imputation on a dataset containing multiple cells, and the nbr=0 imputation result turns out great. Since nbr=0 means imputation needs no information from other cells, I wanted to find out what happens if I impute one cell at a time. I tried this, but I got an exception:
Traceback (most recent call last):
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 867, in
    embeddings_initial, attribute_dict, targets_initial = generate_attributes()
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 496, in generate_attributes
    a = np.load(os.path.join(temp_dir, "%s_cell_PCA.npy" % c))
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/gs/home/zengguangjie/Higashi/TOKI_data/tempdir/chr1_cell_PCA.npy'
when imputing this cell using main_cell.py. Do I have to provide data for multiple cells to get the imputation result? Is there any way to impute only one cell at a time? Thanks!
Ah.. I can see why that happens.
When calculating node features for cell nodes, we use PCA or SVD on the contact maps for that cell. When you input only one cell, PCA or SVD fails (as you can imagine, there's no covariance matrix to begin with). Hence the "not found cell_PCA" error.
I'll adapt the code a bit to support inputting one cell at a time in our upcoming release (probably this weekend to early next week). Generally speaking, though, Higashi performs better on larger datasets (like all deep-learning-based methods).
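The degenerate case mentioned above can be illustrated with plain NumPy. This is a minimal sketch, not Higashi's feature code (which is not shown in this thread): the random matrices and the helper name `centered_singular_values` are assumptions made purely for illustration. With a single cell, mean-centering leaves an all-zero matrix, so PCA has no variance to decompose.

```python
import numpy as np


def centered_singular_values(cell_by_feature: np.ndarray) -> np.ndarray:
    """Singular values of the mean-centered (cells x features) matrix.

    PCA components come from this decomposition, so all-zero singular
    values mean there is nothing for PCA to extract.
    """
    centered = cell_by_feature - cell_by_feature.mean(axis=0, keepdims=True)
    return np.linalg.svd(centered, compute_uv=False)


rng = np.random.default_rng(0)
one_cell = rng.random((1, 50))     # hypothetical: 1 cell, 50 features
many_cells = rng.random((20, 50))  # hypothetical: 20 cells, 50 features

sv_one = centered_singular_values(one_cell)
sv_many = centered_singular_values(many_cells)

if __name__ == "__main__":
    print("one cell:  ", sv_one)              # centering removes everything
    print("many cells:", np.round(sv_many[:3], 3))
```

With several cells the centered matrix has up to (number of cells - 1) informative directions, which is why the multi-cell run produces usable PCA features while the one-cell run does not.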
Thank you for your help! I'm looking forward to your update.
Hi, I'm glad to see your update about inputting one cell at a time. I tried your updated code but ran into some exceptions. The first one occurred when running Process.py:
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/gs/home/zengguangjie/Higashi/Code/Process.py", line 602, in generate_feats_one
    temp1 = np.eye(temp1.shape[0])
UnboundLocalError: local variable 'temp1' referenced before assignment
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gs/home/zengguangjie/Higashi/Code/Process.py", line 829, in
    create_matrix()
  File "/gs/home/zengguangjie/Higashi/Code/Process.py", line 403, in create_matrix
    temp1, c = p.result()
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.get_result()
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/concurrent/futures/_base.py", line 384, in get_result
    raise self._exception
UnboundLocalError: local variable 'temp1' referenced before assignment
The code at line 602 in Process.py is temp1 = np.eye(temp1.shape[0]). I guess the second "temp1" is a typo for "temp", so I changed it to temp1 = np.eye(temp.shape[0]) and tried again. But I encountered another exception:
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 1172, in
    load_first=False, save_embed=True, save_name="_stage1")
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 552, in train
    model, loss, training_data_generator, optimizer)
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 137, in train_epoch
    batch_edge_weight, batch_chrom, batch_to_neighs, y=batch_y)
  File "/gs/home/zengguangjie/Higashi/Code/main_cell.py", line 47, in forward_batch_hyperedge
    pred, pred_var, pred_proba = model(x, (batch_chrom, batch_to_neighs))
  File "/gs/home/zengguangjie/anaconda/envs/higashi/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/gs/home/zengguangjie/Higashi/Code/Higashi_backend/Modules.py", line 737, in forward
    dynamic, static, attn = self.get_embedding(x, x_chrom)
  File "/gs/home/zengguangjie/Higashi/Code/Higashi_backend/Modules.py", line 708, in get_embedding
    raise EOFError
thanks!
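For readers unfamiliar with the first error in this comment, the UnboundLocalError can be reproduced in isolation. The sketch below is not the Higashi source: the function names are hypothetical, and it assumes, as the poster guessed, that the input array `temp` was the intended source of the shape.

```python
import numpy as np


def identity_like_buggy(temp: np.ndarray) -> np.ndarray:
    # temp1 is read on the right-hand side before it has been assigned,
    # so Python raises UnboundLocalError when this line runs.
    temp1 = np.eye(temp1.shape[0])
    return temp1


def identity_like_fixed(temp: np.ndarray) -> np.ndarray:
    # Use the existing array `temp` to size the identity matrix.
    temp1 = np.eye(temp.shape[0])
    return temp1


if __name__ == "__main__":
    contact_map = np.zeros((3, 4))  # hypothetical input
    try:
        identity_like_buggy(contact_map)
    except UnboundLocalError as err:
        print("buggy version:", type(err).__name__)
    print("fixed version shape:", identity_like_fixed(contact_map).shape)
```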
I see. The second one is the "nan error". I'll test the code on a dataset with only one cell to see if that triggers the bug.
Hi, could you try the latest version again? I can finish training on a dataset with only one cell without any error. If the error is raised again, could you share the dataset and the configuration file that reproduce it?
Also, when inputting one cell at a time, it would help to reduce the number of epochs and the embedding size to prevent potential overfitting.
Hi. I tried the latest version and it works fine.
Can you give some specific recommendations on the number of epochs and the embedding size when inputting one cell at a time? We want to see how much improvement Higashi can bring to our tool for downstream tasks.
Thank you for your help!
The choice of epochs is flexible. I would say 1 for "embedding_epoch", since you are not using the embeddings of that one cell. As for "no_nbr_epoch", you can set it to something like 10~20 as a starting point; the program will terminate if the loss does not improve for 6 epochs. You can then use that as guidance.
For embedding size, 32-128 should be good.
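The recommendations above could be applied to a configuration roughly as follows. This is a hedged sketch: "embedding_epoch" and "no_nbr_epoch" are the keys named in the reply, but "dimensions" as the key for the embedding size is an assumption, and both helper functions are hypothetical. `epochs_run` only illustrates the "stop after 6 epochs without improvement" rule and is not Higashi's training loop.

```python
import json


def single_cell_overrides(config: dict, no_nbr_epoch: int = 15,
                          dimensions: int = 64) -> dict:
    """Return a copy of config with the settings suggested for one cell."""
    updated = dict(config)
    updated["embedding_epoch"] = 1         # embeddings of one cell are unused
    updated["no_nbr_epoch"] = no_nbr_epoch  # suggested starting point: 10~20
    updated["dimensions"] = dimensions      # assumed key name; suggested 32-128
    return updated


def epochs_run(losses: list, patience: int = 6) -> int:
    """Number of epochs run before stopping when the loss has not improved
    for `patience` consecutive epochs."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(losses)


if __name__ == "__main__":
    base = {"neighbor_num": 1}  # hypothetical minimal config
    print(json.dumps(single_cell_overrides(base), indent=2))
    print(epochs_run([1.0, 0.9] + [0.95] * 10))
```

The epoch at which such a run stops early gives a rough value to reuse for "no_nbr_epoch" in later runs, which is the guidance the reply refers to.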
Thank you for your help!
Hi, I'm a little confused about the parameter neighbor_num in the JSON file. neighbor_num should control the number of neighboring cells to incorporate when making imputations. The wiki in Higashi-Usage Step3 says that
but the wiki in Output-of-Higashi-main says that
which does not agree with the wiki in Higashi-Usage or with the paper. Is this a mistake in the Output-of-Higashi-main section? Can I take the
I get to be the result of 0 nbr in the paper?
Another question: when nbr=0, does the hypergraph built for the GNN use only the information from one cell when imputing that cell?
Thanks!