Hey, I think it's likely caused by choosing too coarse a resolution (6,000,000). To make things easier for the user, Higashi implements some heuristics that decide the feature dimension, model size, etc., based on the genome reference size and the resolutions typically used for analysis. At a resolution of 6,000,000, one of the dimensions suggested by these heuristics likely becomes 1, which leads to this error. I would suggest changing it to 1Mb as a starting point. If the problem persists, I'll take another look. If for some reason the 6Mb resolution is necessary, I can add a fix to the code to avoid this issue.
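For reference, the resolution is set through the `resolution` field of the JSON config. A minimal sketch with only the relevant keys (the full set of required keys follows the Higashi config documentation; values are illustrative):

```json
{
    "resolution": 1000000,
    "chrom_list": ["chr11"]
}
```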
Thank you for responding to my concern. Your response was very helpful. I took 1Mb as suggested, but now I am getting a message like: "The 0 th chrom in your chrom_list has no sample in this generator". Can you help me understand what it means? I am working on chr11, as specified in the configuration file. Do you think it is caused by the sparsity of my data (total_sparsity_cell 0.00040311744154797097)? I noticed that in your tutorial (Higashi/tutorials/4DN_sci-Hi-C_Kim et al.ipynb), you get something like total_sparsity_cell 0.012761184803150997.
```
>>> from higashi.Higashi_wrapper import *
>>> config = "config_mousse.JSON"
>>> config = "config_souris.JSON"
>>> print("1. Config finished")
1. Config finished
>>> # Initialize the Higashi instance
>>> higashi_model = Higashi(config)
>>> # Data processing (only needs to be run once)
>>> higashi_model.process_data()
generating start/end dict for chromosome
extracting from data.txt
100%|████████| 39410250/39410250 [01:56<00:00, 337588.28it/s]
generating contact maps for baseline
data loaded 750 False
creating matrices tasks: 100%|████████| 1/1 [00:00<00:00, 1.07it/s]
total_feats_size 200
  0%|        | 0/1 [00:00<?, ?it/s]Done here 1 1
Done here 2
100%|████████| 1/1 [00:00<00:00, 67.57it/s]
>>> higashi_model.prep_model()
cpu_num 32
training on data from: ['chr11']
total_sparsity_cell 0.00040311744154797097
no contractive loss
batch_size 256
Node type num [250 122] [250 372]
start making attribute
0.994:  32%|████████| 96/300 [00:00<00:00, 433.01it/s]
loss 0.9697239995002747 loss best 0.9167578220367432 epochs 96
initializing data generator
  0%|        | 0/1 [00:00<?, ?it/s] The 0 th chrom in your chrom_list has no sample in this generator
100%|████████| 1/1 [00:00<00:00, 7194.35it/s]
initializing data generator
  0%|        | 0/1 [00:00<?, ?it/s] The 0 th chrom in your chrom_list has no sample in this generator
100%|████████| 1/1 [00:00<00:00, 7752.87it/s]
>>> print("2. Process finished")
2. Process finished
```
Hmm. This error is raised when there are no hyperedges to train the model on. For debugging purposes, could you run the following script?
```python
import os
import h5py
import numpy as np

temp_dir = ...  # set this to the temp_dir from your config
with h5py.File(os.path.join(temp_dir, "node_feats.hdf5"), "r") as input_f:
    print(len(np.array(input_f['train_data_%s' % "chr11"]).astype('int')))
```
Also, what are the minimum and maximum distances in your config file?
Thanks.
In my config file, I have: `"minimum_impute_distance": 0, "maximum_impute_distance": -1`.
The fact that there are no hyperedges to train the model on is probably related to my cell data; maybe it is too sparse. So maybe I have to increase these two distances?
That setting uses all the edges, so I think these two parameters are probably fine. Also, I meant `minimum_distance`, not `minimum_impute_distance`.
Could you try running the code I provided above to see whether there are any edges before the filtering step? That would help narrow down where the problem is (too few reads, reads that are mostly short-range interactions, etc.).
Thanks!
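To make the distinction concrete, the two pairs of keys look roughly like this in the JSON config. This is a sketch with illustrative values; as I understand it, `minimum_distance`/`maximum_distance` filter which contacts are used for training, while the `*_impute_*` pair only controls the imputation range:

```json
{
    "minimum_distance": 1000000,
    "maximum_distance": -1,
    "minimum_impute_distance": 0,
    "maximum_impute_distance": -1
}
```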
When I run the code above, this is what I get:
```
>>> temp_dir = ...
>>> with h5py.File(os.path.join(temp_dir, "node_feats.hdf5"), "r") as input_f:
...     print(len(np.array(input_f['train_data_%s' % "chr11"]).astype('int')))
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cvmfs/samfouss/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not ellipsis
```
Also, I do not know what happened, but now I can run the `higashi_model.prep_model()` command without any problem:
```
>>> from higashi.Higashi_wrapper import *
>>> # Set the path to the configuration file, change it accordingly
>>> config = "config_mousse.JSON"
>>> config = "config_souris.JSON"
>>> print("1. Config finished")
1. Config finished
>>> # Initialize the Higashi instance
>>> higashi_model = Higashi(config)
>>> # Data processing (only needs to be run once)
>>> higashi_model.process_data()
generating start/end dict for chromosome
extracting from data.txt
100%|████████| 39410250/39410250 [01:58<00:00, 333620.38it/s]
generating contact maps for baseline
data loaded 2831250 False
creating matrices tasks: 100%|████████| 1/1 [15:21<00:00, 921.79s/it]
total_feats_size 200
  0%|        | 0/1 [00:00<?, ?it/s]Done here 1 149
Done here 2
100%|████████| 1/1 [00:00<00:00, 1.30it/s]
>>> higashi_model.prep_model()
cpu_num 32
training on data from: ['chr11']
total_sparsity_cell 0.00040311744154797097
no contractive loss
batch_size 1280
Node type num [ 250 12185] [ 250 12435]
start making attribute
0.636: 100%|████████| 300/300 [00:01<00:00, 213.18it/s]
loss 0.6364461779594421 loss best 0.6372790932655334 epochs 299
initializing data generator
100%|████████| 1/1 [00:00<00:00, 25731.93it/s]
initializing data generator
100%|████████| 1/1 [00:00<00:00, 27962.03it/s]
>>> print("2. Process finished")
2. Process finished
```
But "higashi_model.train_for_embeddings()" takes a lot of time to execute.
I am getting the same error; I tried 1Mb and 100Kb to no avail. I have sparsity 0.22241997971104638, which should be quite good. Do you have any suggestions on how to debug this? In the `temp_dir` I can't find a `node_feats.hdf5` file to dump as you suggested.
Hey, what version of Higashi are you using? Is it the one from conda, or the GitHub repo + pip install?
I installed it by downloading the repo and using setup.py. From my `conda env export` I see:

```
name: fasthigashi
  - fasthigashi=0.1.1=py_0
```
I see. And to confirm, the error is `ValueError: Found array with 1 feature(s) (shape=(250, 1)) while a minimum of 2 is required by TruncatedSVD`?
Yes, of course with a different n for my data:
`ValueError: Found array with 1 feature(s) (shape=(69, 1)) while a minimum of 2 is required by TruncatedSVD.`
To help with debugging: in your `temp_dir` there should be files named `cell_adj_%s.npy`. Could you load one of them and print out the shape? Thanks!
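For example, something like this (a minimal sketch; the path and chromosome name are placeholders to adjust to your setup):

```python
import os
import numpy as np

temp_dir = "/path/to/temp_dir"  # hypothetical; use the temp_dir from your config
adj = np.load(os.path.join(temp_dir, "cell_adj_%s.npy" % "chr11"), allow_pickle=True)
print(adj.shape)
```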
Thanks to you! You may already have found the problem. I used `hg19`, but filtered out the uncharacterized chromosomes. However, `chrM` was still there. I removed it, along with all entries involving it in the pairs, and now the SVD step works. Maybe a small addition to the documentation could note that only chr1 through chrX/chrY should be included; that's what I should have done in the first place!
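For reference, this is roughly the filtering I did. A minimal sketch that assumes a whitespace-separated pairs-like file with chromosome names in the second and fourth columns (adjust column indices and filenames to your format):

```python
# Drop all chrM entries from a pairs-like contact file.
# Assumes chromosome names are in fields 1 and 3 (0-based); adjust as needed.
with open("contacts.pairs") as fin, open("contacts.filtered.pairs", "w") as fout:
    for line in fin:
        if line.startswith("#"):  # keep header/comment lines
            fout.write(line)
            continue
        fields = line.split()
        if "chrM" in (fields[1], fields[3]):
            continue  # skip mitochondrial contacts
        fout.write(line)
```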
Thank you for the help, and feel free to close this!
I see, sounds good. I will update the documentation.
Hi! I have simulated mouse data, and I would like to perform cell clustering using the Higashi program. But I always get this error when running it. It seems like the temp objects do not contain any data.
```
generating start/end dict for chromosome
extracting from data.txt
100%|████████| 39410250/39410250 [01:58<00:00, 332438.27it/s]
generating contact maps for baseline
data loaded 250 False
creating matrices tasks: 100%|████████| 1/1 [00:00<00:00, 1.19it/s]
total_feats_size 168
  0%|        | 0/1 [00:00<?, ?it/s]Done here 1 -1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fosam16/owner/Higashi/higashi/Higashi_wrapper.py", line 459, in process_data
    self.create_matrix()
  File "/home/fosam16/owner/Higashi/higashi/Higashi_wrapper.py", line 492, in create_matrix
    create_matrix(self.config)
  File "/home/fosam16/owner/Higashi/higashi/Process.py", line 717, in create_matrix
    temp1, c = generate_feats_one(temp[0], temp[1], size, length, c, qc_list[c])
  File "/home/fosam16/owner/Higashi/higashi/Process.py", line 971, in generate_feats_one
    temp1 = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=2).fit_transform(temp1)
  File "/home/fosam16/owner/myvenvhig/lib/python3.8/site-packages/sklearn/decomposition/_truncated_svd.py", line 218, in fit_transform
    X = self._validate_data(X, accept_sparse=["csr", "csc"], ensure_min_features=2)
  File "/home/fosam16/owner/myvenvhig/lib/python3.8/site-packages/sklearn/base.py", line 577, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/home/fosam16/owner/myvenvhig/lib/python3.8/site-packages/sklearn/utils/validation.py", line 918, in check_array
    raise ValueError(
ValueError: Found array with 1 feature(s) (shape=(250, 1)) while a minimum of 2 is required by TruncatedSVD.
```
Please, can you help me figure out what I am doing wrong, or what the problem is? Here is everything about the config file and the mouse cell data.