ma-compbio / Higashi


ValueError: Found array with 1 feature(s) (shape=(250, 1)) while a minimum of 2 is required by TruncatedSVD #26

Closed Samfouss closed 9 months ago

Samfouss commented 1 year ago

Hi! I have simulated mouse data and I would like to perform cell clustering with the Higashi program, but I always get this error when running it. It seems like the temp objects do not contain any data.

```
generating start/end dict for chromosome
extracting from data.txt
100%|██████████| 39410250/39410250 [01:58<00:00, 332438.27it/s]
generating contact maps for baseline
data loaded 250 False
creating matrices tasks: 100%|██████████| 1/1 [00:00<00:00, 1.19it/s]
total_feats_size 168
  0%|          | 0/1 [00:00<?, ?it/s]Done here 1 -1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fosam16/owner/Higashi/higashi/Higashi_wrapper.py", line 459, in process_data
    self.create_matrix()
  File "/home/fosam16/owner/Higashi/higashi/Higashi_wrapper.py", line 492, in create_matrix
    create_matrix(self.config)
  File "/home/fosam16/owner/Higashi/higashi/Process.py", line 717, in create_matrix
    temp1, c = generate_feats_one(temp[0], temp[1], size, length, c, qc_list[c])
  File "/home/fosam16/owner/Higashi/higashi/Process.py", line 971, in generate_feats_one
    temp1 = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=2).fit_transform(temp1)
  File "/home/fosam16/owner/myvenvhig/lib/python3.8/site-packages/sklearn/decomposition/_truncated_svd.py", line 218, in fit_transform
    X = self._validate_data(X, accept_sparse=["csr", "csc"], ensure_min_features=2)
  File "/home/fosam16/owner/myvenvhig/lib/python3.8/site-packages/sklearn/base.py", line 577, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/home/fosam16/owner/myvenvhig/lib/python3.8/site-packages/sklearn/utils/validation.py", line 918, in check_array
    raise ValueError(
ValueError: Found array with 1 feature(s) (shape=(250, 1)) while a minimum of 2 is required by TruncatedSVD.
```

Could you please help me figure out what I am doing wrong, or what the problem is? I have attached my config file and mouse cell data.

ruochiz commented 1 year ago

Hey, I think it's likely caused by choosing too coarse a resolution (6,000,000). To make things easier for the user, Higashi implements some heuristics to decide the feature dimension, model size, etc. based on the genome reference size and the resolutions typically used for analysis. It's likely that at a resolution of 6,000,000, one of the dimensions suggested by the heuristic becomes 1, which leads to this error. I would suggest changing it to 1Mb as a starting point. If the problem persists, I'll take another look. If for some reason it's necessary to use the 6Mb resolution, I can add a fix to the code to avoid this issue.
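As an illustration of that suggestion, a minimal sketch of switching the binning to 1Mb before rerunning the processing step; the "resolution" key name is an assumption here, so check it against your own config file:

```python
import json

config_path = "config_souris.JSON"
with open(config_path) as f:
    config = json.load(f)

# Assumption: the bin size is stored under "resolution"; adjust if your config differs.
config["resolution"] = 1000000  # 1Mb instead of 6,000,000

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

After updating the config, higashi_model.process_data() needs to be rerun so the contact matrices are regenerated at the new resolution.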

Samfouss commented 1 year ago

Thank you for responding to my concern; your response was very helpful. I switched to 1Mb as suggested, but now I am getting a message like: "The 0 th chrom in your chrom_list has no sample in this generator". Can you help me understand what it means? I am working on chr11, as specified in the configuration file. Do you think it is caused by the sparsity of my data (total_sparsity_cell 0.00040311744154797097)? I noticed that in your tutorial (Higashi/tutorials/4DN_sci-Hi-C_Kim et al.ipynb) you get something like total_sparsity_cell 0.012761184803150997.

```
>>> from higashi.Higashi_wrapper import *

>>> config = "config_mousse.JSON"
>>> config = "config_souris.JSON"
>>> print("1. Config finished")
1. Config finished

>>> # Initialize the Higashi instance
>>> higashi_model = Higashi(config)

>>> # Data processing (only needs to be run for once)
>>> higashi_model.process_data()
generating start/end dict for chromosome
extracting from data.txt
100%|██████████| 39410250/39410250 [01:56<00:00, 337588.28it/s]
generating contact maps for baseline
data loaded 750 False
creating matrices tasks: 100%|██████████| 1/1 [00:00<00:00, 1.07it/s]
total_feats_size 200
  0%|          | 0/1 [00:00<?, ?it/s]Done here 1 1
Done here 2
100%|██████████| 1/1 [00:00<00:00, 67.57it/s]

>>> higashi_model.prep_model()
cpu_num 32
training on data from: ['chr11']
total_sparsity_cell 0.00040311744154797097
no contractive loss
batch_size 256
Node type num [250 122] [250 372]
start making attribute
0.994:  32%|██████████| 96/300 [00:00<00:00, 433.01it/s]
loss 0.9697239995002747 loss best 0.9167578220367432 epochs 96

initializing data generator
The 0 th chrom in your chrom_list has no sample in this generator
100%|██████████| 1/1 [00:00<00:00, 7194.35it/s]
initializing data generator
The 0 th chrom in your chrom_list has no sample in this generator
100%|██████████| 1/1 [00:00<00:00, 7752.87it/s]

>>> print("2. Process finished")
2. Process finished
```
ruochiz commented 1 year ago

Um, this error is raised when there are no hyperedges to train the model on. For debugging purposes, could you run the following script?

```python
import os, h5py
import numpy as np

temp_dir = ...  # replace with the temp_dir from your config
with h5py.File(os.path.join(temp_dir, "node_feats.hdf5"), "r") as input_f:
    print(len(np.array(input_f['train_data_%s' % "chr11"]).astype('int')))
```

Also, what are the minimum and maximum distances in your config file?

Thanks.

Samfouss commented 1 year ago

In my config file, I have: "minimum_impute_distance": 0, "maximum_impute_distance": -1.

The fact that there are no hyperedges to train the model is probably related to my cell data. Maybe they are too sparse.

Samfouss commented 1 year ago

So maybe I have to increase these two distances.

ruochiz commented 1 year ago

That seems to be using all the edges, so I think these two parameters are probably fine. Also, I meant minimum_distance, not minimum_impute_distance.
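As a quick way to see which of these values are actually set, here is a minimal sketch that prints the distance-related entries from the config; the two *_impute_distance keys are the ones quoted above, and minimum_distance / maximum_distance are the ones referred to here (any key missing from your config simply prints as not set):

```python
import json

with open("config_souris.JSON") as f:
    config = json.load(f)

# Per the discussion above, minimum_distance / maximum_distance (not the *_impute_* keys)
# are the ones that matter for the training edges.
for key in ("minimum_distance", "maximum_distance",
            "minimum_impute_distance", "maximum_impute_distance"):
    print(key, config.get(key, "<not set>"))
```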

Could you try running the code I provided above to see whether there are any edges before the filtering step? That would help narrow down where the problem is (too few reads, reads that are mostly short-range interactions, etc.).

Thanks!

Samfouss commented 1 year ago

When I run the code above, this is what I get:

```
>>> temp_dir = ...
>>> with h5py.File(os.path.join(temp_dir, "node_feats.hdf5"), "r") as input_f:
...     print(len(np.array(input_f['train_data_%s' % "chr11"]).astype('int')))
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cvmfs/samfouss/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not ellipsis
```
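The TypeError above comes from passing the literal `...` placeholder into os.path.join; temp_dir has to be set to the actual working directory. A minimal sketch of the same check with the directory read from the config, assuming the config stores it under a "temp_dir" key (adjust to your setup):

```python
import json, os
import h5py
import numpy as np

# Assumption: the Higashi JSON config keeps the working directory under "temp_dir".
with open("config_souris.JSON") as f:
    temp_dir = json.load(f)["temp_dir"]

with h5py.File(os.path.join(temp_dir, "node_feats.hdf5"), "r") as input_f:
    # Number of training hyperedges recorded for chr11 before filtering.
    print(len(np.array(input_f["train_data_%s" % "chr11"]).astype("int")))
```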

Samfouss commented 1 year ago

Also, I do not know what happened, but now I can run the higashi_model.prep_model() command without any problem:

```
>>> from higashi.Higashi_wrapper import *

>>> # Set the path to the configuration file, change it accordingly
>>> config = "config_mousse.JSON"
>>> config = "config_souris.JSON"
>>> print("1. Config finished")
1. Config finished

>>> # Initialize the Higashi instance
>>> higashi_model = Higashi(config)

>>> # Data processing (only needs to be run for once)
>>> higashi_model.process_data()
generating start/end dict for chromosome
extracting from data.txt
100%|██████████| 39410250/39410250 [01:58<00:00, 333620.38it/s]
generating contact maps for baseline
data loaded 2831250 False
creating matrices tasks: 100%|██████████| 1/1 [15:21<00:00, 921.79s/it]
total_feats_size 200
  0%|          | 0/1 [00:00<?, ?it/s]Done here 1 149
Done here 2
100%|██████████| 1/1 [00:00<00:00, 1.30it/s]

>>> higashi_model.prep_model()
cpu_num 32
training on data from: ['chr11']
total_sparsity_cell 0.00040311744154797097
no contractive loss
batch_size 1280
Node type num [ 250 12185] [ 250 12435]
start making attribute
0.636: 100%|██████████| 300/300 [00:01<00:00, 213.18it/s]
loss 0.6364461779594421 loss best 0.6372790932655334 epochs 299

initializing data generator
100%|██████████| 1/1 [00:00<00:00, 25731.93it/s]
initializing data generator
100%|██████████| 1/1 [00:00<00:00, 27962.03it/s]

>>> print("2. Process finished")
2. Process finished
```

But "higashi_model.train_for_embeddings()" takes a lot of time to execute.

GMFranceschini commented 9 months ago

I am getting the same error; I tried 1Mb and 100Kb to no avail. I have sparsity 0.22241997971104638, which should be quite good. Do you have any suggestions on how to debug this? In the temp_dir I can't find a node_feats.hdf5 file to inspect as you suggested.

ruochiz commented 9 months ago

Hey, what version of Higashi are you using? Is it the one from conda or the GitHub + pip install?

GMFranceschini commented 9 months ago

I installed it by downloading the repo and running setup.py. From my conda env export I see:

name: fasthigashi
  - fasthigashi=0.1.1=py_0
ruochiz commented 9 months ago

I see. And to confirm, the error is: ValueError: Found array with 1 feature(s) (shape=(250, 1)) while a minimum of 2 is required by TruncatedSVD

GMFranceschini commented 9 months ago

Yes, though with a different n for my data:

'ValueError: Found array with 1 feature(s) (shape=(69, 1)) while a minimum of 2 is required by TruncatedSVD.'
ruochiz commented 9 months ago

To help with debugging:

  1. Are you working with a custom genome or a standard one (like hg38 / mm10)?
  2. Are there any chromosomes shorter than 1Mb in the dataset?
  3. Under the temp_dir there should be some files named "cell_adj_%s.npy"; could you load one of them and print out the shape? (A sketch of this check follows below.)

Thanks!
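A minimal sketch of check 3, assuming the chr11 file sits directly under the temp_dir from the config and was written with np.save (if it turns out to be a pickled sparse matrix, allow_pickle=True is required and the object has to be unwrapped with .item()):

```python
import os
import numpy as np

temp_dir = "/path/to/temp_dir"  # the temp_dir from your config

# Load the per-cell adjacency file for chr11 and report what it contains.
cell_adj = np.load(os.path.join(temp_dir, "cell_adj_chr11.npy"), allow_pickle=True)
print(type(cell_adj), getattr(cell_adj, "shape", None))
```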

GMFranceschini commented 9 months ago

Thanks to you! You may already have found the problem.

hg19, but I had filtered out the uncharacterized chromosomes. However, chrM was still there. I removed it and dropped all entries involving it from the pairs, and now the SVD step works. Maybe a small addition to the documentation could note that only chr1-chr22 and chrX/chrY should be included; that's what I should have done in the first place!
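For anyone hitting the same problem, a minimal sketch of dropping chrM (and anything else non-canonical) from the input contacts before rerunning process_data; the file name and the chrom1/chrom2 column names are assumptions, so adjust them to your own data layout:

```python
import pandas as pd

# Assumption: data.txt is tab-separated with "chrom1" and "chrom2" columns.
keep = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}

contacts = pd.read_csv("data.txt", sep="\t")
mask = contacts["chrom1"].isin(keep) & contacts["chrom2"].isin(keep)
contacts[mask].to_csv("data_filtered.txt", sep="\t", index=False)
```

Point the config at the filtered file (or overwrite data.txt) and rerun the processing step.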

Thank you for the help, and feel free to close this!

ruochiz commented 9 months ago

I see, sounds good. Will update the documentation.