tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License
86 stars 8 forks source link

Divide and conquer issue #55

Closed SalimMegat closed 1 year ago

SalimMegat commented 1 year ago

Hi, I have been trying to figure it out the issue with divide_and_conquer but I keep getting this exception : Is it an issue with parallelization ?

Detect rare gene modules... 45%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ [01:27]

RemoteTraceback Traceback (most recent call last) RemoteTraceback: """ Traceback (most recent call last): File "/home2020/home/medecine/smegat/.conda/envs/metacells/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "/home2020/home/medecine/smegat/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/parallel.py", line 260, in _invocation result = PARALLEL_FUNCTION(index) File "/home2020/home/medecine/smegat/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/collect.py", line 197, in _collect_metacell assert np.sum(umis_per_gene_of_metacell) == total_umis_of_metacell AssertionError """

The above exception was the direct cause of the following exception:

AssertionError Traceback (most recent call last) File :2

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/divide_and_conquer.py:651, in divide_and_conquer_pipeline(adata, what, rare_max_genes, rare_max_gene_cell_fraction, rare_min_gene_maximum, rare_genes_similarity_method, rare_genes_cluster_method, rare_min_genes_of_modules, rare_min_cells_of_modules, rare_min_module_correlation, rare_min_related_gene_fold_factor, rare_max_related_gene_increase_factor, rare_min_cell_module_total, rare_max_cells_factor_of_random_pile, rare_deviants_max_cell_fraction, rare_dissolve_min_robust_size_factor, rare_dissolve_min_convincing_size_factor, rare_dissolve_min_convincing_gene_fold_factor, quick_and_dirty, select_downsample_min_samples, select_downsample_min_cell_quantile, select_downsample_max_cell_quantile, select_min_gene_total, select_min_gene_top3, select_min_gene_relative_variance, select_correction, cells_similarity_value_regularization, cells_similarity_log_data, cells_similarity_method, groups_similarity_log_data, groups_similarity_method, min_target_pile_size, max_target_pile_size, target_metacells_in_pile, cell_sizes, piles_knn_k_size_factor, piles_min_split_size_factor, piles_min_robust_size_factor, piles_max_merge_size_factor, target_metacell_size, knn_k, knn_k_size_quantile, min_knn_k, knn_balanced_ranks_factor, knn_incoming_degree_factor, knn_outgoing_degree_factor, min_seed_size_quantile, max_seed_size_quantile, candidates_knn_k_size_factor, candidates_cooldown_pass, candidates_cooldown_node, candidates_cooldown_phase, candidates_min_split_size_factor, candidates_max_merge_size_factor, candidates_min_metacell_cells, candidates_max_split_min_cut_strength, candidates_min_cut_seed_cells, must_complete_cover, deviants_policy, deviants_gap_skip_cells, deviants_min_gene_fold_factor, deviants_min_noisy_gene_fold_factor, deviants_max_gene_fraction, deviants_max_cell_fraction, deviants_max_gap_cells_count, deviants_max_gap_cells_fraction, dissolve_min_robust_size_factor, dissolve_min_convincing_size_factor, dissolve_min_convincing_gene_fold_factor, dissolve_min_metacell_cells, random_seed) 649 with ut.timed_step(".common"): 650 with ut.progress_bar_slice(common_cells_fraction): --> 651 _compute_divide_and_conquer_subset( 652 adata, 653 what, 654 prefix="common" if name is None else name + ".common", 655 metacells_level=0, 656 subset_mask=common_cells_mask, 657 collected_mask=collected_mask, 658 counts=counts, 659 dac_parameters=dac_parameters, 660 random_seed=random_seed, 661 ) 663 if rare_cells_count > 0: 664 selected_genes = ut.get_v_numpy(adata, "selected_gene")

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/divide_and_conquer.py:1025, in _compute_divide_and_conquer_subset(adata, what, prefix, subset_mask, collected_mask, metacells_level, counts, dac_parameters, random_seed) 1022 groups_time /= total_time 1024 with ut.progress_bar_slice(total_time): -> 1025 final_pile_of_cells = _compute_metacell_groups( 1026 adata, 1027 what, 1028 collect_time=collect_time, 1029 groups_time=groups_time, 1030 prefix=prefix + ".groups", 1031 subset_mask=subset_mask, 1032 dac_parameters=must_cover_dac_parameters, 1033 random_seed=random_seed, 1034 ) 1036 if dac_parameters.quick_and_dirty: 1037 ut.log_calc(f"# {prefix}.final")

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/divide_and_conquer.py:1185, in _compute_metacell_groups(adata, what, collect_time, groups_time, prefix, subset_mask, dac_parameters, random_seed) 1176 with ut.progress_bar_slice(collect_time): 1177 sdata = ut.slice( 1178 adata, 1179 name=f"{prefix}.grouped", (...) 1182 track_obs="full_cell_index", 1183 ) -> 1185 mdata = collect_metacells( 1186 sdata, 1187 what, 1188 groups=metacell_of_cells[subset_mask], 1189 name=prefix, 1190 _metacell_groups=True, 1191 top_level=False, 1192 random_seed=random_seed, 1193 ) 1195 with ut.progress_bar_slice(groups_time): 1196 metacell_sizes = ut.get_o_numpy(mdata, "grouped")

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/collect.py:251, in collect_metacells(adata, what, metacell_geo_mean, metacell_umis_regularization, zeros_cell_size_quantile, groups, name, prefix, top_level, _metacell_groups, random_seed) 238 ut.log_calc( 239 "fraction_per_gene_of_metacell", fraction_per_gene_of_metacell, formatter=ut.sizes_description 240 ) 242 return { 243 "grouped": grouped_of_metacell, 244 "total_umis": total_umis_of_metacell, (...) 248 "zeros_per_gene": zeros_per_gene, 249 } --> 251 results = ut.parallel_map(_collect_metacell, metacells_count) 253 fraction_per_gene_per_metacell = sp.csr_matrix(np.vstack([result["fraction_per_gene"] for result in results])) 254 assert str(fraction_per_gene_per_metacell.dtype) == "float32"

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/parallel.py:221, in parallel_map(function, invocations, max_processors, hide_from_progress_bar) 219 utm.timed_parameters(index=MAP_INDEX, processes=PROCESSES_COUNT) 220 with get_context("fork").Pool(PROCESSES_COUNT) as pool: --> 221 for index, result in pool.imap_unordered(_invocation, range(invocations)): 222 if utp.has_progress_bar() and not hide_from_progress_bar: 223 utp.did_progress(1 / invocations)

File ~/.conda/envs/metacells/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout) 866 if success: 867 return value --> 868 raise value

AssertionError:

orenbenkiki commented 1 year ago

I had people run into this before - even though the data type of the UMIs is float32, the actual values have to be integers (this is a UMIs count after all), otherwise weird things happen, this is one of them. And yes, I know it is weird - we'll fix it when migrating to Daf.

SalimMegat commented 1 year ago

I am not sure to understand, I thought it was supposed to be float32 ?

orenbenkiki commented 1 year ago

1.0 may be a float, but it has an integer value. 1.1 is a float and does not have an integer value.

SalimMegat commented 1 year ago

Yes thanks but do I need to convert and round up/down the float to integers in the UMI counts ? That is what you mean ?

orenbenkiki commented 1 year ago

Well, yes. Normally, UMIs should have integer values because they are the number of RNA molecules detected, which is an integer by definition.

Fractional UMIs is something you may get if you do some batch-effect removal or noise removal or other kind of complex pre-processing. To feed the results of such a thing to metacells you'll need to get rid of the fractions.

Simple rounding isn't always the best policy. For example, if you have a cell with a 1000 genes with "0.1" UMIs in each, it would probably be sub-optimal to just round them all to 0. It may be better to "stochastically round" them so 10% of them will have 1 UMI. But the exact method is up to you.

SalimMegat commented 1 year ago

Ok thanks for the tip. I will try to figure out a way to do stochastic rounding although I have never done it before. However, in the meantime I tried some simple rounding to the nearest integers and use it as input and I still get the same error...

**

Detect rare gene modules... 45%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ [01:22]

RemoteTraceback Traceback (most recent call last) RemoteTraceback: """ Traceback (most recent call last): File "/home2020/home/medecine/smegat/.conda/envs/metacells/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, **kwds)) File "/home2020/home/medecine/smegat/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/parallel.py", line 260, in _invocation result = PARALLEL_FUNCTION(index) File "/home2020/home/medecine/smegat/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/collect.py", line 197, in _collect_metacell assert np.sum(umis_per_gene_of_metacell) == total_umis_of_metacell AssertionError """

The above exception was the direct cause of the following exception:

AssertionError Traceback (most recent call last) File :2

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/divide_and_conquer.py:651, in divide_and_conquer_pipeline(adata, what, rare_max_genes, rare_max_gene_cell_fraction, rare_min_gene_maximum, rare_genes_similarity_method, rare_genes_cluster_method, rare_min_genes_of_modules, rare_min_cells_of_modules, rare_min_module_correlation, rare_min_related_gene_fold_factor, rare_max_related_gene_increase_factor, rare_min_cell_module_total, rare_max_cells_factor_of_random_pile, rare_deviants_max_cell_fraction, rare_dissolve_min_robust_size_factor, rare_dissolve_min_convincing_size_factor, rare_dissolve_min_convincing_gene_fold_factor, quick_and_dirty, select_downsample_min_samples, select_downsample_min_cell_quantile, select_downsample_max_cell_quantile, select_min_gene_total, select_min_gene_top3, select_min_gene_relative_variance, select_correction, cells_similarity_value_regularization, cells_similarity_log_data, cells_similarity_method, groups_similarity_log_data, groups_similarity_method, min_target_pile_size, max_target_pile_size, target_metacells_in_pile, cell_sizes, piles_knn_k_size_factor, piles_min_split_size_factor, piles_min_robust_size_factor, piles_max_merge_size_factor, target_metacell_size, knn_k, knn_k_size_quantile, min_knn_k, knn_balanced_ranks_factor, knn_incoming_degree_factor, knn_outgoing_degree_factor, min_seed_size_quantile, max_seed_size_quantile, candidates_knn_k_size_factor, candidates_cooldown_pass, candidates_cooldown_node, candidates_cooldown_phase, candidates_min_split_size_factor, candidates_max_merge_size_factor, candidates_min_metacell_cells, candidates_max_split_min_cut_strength, candidates_min_cut_seed_cells, must_complete_cover, deviants_policy, deviants_gap_skip_cells, deviants_min_gene_fold_factor, deviants_min_noisy_gene_fold_factor, deviants_max_gene_fraction, deviants_max_cell_fraction, deviants_max_gap_cells_count, deviants_max_gap_cells_fraction, dissolve_min_robust_size_factor, dissolve_min_convincing_size_factor, dissolve_min_convincing_gene_fold_factor, dissolve_min_metacell_cells, random_seed) 649 with ut.timed_step(".common"): 650 with ut.progress_bar_slice(common_cells_fraction): --> 651 _compute_divide_and_conquer_subset( 652 adata, 653 what, 654 prefix="common" if name is None else name + ".common", 655 metacells_level=0, 656 subset_mask=common_cells_mask, 657 collected_mask=collected_mask, 658 counts=counts, 659 dac_parameters=dac_parameters, 660 random_seed=random_seed, 661 ) 663 if rare_cells_count > 0: 664 selected_genes = ut.get_v_numpy(adata, "selected_gene")

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/divide_and_conquer.py:1025, in _compute_divide_and_conquer_subset(adata, what, prefix, subset_mask, collected_mask, metacells_level, counts, dac_parameters, random_seed) 1022 groups_time /= total_time 1024 with ut.progress_bar_slice(total_time): -> 1025 final_pile_of_cells = _compute_metacell_groups( 1026 adata, 1027 what, 1028 collect_time=collect_time, 1029 groups_time=groups_time, 1030 prefix=prefix + ".groups", 1031 subset_mask=subset_mask, 1032 dac_parameters=must_cover_dac_parameters, 1033 random_seed=random_seed, 1034 ) 1036 if dac_parameters.quick_and_dirty: 1037 ut.log_calc(f"# {prefix}.final")

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/divide_and_conquer.py:1185, in _compute_metacell_groups(adata, what, collect_time, groups_time, prefix, subset_mask, dac_parameters, random_seed) 1176 with ut.progress_bar_slice(collect_time): 1177 sdata = ut.slice( 1178 adata, 1179 name=f"{prefix}.grouped", (...) 1182 track_obs="full_cell_index", 1183 ) -> 1185 mdata = collect_metacells( 1186 sdata, 1187 what, 1188 groups=metacell_of_cells[subset_mask], 1189 name=prefix, 1190 _metacell_groups=True, 1191 top_level=False, 1192 random_seed=random_seed, 1193 ) 1195 with ut.progress_bar_slice(groups_time): 1196 metacell_sizes = ut.get_o_numpy(mdata, "grouped")

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/logging.py:384, in logged..wrap..wrapper(*args, kwargs) 379 if log_value is not None: 380 logger().log( 381 param_level, "%swith %s: %s", INDENT_SPACES[: 2 INDENT_LEVEL], name, log_value 382 ) --> 384 return function(args, kwargs) 386 finally: 387 if logger().isEnabledFor(step_level):

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/collect.py:251, in collect_metacells(adata, what, metacell_geo_mean, metacell_umis_regularization, zeros_cell_size_quantile, groups, name, prefix, top_level, _metacell_groups, random_seed) 238 ut.log_calc( 239 "fraction_per_gene_of_metacell", fraction_per_gene_of_metacell, formatter=ut.sizes_description 240 ) 242 return { 243 "grouped": grouped_of_metacell, 244 "total_umis": total_umis_of_metacell, (...) 248 "zeros_per_gene": zeros_per_gene, 249 } --> 251 results = ut.parallel_map(_collect_metacell, metacells_count) 253 fraction_per_gene_per_metacell = sp.csr_matrix(np.vstack([result["fraction_per_gene"] for result in results])) 254 assert str(fraction_per_gene_per_metacell.dtype) == "float32"

File ~/.conda/envs/metacells/lib/python3.8/site-packages/metacells/utilities/parallel.py:221, in parallel_map(function, invocations, max_processors, hide_from_progress_bar) 219 utm.timed_parameters(index=MAP_INDEX, processes=PROCESSES_COUNT) 220 with get_context("fork").Pool(PROCESSES_COUNT) as pool: --> 221 for index, result in pool.imap_unordered(_invocation, range(invocations)): 222 if utp.has_progress_bar() and not hide_from_progress_bar: 223 utp.did_progress(1 / invocations)

File ~/.conda/envs/metacells/lib/python3.8/multiprocessing/pool.py:868, in IMapIterator.next(self, timeout) 866 if success: 867 return value --> 868 raise value

AssertionError:

**

SalimMegat commented 1 year ago

Here is the output of : cells.X <139611x17875 sparse matrix of type '<class 'numpy.int32'>' with 539728457 stored elements in Compressed Sparse Column format>

orenbenkiki commented 1 year ago

Like I said, and I know this is strange, but the data type needs to be float32 and the values need to be integers (call .astype("float32")). And you'd need to run divide-and-conquer on it and then collect the metacells.

That said, there was an issue with this assertion in 0.9.0 which I'm fixing in 0.9.1 (which we are validating now). In the meanwhile, you can just comment out this line 197 in /home2020/home/medecine/smegat/.conda/envs/metacells/lib/python3.8/site-packages/metacells/pipeline/collect.py - sorry for the inconvenience.

SalimMegat commented 1 year ago

Ok it seems to be working now. Somehow I had created my AnnData object with normalized and not raw UMI counts which explains the issue with the integer values.

Many thanks,