ma-compbio / Higashi

single-cell Hi-C, scHi-C, Hi-C, 3D genome, nuclear organization, hypergraph
MIT License

Problem on running FastHigashi on Ramani et al. dataset #22

Closed sdontsay closed 2 years ago

sdontsay commented 2 years ago

Hi Ruochi! I previously ran Higashi on the Ramani et al. data from your tutorial, so I am now running the same dataset with FastHigashi to see the improvement. I copied the first several lines of code from the Lee et al. tutorial; the code for running FastHigashi is below.

Code

from higashi.Higashi_wrapper import *
from fasthigashi.FastHigashi_Wrapper import *

config = '.../ram_data/config_Ramani_1m.JSON'
higashi_model = Higashi(config)
higashi_model.process_data()

# Initialize the model
fh_model = FastHigashi(config_path=config,
                       path2input_cache=".../ram_fast",
                       path2result_dir=".../ram_fast",
                       off_diag=100,   # 0-100th diag of the contact maps would be used.
                       filter=False,   # fit the model on high quality cells, transform the rest
                       do_conv=False,  # linear convolution imputation
                       do_rwr=False,   # partial random walk with restart imputation
                       do_col=False,   # sqrt_vc normalization
                       no_col=False)   # force to not do sqrt_vc normalization

# Pack from sparse mtx to tensors
fh_model.prep_dataset()

fh_model.run_model(dim1=.6, rank=256, n_iter_parafac=1, extra="")

Next is the content of the config file I provided to the program:

Config file

config_info = {
    "data_dir": ".../ram_data",
    "input_format": 'higashi_v1',
    "temp_dir": ".../ram_fast",
    "genome_reference_path": ".../4DN_data/hg19.chrom.sizes.txt",
    "cytoband_path": ".../4DN_data/cytoBand.txt",
    "chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
                   "chr6", "chr7", "chr8", "chr9", "chr10",
                   "chr11", "chr12", "chr13", "chr14", "chr15",
                   "chr16", "chr17", "chr18", "chr19", "chr20",
                   "chr21", "chr22", "chrX"],
    "resolution": 1000000,
    "resolution_cell": 1000000,
    "resolution_fh": [1000000],
    "embedding_name": "exp_zinb3",
    "minimum_distance": 2000000,
    "maximum_distance": -1,
    "local_transfer_range": 1,
    "loss_mode": "zinb",
    "dimensions": 64,
    "impute_list": ["chr1", "chr2", "chr3", "chr4", "chr5",
                    "chr6", "chr7", "chr8", "chr9", "chr10",
                    "chr11", "chr12", "chr13", "chr14", "chr15",
                    "chr16", "chr17", "chr18", "chr19", "chr20",
                    "chr21", "chr22", "chrX"],
    "neighbor_num": 4,
    "cpu_num": 10,
    "gpu_num": 2,
    "embedding_epoch": 60,
}

Problem

The problem is that, at the iteration step of the pipeline, I get many NaNs:

Starting iteration 0 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; 1, nan; PARAFAC2 re=9.902 takes 5.5s

I don't see this in your tutorial, and I got a rather poor embedding plot (image: UMAP colored by cell type). I don't know why this happened. Do you have any idea? Thanks!

ruochiz commented 2 years ago

Hi,

The NaN loss is "expected": it is just intermediate information and does not mean the result is actually NaN. I think the problem comes from the reordering of the cells. (You can see that there are three clusters in the UMAP; they just do not correspond to the cell types.) In Fast-Higashi, cells with better quality are automatically ordered ahead of cells with poor quality.

To fix that, load "reorder.npy" from the path2input_cache directory and use it to reorder the label info: label = label[reorder].

Or use the label_info loaded in the model instead (model.label_info['cell type'], for instance).

See details in the final visualization part here: https://github.com/ma-compbio/Fast-Higashi/blob/main/PFC%20tutorial.ipynb
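The reordering fix described above can be sketched as follows. This is a minimal illustration, not code from the Fast-Higashi repository: it assumes that "reorder.npy" stores an integer index array such that label[reorder] matches the order of the returned embeddings, as the reply states. The helper name `reorder_labels`, the toy labels, and the toy permutation are all hypothetical, and a temporary directory stands in for the real cache directory.

```python
import os
import tempfile

import numpy as np


def reorder_labels(labels, reorder_path):
    """Permute labels with the index array stored at reorder_path.

    Assumption: the file holds one integer index per cell, so that
    labels[reorder] lines up with the quality-sorted cell order.
    """
    reorder = np.load(reorder_path)
    labels = np.asarray(labels)
    if reorder.shape[0] != labels.shape[0]:
        raise ValueError("reorder.npy and the label array differ in length")
    return labels[reorder]


# Toy data for illustration only (not the Ramani et al. labels).
original_labels = np.array(["GM12878", "HeLa", "HAP1", "K562"])
toy_reorder = np.array([2, 0, 3, 1])  # hypothetical quality-based ordering

# A temporary directory stands in for the real cache directory here.
with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "reorder.npy")
    np.save(path, toy_reorder)
    reordered_labels = reorder_labels(original_labels, path)

print(reordered_labels)
```

With real data, the reordered labels would then be used to color the UMAP in place of the labels in their original file order.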

That being said, I recognize how this can be confusing, and I will release a fix so that the order is corrected automatically by passing an option to the fetch_cell_embedding() function.