Hey,

Thanks for your interest in Higashi, and glad to hear that it works well on your dataset. Happy to help. As for your question:

1. Did you use `wrapper.fast_process_data()` to generate those files (as in https://github.com/ma-compbio/Higashi/blob/main/tutorials/FastHigashi_dipc_Tan_mousebrain_500k.ipynb), or did you use Higashi's processing function to generate the sparse matrices?
2. For the data format, is it the big `data.txt` or a `filelist.txt` pointing to a set of contact-pair files?
3. If it's the big `data.txt`:
   a. does the `cell_id` column go from 0 to the number of cells - 1?
   b. is `data.txt` sorted by cell index first, and did you pass the `structured` parameter in the config JSON? (A quick way to check both is sketched below.)
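For reference, a minimal check along these lines (just a sketch; the path, tab separator, and header are assumptions based on the format discussed in this thread):

```python
import pandas as pd

# Hedged sketch: verify the two data.txt assumptions above.
# "data.txt" is a placeholder path.
data = pd.read_table("data.txt", sep="\t")

n_cells = data["cell_id"].nunique()
# a. cell_id should run from 0 to (number of cells - 1)
print(data["cell_id"].min() == 0 and data["cell_id"].max() == n_cells - 1)
# b. only needed when structured=true: contacts grouped/sorted by cell_id
print(data["cell_id"].is_monotonic_increasing)
```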
Hi, thanks for the fast response!

I used `wrapper.fast_process_data` to generate these files. The input is a single `data.txt` (`higashi_v1`), with `cell_id` starting from 0. The head of the data looks like:

```
cell_id  chrom1  pos1   chrom2  pos2   count
470      chr1    10020  chr1    10136  1
133      chr1    10028  chr1    10148  1
84       chr1    10032  chr1    10156  1
```

The `structured` parameter in the config is set to false.

Best, Yang
OK, those all sound correct. You don't need to sort the input by index if you set `structured` to false.

Did you set the `blacklist` parameter, e.g. to the ENCODE blacklist.bed?

One thing I just noticed: if there are no contacts for a cell, the resulting dataframe of contact pairs is empty, and my code would still parse an empty contact map. But if that empty dataframe is passed to the bedtools intersection with the blacklist.bed, it raises an error in the child process, leaving 0s in `sparse_matrix_list`. I can definitely fix this very quickly; just checking whether that could be the case.
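If that's the case, the fix would be a guard along these lines (a hedged sketch, not FastHigashi's actual code; `build_fn` stands in for the normal blacklist-intersection path):

```python
import numpy as np
from scipy.sparse import csr_matrix

def safe_cell_matrices(cell_contacts, chrom_bin_counts, build_fn):
    """Sketch: return empty per-chromosome matrices for a contact-less cell
    instead of sending an empty dataframe through bedtools and crashing
    the child process (which leaves a 0 placeholder behind)."""
    if len(cell_contacts) == 0:
        return [csr_matrix((n, n), dtype=np.float32) for n in chrom_bin_counts]
    return build_fn(cell_contacts)  # normal path: blacklist intersection, binning, etc.
```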
BTW, in the code we automatically filter out contacts with distance < 2500 bp, so the three lines of contacts above would be deleted during processing. I will include a hyperparameter in the next release so users can specify that threshold.
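For illustration, that filter is roughly equivalent to the following (a sketch, not the actual FastHigashi implementation; column names follow the `data.txt` head above):

```python
import pandas as pd

def drop_short_contacts(data: pd.DataFrame, min_dist: int = 2500) -> pd.DataFrame:
    """Sketch: remove intra-chromosomal contacts spanning < min_dist bp."""
    intra = data["chrom1"] == data["chrom2"]
    too_close = (data["pos2"] - data["pos1"]).abs() < min_dist
    return data[~(intra & too_close)]
```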
Hi, I did not set the `blacklist` parameter.

Hmm... interesting. So this is how Higashi generates the mtx files in a multiprocessing manner:
```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

# One slot per (chromosome, cell); 0 is the placeholder until a result arrives.
mtx_all_list = [[0] * len(filelist) for i in range(len(chrom_list))]
p_list = []
pool = ProcessPoolExecutor(max_workers=cpu_num)
bar = tqdm(total=len(filelist))  # progress bar (tqdm assumed here)
for cell_id, file in enumerate(filelist):
    p_list.append(pool.submit(data2mtx, config, file, chrom_start_end, False, cell_id, blacklist))
for p in as_completed(p_list):
    mtx_list, cell_id = p.result()
    # Replace the 0 placeholders with this cell's per-chromosome matrices.
    for i in range(len(chrom_list)):
        mtx_all_list[i][cell_id] = mtx_list[i]
    bar.update(1)
bar.close()
```
It first creates a list of [0, 0, ..., 0] with length equal to the number of cells, launches one `data2mtx` task per cell, then fetches the results and puts them back into the list, replacing the zeros. Thus, when a child process hits an error, it returns nothing and leaves the zero there. So to debug this, could you try something like:
```python
import os
import numpy as np
import pandas as pd
from fasthigashi.Fast_process import data2mtx

data = pd.read_table("/home/rzhang/Data/scHiC_collection/ramani/data.txt", sep='\t')
data2mtx(wrapper.config, data[data['cell_id'] == 0],
         np.load(os.path.join(wrapper.temp_dir, "chrom_start_end.npy")), False, 0)
```
Hopefully, this returns the error, and I can help to figure out what triggers it.
Thanks for your detailed explanation!

Here is the output of `data2mtx`:

I am trying Ramani's data to see whether it is some intrinsic problem with my data... and here is the config in case it helps:
{"config_name": "SixLine_human_PF_long",
"data_dir": "/projects/ps-renlab2/y2xie/projects/77.LC/33.Paired_HiC_NovaSeq_221206/05.R/Fast_higashi/",
"temp_dir": "/projects/ps-renlab2/y2xie/projects/77.LC/33.Paired_HiC_NovaSeq_221206/05.R/Fast_higashi",
"header_included": "false",
"input_format": "higashi_v1",
"structured": "false",
"genome_reference_path": "/projects/ps-renlab2/y2xie/projects/genome_ref/hg38.main.chrom.sizes",
"cytoband_path": "/projects/ps-renlab2/y2xie/projects/genome_ref/hg38.main.cytoBand.txt",
"chrom_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22"],
"resolution": 1000000,
"resolution_cell": 1000000,
"resolution_fh": [1000000],
"local_transfer_range": 1,
"dimensions": 64,
"loss_mode": "zinb",
"rank_thres": 1,
"embedding_name": "SixLine",
"impute_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22"],
"minimum_distance": 1000000,
"maximum_distance": -1,
"neighbor_num": 5,
"cpu_num": 1,
"gpu_num": 1
}
Oh... hey, I think I might have a theory. Could you change "false" in your config to just false (no quotes)?

Passing "false" (as a string) in the config results in True in Python, because any non-empty string is truthy.

With your current config.JSON, Higashi thinks it's a structured dataset (contacts sorted by cell index first), so whenever it encounters a new cell id, it assumes that cell's contacts end there.
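A minimal standalone illustration of the pitfall (plain Python/JSON, independent of Higashi):

```python
import json

# In JSON, false (no quotes) is a boolean, but "false" is a string,
# and any non-empty string is truthy in Python.
cfg = json.loads('{"structured": "false"}')
print(bool(cfg["structured"]))  # True  <- the accidental behavior

cfg = json.loads('{"structured": false}')
print(bool(cfg["structured"]))  # False <- the intended behavior
```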
Ohhhhhhh... this is such a stupid mistake! I removed the quotes and it works smoothly. I will also go back and re-run my previous analyses. Thank you very much!
Hi Ruochi,

Thank you so much for the great packages! I am currently trying FastHigashi on our single-cell Hi-C dataset (I ran the analysis with Higashi before and the results were good). But when running `wrapper.prep_dataset` I encountered an issue. Loading one npy file to check:

I assume `chrXXX_sparse_adj.npy` should store sparse matrices, but it seems that my first element is 0... Any hints on this?

Thanks again, Yang
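For context, the check described above can be done along these lines (a hedged sketch; the chromosome name and file location are placeholders based on the filename pattern mentioned in this thread):

```python
import numpy as np

# Hedged sketch: inspect one generated sparse-adjacency file.
# "temp_dir/chr1_sparse_adj.npy" is a placeholder path.
mtx_list = np.load("temp_dir/chr1_sparse_adj.npy", allow_pickle=True)
print(len(mtx_list), type(mtx_list[0]))  # a 0 entry means that cell's child process failed
```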