FastHigashi wrapper.prep_dataset: 'int' object has no attribute 'shape'

Xieeeee commented 1 year ago

Hi Ruochi, Thank you so much for the great packages! I am recently trying FastHigashi on our single cell HiC dataset (I ran the analysis with Higashi before and the result was good). But when running wrapper.prep_dataset I encountered an issue:

Loading one npy file for check:

I assume chrXXX_sparse_adj.npy should store sparse matrixs but seems that my first element is 0... Any hints for this?

Thanks again, Yang

ruochiz commented 1 year ago

Hey,

Thanks for your interests in Higashi, and glad to hear that it works well on your dataset. Would be happy to help. As for your question:

did you use wrapper.fast_process_data() to generate those files (like here: https://github.com/ma-compbio/Higashi/blob/main/tutorials/FastHigashi_dipc_Tan_mousebrain_500k.ipynb) or did you use Higashi's processing function to generate the sparse matrices?
the data format is it the big "data.txt" or a "filelist.txt" pointing to a couple of contact pair files?
If it's the big "data.txt": a. for the cell_id column, does it go from 0 to the number of cells - 1, b. whether the data.txt is sorted by cell_index first, and did you pass the structured parameter into the config.JSON

Xieeeee commented 1 year ago

Hi, Thanks for the fast respsone!

I use wrapper.fast_process_data to generate these files:
The input is a single data.txt (higashi_v1), with cell id starting from 0. Head of the data is like:

cell_id chrom1  pos1    chrom2  pos2    count
470     chr1    10020   chr1    10136   1
133     chr1    10028   chr1    10148   1
84      chr1    10032   chr1    10156   1

I didn't sort the input by index (did I miss this step?); the structured parameter in config is set to false

Best, Yang

ruochiz commented 1 year ago

OK. Sounds like these are all correct things. You don't need to sort the input by index if you put structured as false.

Did you set the blacklist parameter like the ENCODE blacklist.bed? One thing I just noticed is that, if there are no contacts for a cell -> resulting in empty dataframe of contact pairs, my code would still parse an empty contact map. But, if this empty dataframe pass to the bedtools intersection with the blacklist.bed it would raise an Error in child process, leading to 0s in the sparse_matrix_list, I can definitely fix this very quickly. Just checking if that can be the case.

BTW, in the code, we automatically filter out contacts that have distance < 2500bp, so the above three line of contacts would be deleted during the process. Will include a hyperparamter to allow user to specify that in the next release.

Xieeeee commented 1 year ago

Hi,

I didn't set blacklist parameter.
I have >1000 cells in the current dataset, each has more than 2,000 read pairs that have genomic span greater than 500 Kb (as described in the Higashi paper). I can quickly double check this. In the dataset we also have many cis-short contacts, but I assume this will be automatically filtered and would not case any problem.

ruochiz commented 1 year ago

Hum... interesting. So this is how Higashi generates mtx in a multiprocessing manner:

mtx_all_list = [[0] * len(filelist) for i in range(len(chrom_list))]
p_list = []
pool = ProcessPoolExecutor(max_workers=cpu_num)
for cell_id, file in enumerate(filelist):
    p_list.append(pool.submit(data2mtx, config, file, chrom_start_end, False, cell_id, blacklist))

for p in as_completed(p_list):
    mtx_list, cell_id = p.result()
    for i in range(len(chrom_list)):
        mtx_all_list[i][cell_id] = mtx_list[I]
    bar.update(1)
bar.close()

It first creates [0,0,..0] the length of the number of cells, launch data2mtx process, then fetch the results and put them back in the list to replace the zeros. Thus, when a child process faces an error, it returns nothing, and leave the zero there. So to debug this, could you try sth like this:

from fasthigashi.Fast_process import data2mtx
data = pd.read_table("/home/rzhang/Data/scHiC_collection/ramani/data.txt", sep='\t')
data2mtx(wrapper.config, data[data['cell_id'] == 0],  np.load(os.path.join(wrapper.temp_dir, "chrom_start_end.npy")), False, 0)

Hopefully, this returns the error, and I can help to figure out what triggers it.

Xieeeee commented 1 year ago

Thanks for your detailed explanation! Here is the output of data2mtx:

I am trying Ramani's data to see whether it is some intrinsic problem of my data... and here is the config in case it help:

{"config_name": "SixLine_human_PF_long",
        "data_dir": "/projects/ps-renlab2/y2xie/projects/77.LC/33.Paired_HiC_NovaSeq_221206/05.R/Fast_higashi/",
        "temp_dir": "/projects/ps-renlab2/y2xie/projects/77.LC/33.Paired_HiC_NovaSeq_221206/05.R/Fast_higashi",
        "header_included": "false",
        "input_format": "higashi_v1",
        "structured": "false",
        "genome_reference_path": "/projects/ps-renlab2/y2xie/projects/genome_ref/hg38.main.chrom.sizes",
        "cytoband_path": "/projects/ps-renlab2/y2xie/projects/genome_ref/hg38.main.cytoBand.txt",
        "chrom_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22"],
        "resolution": 1000000,
        "resolution_cell": 1000000,
        "resolution_fh": [1000000],
        "local_transfer_range": 1,
        "dimensions": 64,
        "loss_mode": "zinb",
        "rank_thres": 1,
        "embedding_name": "SixLine",
        "impute_list": ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22"],
        "minimum_distance": 1000000,
        "maximum_distance": -1,
        "neighbor_num": 5,
        "cpu_num": 1,
        "gpu_num": 1
}

ruochiz commented 1 year ago

Oh... Hey. I think I might have a theory. Could you change "false" in your config, to just false (no ")

Passing "false" in the config, would result in true in python.

ruochiz commented 1 year ago

With your current config.JSON, Higashi think it's a structured dataset (contacts sorted by cell index first), and whenever it encounter a new cell id, it thinks the contacts of that cells ends here.

Xieeeee commented 1 year ago

Ohhhhhhh... This is such a stupid mistake! Remove the quote and it works smoothly. I will go back to re-run my previous analysis also. Thank you very much!

ma-compbio / Higashi

FastHigashi wrapper.prep_dataset: 'int' object has no attribute 'shape' #30