ma-compbio / MATCHA

multiway chromatin interaction, 3D genome, single-nucleus, hypergraph representation learning
MIT License

Correspondence between generated data and provided ones #4

Closed: yyou1996 closed this issue 2 years ago

yyou1996 commented 2 years ago

Dear MATCHA authors,

Thanks for your great efforts. I am currently trying to reproduce your results before building on the code. According to the README (https://github.com/ma-compbio/MATCHA/tree/master/History_version#running-command), the provided data do not include hyperedges with occurrence frequency 2, which, according to the paper, are important for hyperedges of size 5. I therefore tried to regenerate them with your code.

Below is what I regenerated with process_SPRITE.py and analysis_SPRITE.py.

In text: 2_3_3.npy, 3_5_3.npy, 3_freq, 5_8_3.npy, 8_12_3.npy, dict_3node, upper_3.npy.

Below is what I extracted from your provided data, occ_3_8.zip.

In text: 3_5_3_intra_inter.npy, 3_5_4_intra_inter.npy, 3_5_5_intra_inter.npy, 3_5_filter_3.npy, 3_5_filter_4.npy, 3_5_filter_5.npy, 5_8_3_intra_inter.npy, 5_8_4_intra_inter.npy, 5_8_5_intra_inter.npy, 5_8_filter_3.npy, 5_8_filter_4.npy, 5_8_filter_5.npy.

I am wondering how the two sets of files correspond, since their names differ.

ruochiz commented 2 years ago

Thanks for your interest!

The files generated by the code are named as follows: x_y_z.npy contains hyperedges of size z whose occurrence frequencies are between x and y. The 3_freq folder contains temporary files.

For the data provided in the zip file, x_y_z is organized the same way as above. Files with intra_inter in the name hold a flag for each hyperedge indicating whether it is a purely intra-chromosomal hyperedge or an inter-chromosomal hyperedge; files with "filter" in the name hold the actual hyperedges. The two files for a given x_y_z have the same length.
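As a rough illustration of the two naming schemes, here is a small sketch that maps a file name to its (x, y, z) triple and its role. The helper `parse_name`, the `HyperedgeFile` tuple, and the `kind` labels are hypothetical names introduced here for illustration; they are not part of the MATCHA code. Whether the frequency bounds x and y are inclusive is not stated in this thread, so the sketch only records them.

```python
import re
from typing import NamedTuple, Optional


class HyperedgeFile(NamedTuple):
    """Hypothetical record for a parsed file name (not part of MATCHA)."""
    min_freq: int  # x: lower frequency bound (inclusivity not specified here)
    max_freq: int  # y: upper frequency bound (inclusivity not specified here)
    size: int      # z: hyperedge size
    kind: str      # "generated", "filter", or "intra_inter"


# One pattern per naming scheme mentioned in this thread.
_PATTERNS = [
    (re.compile(r"^(\d+)_(\d+)_(\d+)\.npy$"), "generated"),
    (re.compile(r"^(\d+)_(\d+)_filter_(\d+)\.npy$"), "filter"),
    (re.compile(r"^(\d+)_(\d+)_(\d+)_intra_inter\.npy$"), "intra_inter"),
]


def parse_name(name: str) -> Optional[HyperedgeFile]:
    """Return the parsed fields, or None for names outside these schemes."""
    for pattern, kind in _PATTERNS:
        match = pattern.match(name)
        if match:
            x, y, z = (int(group) for group in match.groups())
            return HyperedgeFile(x, y, z, kind)
    return None


if __name__ == "__main__":
    for name in ["3_5_3.npy", "5_8_filter_4.npy", "3_5_5_intra_inter.npy", "upper_3.npy"]:
        print(name, "->", parse_name(name))
```

Under this reading, the generated 3_5_3.npy and the provided 3_5_filter_3.npy share the same (x, y, z) triple. Names such as upper_3.npy and dict_3node fall outside both schemes and are returned as None.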

The "intra_inter" files never affect training; they are used only in benchmarking, where we calculated AUC / accuracy / AUPR separately for intra-chromosomal and inter-chromosomal hyperedges.
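To make the role of the flags concrete, here is a minimal sketch of splitting hyperedges into intra- and inter-chromosomal groups before computing metrics on each group. The function `split_by_flag` is a hypothetical helper, not MATCHA code, and the flag encoding is an assumption: this thread does not say which value marks an intra-chromosomal hyperedge, so it is left as a parameter.

```python
from typing import Any, List, Sequence, Tuple


def split_by_flag(
    hyperedges: Sequence[Any],
    flags: Sequence[Any],
    intra_value: Any = 1,  # ASSUMPTION: the actual encoding is not documented here
) -> Tuple[List[Any], List[Any]]:
    """Split hyperedges into (intra, inter) lists using a parallel flag sequence."""
    # The "filter" and "intra_inter" files are described as having the same length.
    if len(hyperedges) != len(flags):
        raise ValueError(
            f"length mismatch: {len(hyperedges)} hyperedges vs {len(flags)} flags"
        )
    intra: List[Any] = []
    inter: List[Any] = []
    for edge, flag in zip(hyperedges, flags):
        if flag == intra_value:
            intra.append(edge)
        else:
            inter.append(edge)
    return intra, inter


if __name__ == "__main__":
    # Toy data only; real inputs would be the paired arrays from the zip file.
    edges = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
    flags = [1, 0, 1]
    intra_edges, inter_edges = split_by_flag(edges, flags)
    print(len(intra_edges), "intra,", len(inter_edges), "inter")
```

The same function should accept arrays loaded with `numpy.load` from a matching pair such as 3_5_filter_3.npy and 3_5_3_intra_inter.npy, after checking which flag value actually denotes intra-chromosomal hyperedges.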

Overall, the main branch in the current repo is maintained in a more readable manner and is compatible with the SPRITE data.