mervesa / HiCDCPlus

effective length in function of construct_features #13

Closed TingtingSsl2 closed 1 year ago

TingtingSsl2 commented 2 years ago

Dear HiCDCPlus developer,

Could you explain the effective length in the output of construct_features?

Code to generate output:

    construct_features(output_path=paste0(outdir,"/test"), gen=speci, gen_ver=genv, sig=c("GATC","GANTC"), bin_type="Bins-uniform", binsize=50000)

The output looks like:

    $ zcat hg38_50kb_GATC_GANTC_bintolen.txt.gz | head
    bins               gc                 len
    chr1-1-50000       0.475684596577017  37765
    chr1-50001-100000  0.390174311926606  49418
    chr1-100001-150000 0.441838649155722  49203
    chr1-150001-200000 0.477571157495256  46540
    chr1-200001-250000 0.485352112676056  7754
    chr1-250001-300000 0.405630530973451  40972
    chr1-300001-350000 0.442142857142857  2480
    chr1-350001-400000 0.462342954159593  48969
    chr1-400001-450000 0.392915851272016  48102

One other question: I provided both binsize and sig to construct_features, but I only see bins in the output file. Where can I find the enzyme cut sites in the output file?

Thank you again for helping!

Best,

Tingting

mervesa commented 2 years ago

Hi @TingtingSsl2, effective length is the effective sequence space for uniform bin pairs. Following Carty et al., 2017, for each of the corresponding genomic intervals we take the fraction of the interval that lies within 500 bp of a restriction enzyme (RE) cut site inside that interval. The effective sequence space of a bin pair is the product of these two fractions.
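To make the definition above concrete, here is a minimal base-R sketch of one way to read it: count the bases of a bin that fall within 500 bp of any cut site located inside that bin. This is not HiCDCPlus's actual implementation. The helper `effective_length`, its `window` argument, and the cut-site positions are all hypothetical and chosen only for illustration. The len values in the question never exceed the 50,000 bp bin size, which would be consistent with a count in base pairs, but that reading is an assumption.

```r
# Hypothetical helper (not part of HiCDCPlus): number of bases in
# [bin_start, bin_end] lying within `window` bp of a cut site in the bin.
effective_length <- function(bin_start, bin_end, cut_sites, window = 500L) {
  # Keep only cut sites that fall inside the bin.
  sites <- cut_sites[cut_sites >= bin_start & cut_sites <= bin_end]
  if (length(sites) == 0L) return(0L)

  # One flag per base in the bin; TRUE means "within `window` bp of a site".
  covered <- logical(bin_end - bin_start + 1L)
  for (s in sites) {
    lo <- max(bin_start, s - window)  # clip the window to the bin
    hi <- min(bin_end, s + window)
    covered[(lo - bin_start + 1L):(hi - bin_start + 1L)] <- TRUE
  }
  sum(covered)  # overlapping windows are counted once
}

# Made-up cut sites in a 50 kb bin.
len_a <- effective_length(1L, 50000L, c(300L, 900L, 20000L))
len_b <- effective_length(50001L, 100000L, c(60000L))

# Fractions per bin, and their product for the bin pair.
frac_a <- len_a / 50000
frac_b <- len_b / 50000
pair_space <- frac_a * frac_b
```

The per-base flag vector keeps the sketch short and handles overlapping windows without extra logic. For whole-genome use an interval-merging approach would be more economical.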

As for your other question, you can try binsize=1 and bin_type='Bins-RE-sites' to get cut sites on a separate gi_list instance.
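A sketch of what that could look like, reusing the arguments from the question and changing only bin_type and binsize. The call is commented out because it needs HiCDCPlus and the matching genome package installed. It is assumed here, not confirmed in this thread, that the output keeps the same `bins gc len` layout, with bin boundaries following the RE fragments. `bins_to_ranges` is a hypothetical helper, the "/test_RE" output name is made up, and the sketch assumes chromosome names contain no hyphens.

```r
# Not run here: requires HiCDCPlus plus the genome package for `gen`/`gen_ver`.
# construct_features(output_path = paste0(outdir, "/test_RE"),
#                    gen = speci, gen_ver = genv,
#                    sig = c("GATC", "GANTC"),
#                    bin_type = "Bins-RE-sites", binsize = 1)

# Hypothetical helper: split "chr-start-end" bin names into coordinates.
bins_to_ranges <- function(bins) {
  parts <- strsplit(bins, "-", fixed = TRUE)
  data.frame(
    chr   = vapply(parts, function(p) p[1], character(1)),
    start = as.integer(vapply(parts, function(p) p[2], character(1))),
    end   = as.integer(vapply(parts, function(p) p[3], character(1))),
    stringsAsFactors = FALSE
  )
}

# Bin names taken from the output shown in the question.
ranges <- bins_to_ranges(c("chr1-1-50000", "chr1-50001-100000"))
```

With RE-based bins, the start and end columns would then give the fragment boundaries directly, if the layout assumption holds.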