lh3 / hickit

TAD calling, phase imputation, 3D modeling and more for diploid single-cell Hi-C (Dip-C) and general Hi-C

Clarification on `seg` format #30

Closed: b2jia closed this issue 2 years ago

b2jia commented 2 years ago

The current documentation on the seg format is sparse. Can someone help disambiguate the seg format?

Below I've copied an example Dip-C read (GSE162511); the fields within each segment are delimited by `!`. While the first 3 fields are more or less decipherable, what do the last 4 fields correspond to?

K00261:212:HC2YYBBXY:7:1101:13443:1191  chr13!56249365!56249431!+!.!60!1        chr11!95992771!95992813!+!.!60!1

chromosome | start coordinate | end coordinate | strand | haplotype? | score? | ? (always 1 or 2)

tanlongzhi commented 2 years ago

Hi @b2jia, you're right that the third-to-last field is the haplotype and the second-to-last field is the mapping quality. You can find the relevant code in the file hickit.js.

I'm not sure what the last field is. However, as far as I understand from code in io.c, the last field is NOT used in downstream analysis -- @lh3 please correct me if I'm wrong.

lh3 commented 2 years ago

The last field counts the number of distant segments a read pair has. It is not actually used IIRC.

b2jia commented 2 years ago

Thank you both!
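
For anyone landing here later, the field layout worked out in this thread can be sketched as a small parser. This is only an illustration based on the answers above, not code from hickit: the names `Segment`, `parse_seg_line`, `phase` and `n_distant` are made up for the example, and reading `.` as "no haplotype assigned" is an assumption drawn from the sample line rather than from the hickit source.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One `!`-delimited segment of a seg line (field names are hypothetical)."""
    chrom: str
    start: int
    end: int
    strand: str
    phase: str      # haplotype field; "." in the example (assumed: unassigned)
    mapq: int       # mapping quality
    n_distant: int  # per lh3: number of distant segments; reportedly unused


def parse_seg_line(line: str) -> Tuple[str, List[Segment]]:
    """Split a seg line into the read name and its list of segments.

    Assumes whitespace-separated columns: the read name first, then one
    column per segment, each holding seven `!`-delimited fields.
    """
    name, *columns = line.split()
    segments = []
    for column in columns:
        chrom, start, end, strand, phase, mapq, n_distant = column.split("!")
        segments.append(
            Segment(chrom, int(start), int(end), strand, phase,
                    int(mapq), int(n_distant))
        )
    return name, segments


example = (
    "K00261:212:HC2YYBBXY:7:1101:13443:1191\t"
    "chr13!56249365!56249431!+!.!60!1\t"
    "chr11!95992771!95992813!+!.!60!1"
)
name, segs = parse_seg_line(example)
for seg in segs:
    print(seg.chrom, seg.start, seg.end, seg.strand, seg.phase, seg.mapq)
```

The phase field is kept as a string because the thread only shows the `.` value; how other haplotype values are encoded should be checked against hickit.js before relying on this.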