Thanks a lot for the amazing software! But I've noticed a few inconsistencies using it.
First of all, while you're doing mnd file, you assign pos1 and pos2 to center of restrictions fragments. Although original juicer software assign pos1 and pos2 to alignment coordinates (from SAM file). Secondly, you drop duplicates on [pos1, pos1] subset, which enormously reduce the number of contacts. What are your reasons for doing these two actions?
The drop_duplicates was an error, I've fixed it in the latest release (0.4.0)
The reason for using midpoint was to try to maintain some consistency for direct and indirect contacts. For direct contacts it's clear which end of each aligned segment is ligated to which, but for indirect contacts it doesn't really make sense to think of one end ligated to another. As a compromise we picked the midpoint.
Thanks a lot for the amazing software! But I've noticed a few inconsistencies using it. First of all, while you're doing mnd file, you assign pos1 and pos2 to center of restrictions fragments. Although original juicer software assign pos1 and pos2 to alignment coordinates (from SAM file). Secondly, you drop duplicates on [pos1, pos1] subset, which enormously reduce the number of contacts. What are your reasons for doing these two actions?
https://github.com/nanoporetech/pore-c/blob/a865b7ea64d7a972fe878e8f9ea4aa7553def363/pore_c/analyses/contacts.py (line 410) df = ( contact_df.get_partition(partition) .compute() .astype({"align1_strand": "uint8", "align2_strand": "uint8"}) .assign( pos1=lambda x: np.rint(x.eval("0.5 (align1_fragment_start + align1_fragment_end)")).astype(int), pos2=lambda x: np.rint(x.eval("0.5 (align2_fragment_start + align2_fragment_end)")).astype(int), ) .drop_duplicates(subset=["pos1", "pos1"])