nanoporetech / pore-c

Pore-C support
Mozilla Public License 2.0
52 stars 5 forks source link

mnd file #44

Open taskina-alena opened 3 years ago

taskina-alena commented 3 years ago

Thanks a lot for the amazing software! But I've noticed a few inconsistencies using it. First of all, while you're doing mnd file, you assign pos1 and pos2 to center of restrictions fragments. Although original juicer software assign pos1 and pos2 to alignment coordinates (from SAM file). Secondly, you drop duplicates on [pos1, pos1] subset, which enormously reduce the number of contacts. What are your reasons for doing these two actions?

https://github.com/nanoporetech/pore-c/blob/a865b7ea64d7a972fe878e8f9ea4aa7553def363/pore_c/analyses/contacts.py (line 410) df = ( contact_df.get_partition(partition) .compute() .astype({"align1_strand": "uint8", "align2_strand": "uint8"}) .assign( pos1=lambda x: np.rint(x.eval("0.5 (align1_fragment_start + align1_fragment_end)")).astype(int), pos2=lambda x: np.rint(x.eval("0.5 (align2_fragment_start + align2_fragment_end)")).astype(int), ) .drop_duplicates(subset=["pos1", "pos1"])

eharr commented 3 years ago

Thanks for pointing that out:

  1. The drop_duplicates was an error, I've fixed it in the latest release (0.4.0)
  2. The reason for using midpoint was to try to maintain some consistency for direct and indirect contacts. For direct contacts it's clear which end of each aligned segment is ligated to which, but for indirect contacts it doesn't really make sense to think of one end ligated to another. As a compromise we picked the midpoint.