open2c / distiller-nf

A modular Hi-C mapping pipeline
MIT License

Running without duplicates filtering #163

Closed magnitov closed 3 years ago

magnitov commented 3 years ago

Hi,

Is it currently possible to run distiller without any duplicate filtering?

Mikhail

mimakaev commented 3 years ago

It should not be hard to do. Just wondering why you would need such a modification. I have never needed to omit the dedup step.

magnitov commented 3 years ago

Hi Maxim,

We are trying a specific variation of the Hi-C protocol that lacks the sonication step. As a result, a lot of reads after the digestion and ligation steps have the same ends, but they are not real duplicates.

How can I tune the .yml file to turn off the deduplication step?

Mikhail

sergpolly commented 3 years ago

Check out the following section of project.yml: https://github.com/open2c/distiller-nf/blob/5177389afe30e460b45f7982915e353b069ea639/project.yml#L166

In theory, max_mismatch_bp: 0 should do the job. In practice, we'd need to double-check that. If your dataset is small enough, you could just rerun it with this flag and check the result.

Phlya commented 3 years ago

max_mismatch_bp: 0 will mark exactly matching reads as duplicates, which is what I actually usually use. So not what you need...

sergpolly commented 3 years ago

Oh yes, you are right! Sorry about that!

Phlya commented 3 years ago

From my reading of the code, there is no way to do it without modifying the distiller.nf file. But adding an option to deactivate dedup could be a relatively simple modification, I think. Curious what protocol you are trying!

magnitov commented 3 years ago

Thanks @Phlya, I'll try to do it then.

Also, I think it should be possible to retrieve all pairs marked as duplicates, and use them to create a new cooler. Am I correct?

Phlya commented 3 years ago

Yes, they are normally saved in a separate pair file, you can use them, and then merge the coolers.
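As a rough illustration of why merging works here: a contact map built from the non-duplicate pairs plus one built from the duplicate pairs is just the per-bin sum of the two. The sketch below uses only the standard library and toy, made-up records in the tab-separated .pairs column order (readID, chrom1, pos1, chrom2, pos2, strand1, strand2, pair_type). The helper names `bin_pairs` and `merge_counts` are hypothetical; in a real run you would build and merge actual coolers with the cooler tools rather than with this code.

```python
from collections import Counter


def bin_pairs(lines, binsize):
    """Count contacts per (chrom1, bin1, chrom2, bin2) from .pairs-style lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and header lines, which start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom1, pos1 = fields[1], int(fields[2])
        chrom2, pos2 = fields[3], int(fields[4])
        counts[(chrom1, pos1 // binsize, chrom2, pos2 // binsize)] += 1
    return counts


def merge_counts(*count_maps):
    """Sum several binned contact maps, bin by bin."""
    merged = Counter()
    for counts in count_maps:
        merged.update(counts)
    return merged


# Toy records (hypothetical data, not from a real experiment).
nodups = ["r1\tchr1\t1000\tchr1\t52000\t+\t-\tUU\n"]
dups = [
    "r2\tchr1\t1000\tchr1\t52000\t+\t-\tDD\n",
    "r3\tchr1\t1200\tchr2\t300\t+\t+\tDD\n",
]

merged = merge_counts(bin_pairs(nodups, 10_000), bin_pairs(dups, 10_000))
```

The merged map then contains every contact exactly once, regardless of whether dedup had flagged it.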

mimakaev commented 3 years ago

I will also check, but maybe if you set deduplication cutoff to -1 it will not filter duplicates.

On Thu, Dec 24, 2020 at 4:10 AM Ilya Flyamer wrote:

> Yes, they are normally saved in a separate pair file, you can use them, and then merge the coolers.


mimakaev commented 3 years ago

I haven't tested it, but I don't see why it wouldn't work if you just set the cutoff (max_mismatch_bp in project.yml) to -1.

Mark_duplicates would just not mark any as duplicates as the condition would never be true.

https://github.com/open2c/pairtools/blob/master/pairtools/_dedup.pyx#L125

If that works, it may be the easiest option, at the expense of running the deduplication code anyway.
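The reasoning can be sketched with a simplified stand-in for the duplicate test. This is not the pairtools implementation; it assumes two pairs count as duplicates when chromosomes and strands match and both ends differ by at most `max_mismatch` base pairs. The function name `is_duplicate` and the tuple layout are hypothetical.

```python
def is_duplicate(pair_a, pair_b, max_mismatch):
    """Simplified duplicate test; each pair is
    (chrom1, pos1, strand1, chrom2, pos2, strand2)."""
    c1a, p1a, s1a, c2a, p2a, s2a = pair_a
    c1b, p1b, s1b, c2b, p2b, s2b = pair_b
    same_layout = (c1a, s1a, c2a, s2a) == (c1b, s1b, c2b, s2b)
    # abs(...) is never negative, so with max_mismatch = -1
    # neither comparison below can ever be true.
    return (
        same_layout
        and abs(p1a - p1b) <= max_mismatch
        and abs(p2a - p2b) <= max_mismatch
    )


a = ("chr1", 1000, "+", "chr1", 52000, "-")
b = ("chr1", 1002, "+", "chr1", 52001, "-")

exact_with_zero = is_duplicate(a, a, 0)      # identical ends, cutoff 0
exact_with_minus_one = is_duplicate(a, a, -1)  # identical ends, cutoff -1
near_with_zero = is_duplicate(a, b, 0)       # ends differ by 2 and 1 bp
near_with_three = is_duplicate(a, b, 3)
```

Under this assumption, a cutoff of 0 still flags identical pairs, while a cutoff of -1 flags nothing, because an absolute difference can never be less than or equal to -1.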

magnitov commented 3 years ago

Thanks, @mimakaev! I have tested your suggestion to set max_mismatch_bp: -1, and it does not filter any duplicates, exactly as you expected. I have checked the statistics and visually explored the contact matrices, and so far it looks like what I need.

I guess this solves my issue. Thanks everyone for the support!
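One way to make the statistics check repeatable is to read the duplicate counts from the stats output and confirm that nothing was flagged. The sketch below assumes a tab-separated key/value stats text with `total_dups` and `total_nodups` entries, as written by pairtools stats; the sample text and its numbers are made up for illustration, and the helper names are hypothetical.

```python
def parse_stats(text):
    """Parse tab-separated 'key<TAB>value' lines, keeping integer values only."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # not a key/value line
        try:
            stats[key] = int(value)
        except ValueError:
            pass  # ignore non-integer entries
    return stats


def duplication_fraction(stats):
    """Fraction of mapped pairs that were flagged as duplicates."""
    dups = stats["total_dups"]
    nodups = stats["total_nodups"]
    total = dups + nodups
    return dups / total if total else 0.0


# Made-up sample resembling a run with max_mismatch_bp: -1.
SAMPLE = "total\t1000\ntotal_mapped\t800\ntotal_dups\t0\ntotal_nodups\t800\n"

fraction = duplication_fraction(parse_stats(SAMPLE))
```

A fraction of zero is what you would expect if the deduplication step ran but flagged nothing.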