Closed: magnitov closed this issue 3 years ago
It should not be hard to do. I'm just wondering why you would need such a modification. I have never needed to omit the dedup step.
Hi Maxim,
We are trying a specific variation of the Hi-C protocol that lacks the sonication step. As a result, many reads have identical ends after the digestion and ligation steps, but they are not real duplicates.
How can I tune the .yml file to turn off the deduplication step?
Mikhail
Check out the following section of the project.yml:
https://github.com/open2c/distiller-nf/blob/5177389afe30e460b45f7982915e353b069ea639/project.yml#L166
In theory, max_mismatch_bp: 0 should do the job...
In practice, we'd need to double-check that.
If your dataset is small enough, you could just rerun it with this flag and check the result.
max_mismatch_bp: 0 will mark exactly matching reads as duplicates, which is what I usually use. So it's not what you need...
Oh yes, you are right! Sorry about that!
From my reading of the code, there is no way to do it without modifying the distiller.nf file... But an option for deactivating dedup could be a relatively simple modification, I think. Curious what protocol you are trying!
Thanks @Phlya, I'll try to do it then.
Also, I think it should be possible to retrieve all pairs marked as duplicates and use them to create a new cooler. Am I correct?
Yes, they are normally saved in a separate pairs file. You can use them to build a cooler and then merge the coolers.
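Conceptually, merging two coolers that share the same bins amounts to summing the counts for each bin pair. The toy sketch below illustrates only that idea in plain Python; it does not use the cooler API, and the function name `merge_pixel_counts` and the dictionary layout are hypothetical.

```python
from collections import Counter


def merge_pixel_counts(*pixel_tables):
    """Sum contact counts per (bin1, bin2) key across several tables.

    Each table is a dict mapping (bin1_id, bin2_id) -> count.
    This is an illustration of the idea, not how cooler stores data.
    """
    merged = Counter()
    for table in pixel_tables:
        # Counter.update adds counts for keys that already exist.
        merged.update(table)
    return dict(merged)


# Hypothetical counts from the deduplicated pairs and from the pairs
# that were marked as duplicates.
deduped = {(0, 0): 5, (0, 1): 2}
duplicates = {(0, 1): 1, (1, 1): 3}

merged = merge_pixel_counts(deduped, duplicates)
print(merged)
```

On real files, the `cooler merge` command performs the equivalent operation.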
I will also check, but maybe if you set the deduplication cutoff to -1, it will not filter duplicates.
I haven't tested it, but I don't see why it wouldn't work if you just set min_distance to -1.
Mark_duplicates would simply not mark any pairs as duplicates, because the condition would never be true.
https://github.com/open2c/pairtools/blob/master/pairtools/_dedup.pyx#L125
If that works, it may be the easiest option, at the expense of still running the deduplication code.
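The reasoning above can be illustrated with a toy model. This is a minimal sketch, not the pairtools implementation: it assumes two pairs count as duplicates when they share chromosomes and strands and both ends differ by at most the cutoff. The function name `is_duplicate` and the tuple layout are hypothetical.

```python
def is_duplicate(pair_a, pair_b, max_mismatch):
    """Toy duplicate test (assumed rule, not the actual pairtools code).

    Each pair is (chrom1, pos1, strand1, chrom2, pos2, strand2).
    """
    chrom1_a, pos1_a, strand1_a, chrom2_a, pos2_a, strand2_a = pair_a
    chrom1_b, pos1_b, strand1_b, chrom2_b, pos2_b, strand2_b = pair_b

    same_layout = (chrom1_a, strand1_a, chrom2_a, strand2_a) == (
        chrom1_b, strand1_b, chrom2_b, strand2_b
    )
    # abs(...) is never negative, so a cutoff of -1 can never be satisfied.
    return (
        same_layout
        and abs(pos1_a - pos1_b) <= max_mismatch
        and abs(pos2_a - pos2_b) <= max_mismatch
    )


pair = ("chr1", 1000, "+", "chr1", 5000, "-")
exact_copy = ("chr1", 1000, "+", "chr1", 5000, "-")
near_copy = ("chr1", 1001, "+", "chr1", 5000, "-")

print(is_duplicate(pair, exact_copy, 0))   # cutoff 0: exact matches are duplicates
print(is_duplicate(pair, exact_copy, -1))  # cutoff -1: nothing is a duplicate
print(is_duplicate(pair, near_copy, 1))    # cutoff 1: a 1 bp offset still matches
```

Under this assumed rule, a cutoff of 0 still flags exact matches, while a cutoff of -1 flags nothing, even for identical pairs.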
Thanks, @mimakaev!
I have tested your suggestion to set max_mismatch_bp: -1, and it does not filter any duplicates, exactly as you expected. I have checked the statistics and visually explored the contact matrices, and so far it looks like what I need.
I guess this solves my issue. Thanks, everyone, for the support!
Hi,
I wonder whether it is currently possible to run distiller without any duplicate filtering?
Mikhail