sc-zhang / ALLHiC_components

Some components that speed up and reduce resource cost for original ALLHiC
BSD 3-Clause "New" or "Revised" License

ALLHiC_prune. #5

Open zhangyixing3 opened 1 year ago

zhangyixing3 commented 1 year ago

Hi @sc-zhang, have you tested ALLHiC_prune? I found that its results differ from those of the default ALLHIC_Prune.

 for (long i = 2; i < data.size(); i++) 
     for(std::unordered_map<int, long>::iterator iter=ctgdb.begin(); iter!=ctgdb.end(); iter++)

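I think the inner and outer loops are swapped here. Only by looping over ctgdb first and then over the allele table can each contig pick the highest signal from each row of the allele table.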

if(num_r>numdb[ctg2]){
    allremovedb[ctg2].insert(retaindb[ctg2]);
    retaindb[ctg2] = ctg1;
    numdb[ctg2] = num_r;
}

Here I think you overlooked the signals that also need to be removed in the else case.
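A minimal sketch of the update with the missing branch added. The container types, the function name `update_best`, and the tie handling (a tie keeps the earlier contig) are assumptions made for illustration; the actual declarations in ALLHiC_prune may differ.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>

// Hypothetical container types; the real ALLHiC_prune declarations may differ.
using RetainMap = std::unordered_map<int, int>;   // ctg2 -> best ctg1 so far
using SignalMap = std::unordered_map<int, long>;  // ctg2 -> signal of that best ctg1
using RemoveMap = std::unordered_map<int, std::unordered_set<int>>;  // ctg2 -> losing ctg1s

// Record candidate ctg1 (with signal num_r) for ctg2 within one allele group.
void update_best(int ctg1, int ctg2, long num_r,
                 RetainMap& retaindb, SignalMap& numdb, RemoveMap& allremovedb) {
    assert(num_r >= 0);  // signal counts are assumed to be non-negative
    auto it = retaindb.find(ctg2);
    if (it == retaindb.end()) {
        // First candidate seen for ctg2 in this allele group.
        retaindb[ctg2] = ctg1;
        numdb[ctg2] = num_r;
    } else if (num_r > numdb[ctg2]) {
        // New candidate wins: the previous best becomes a removed signal.
        allremovedb[ctg2].insert(it->second);
        it->second = ctg1;
        numdb[ctg2] = num_r;
    } else {
        // The branch reported as missing: the weaker candidate is removed too.
        allremovedb[ctg2].insert(ctg1);
    }
}
```

With three candidates for one contig, only the strongest is retained and both weaker ones end up in `allremovedb`, regardless of whether they were seen before or after the winner.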

sc-zhang commented 1 year ago

For the first question, the original implementation works the same way: the keys of retaindb are contigs from ctgdb, and each value is the contig from the current allele group that has the highest signal with that key. retaindb is cleared whenever a new allele group is read. For the second one, you are right, I forgot to handle it. The bug you reported will be fixed, with a full test, within a few days.
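To illustrate the first point, here is a small sketch under the assumption that the only state updated in the loop body is keyed by the ctgdb contig and reset for each allele group. The `Signal` table and the function `pick_best` are hypothetical names, not taken from ALLHiC_prune. Under that assumption, each contig sees its candidate alleles in the same order whichever loop is outermost, so both loop orders select the same contig.

```cpp
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical signal table: (ctg1, ctg2) -> number of Hi-C links.
using Signal = std::map<std::pair<int, int>, long>;

// For one allele group, choose for every contig in ctgs the allele with the
// strongest signal. alleles_outer selects which loop is the outer one.
std::unordered_map<int, int> pick_best(const std::vector<int>& alleles,
                                       const std::vector<int>& ctgs,
                                       const Signal& signal,
                                       bool alleles_outer) {
    std::unordered_map<int, int> retaindb;  // rebuilt for each allele group
    std::unordered_map<int, long> numdb;
    auto visit = [&](int ctg1, int ctg2) {
        auto s = signal.find(std::make_pair(ctg1, ctg2));
        if (s == signal.end()) return;  // no links between this pair
        auto n = numdb.find(ctg2);
        if (n == numdb.end() || s->second > n->second) {
            retaindb[ctg2] = ctg1;
            numdb[ctg2] = s->second;
        }
    };
    if (alleles_outer) {
        for (int ctg1 : alleles)
            for (int ctg2 : ctgs) visit(ctg1, ctg2);
    } else {
        for (int ctg2 : ctgs)
            for (int ctg1 : alleles) visit(ctg1, ctg2);
    }
    return retaindb;
}
```

Calling `pick_best` twice on the same inputs, once with `alleles_outer` set to `true` and once to `false`, returns equal maps. This no longer holds if the loop body also touches state that is shared across contigs, which is the case a real comparison against the original tool would need to rule out.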

zhangyixing3 commented 1 year ago

Thank you, but I still have doubts about the first point. I have re-implemented Prune in Rust, and on my own dataset it completely reproduces the default Prune results of ALLHiC.

sc-zhang commented 1 year ago

After fixing the bug you mentioned in the second question, I ran a test on a dataset and compared the results of the original version with those of the bug-fixed version. Because the BAM files are generated by different methods, they differ slightly, but the SAM files converted from them are identical when no header is output. This means the latest version of ALLHiC_prune produces the same result as the original one from the ALLHiC repo. The bug-fixed version of ALLHiC_prune has been uploaded.