iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Improving cluster size threshold choice #328

Open leoisl opened 1 year ago

leoisl commented 1 year ago

We could either choose a cluster size threshold automatically or at least provide a cluster size histogram to the user. Right now cluster sizes can be retrieved by parsing the debugging files, but it might be worth upgrading this to a histogram that is created by default. See https://github.com/mbhall88/drprg-paper/issues/2
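
To make the histogram idea concrete, here is a minimal Python sketch of what such a default output could look like. It assumes the cluster sizes are already available as a plain list of integers; the function names `size_histogram` and `render_histogram` and the sample sizes are hypothetical, and nothing here reflects pandora's actual code or debug file format.

```python
from collections import Counter


def size_histogram(cluster_sizes):
    """Map each cluster size to the number of clusters of that size."""
    return dict(sorted(Counter(cluster_sizes).items()))


def render_histogram(hist, width=40):
    """Render a {size: count} histogram as text.

    The most frequent size gets a bar of `width` characters; every
    other bar is scaled relative to it (minimum one character).
    """
    if not hist:
        return ""
    peak = max(hist.values())
    lines = []
    for size, count in hist.items():
        bar = "#" * max(1, round(count * width / peak))
        lines.append(f"{size:>4} {count:>7} {bar}")
    return "\n".join(lines)


# Hypothetical cluster sizes, for illustration only.
sizes = [1, 1, 2, 7, 8, 8, 8, 9, 9, 10]
hist = size_histogram(sizes)
print(render_histogram(hist))
```

A text rendering like this could be written next to the existing outputs without adding a plotting dependency, but that is a design suggestion rather than anything decided in this issue.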

mbhall88 commented 1 year ago

As an example, here is the cluster size distribution for a HiSeq 2000 run with 75bp reads (first column: number of clusters, second column: cluster size):

    176 1
    194 2
    324 3
    399 4
    647 5
    927 6
   2747 7
   5190 8
   5987 9
   5047 10
   2236 11
    727 12
    328 13
    135 14
     51 15
      6 16

And here is the distribution for a 250bp Illumina sample for the same region:

  81806 1
  22335 2
   1485 3
    693 4
    382 5
    374 6
    455 7
    520 8
    434 9
    487 10
    441 11
    539 12
    541 13
    643 14
    504 15
    615 16
    696 17
    578 18
    713 19
    698 20
    641 21
    723 22
    674 23
    728 24
    673 25
    717 26
    697 27
    749 28
    746 29
    767 30
    761 31
    836 32
    949 33
   1312 34
   1495 35
   1772 36
   2021 37
   2358 38
   2402 39
   2235 40
   1875 41
   1358 42
    848 43
    666 44
    518 45
    355 46
    271 47
    148 48
    150 49
     41 50
     35 51
     10 52
      5 53
     17 54
      2 55
      7 56
      9 57
      7 58
      2 59
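
For the "choose automatically" option, one possible heuristic is sketched below in Python. It is an assumption for discussion, not pandora's behaviour: take the main peak to be the cluster size that accounts for the most hits in total (count × size), then suggest the least frequent size at or below that peak, i.e. the valley between the small-cluster noise and the main peak. The names `parse_counts` and `suggest_threshold` are hypothetical, and the input format is assumed to be the "count size" lines shown above.

```python
def parse_counts(text):
    """Parse "<count> <size>" lines into a {size: count} dict."""
    hist = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            count, size = map(int, fields)
            hist[size] = count
    return hist


def suggest_threshold(hist):
    """Suggest a minimum cluster size from a {size: count} histogram.

    Heuristic (an assumption, not pandora's implementation):
    1. main peak = size maximising count * size;
    2. threshold = size with the lowest count at or below that peak
       (ties resolved towards the smaller size).
    """
    if not hist:
        raise ValueError("empty histogram")
    peak = max(hist, key=lambda s: hist[s] * s)
    candidates = [s for s in hist if s <= peak]
    return min(candidates, key=lambda s: (hist[s], s))


# The 75bp HiSeq 2000 distribution quoted in this thread.
HISEQ_75BP = """
    176 1
    194 2
    324 3
    399 4
    647 5
    927 6
   2747 7
   5190 8
   5987 9
   5047 10
   2236 11
    727 12
    328 13
    135 14
     51 15
      6 16
"""

print(suggest_threshold(parse_counts(HISEQ_75BP)))  # prints 1
```

On the 75bp data there is no low-size noise peak, so the sketch returns 1 (no filtering). On the 250bp distribution it places the main peak at size 39 and the valley at size 6. The margin there is narrow, though: size 39 holds 2402 × 39 = 93678 hits against 81806 for size 1, so a slightly larger noise peak would defeat this peak definition. Something more robust, such as smoothing the counts first, would probably be needed before this could replace a user-chosen threshold.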