vigna / sux

Succinct data structures in C/C++
Other
82 stars 17 forks source link

Creating select_upper and selectz_upper in EliasFano.hpp #19

Closed thomasdwu closed 1 year ago

thomasdwu commented 1 year ago

In EliasFano.hpp, line 119, upper_bits.size has the term (num_ones + (num_bits >> l) + 1), which is converted to the number of words by adding 63 and dividing by 64. To be consistent, shouldn't lines 137 and 138 call SimpleSelectHalf and SimpleSelectZeroHalf by also adding 1 to (num_ones + (num_bits >> l)? That way, in SimpleSelectHalf, line 83, and SimpleSelectZeroHalf, line 83, the value of num_words will be equal to upper_bits.size.

vigna commented 1 year ago

Er... or maybe the +1 is unnecessary. I have to think about this—there's clearly an off-by-one in one of the two locations.

thomasdwu commented 1 year ago

I have an example of incorrect behavior in the current code, where I believe that +1 is needed in both locations.

On Tue, Nov 29, 2022 at 6:16 AM Sebastiano Vigna @.***> wrote:

Er... or maybe the +1 is unnecessary. I have to think about this—there's clearly an off-by-one in one of the two locations.

— Reply to this email directly, view it on GitHub https://github.com/vigna/sux/issues/19#issuecomment-1330715600, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUA6S5TTNNRZYAZVHZCDC3WKYF2VANCNFSM6AAAAAASOQKAYY . You are receiving this because you authored the thread.Message ID: @.***>

thomasdwu commented 1 year ago

Also, line 185 has a debugging statement that has the +1, but you understand the code much better than I do.

On Tue, Nov 29, 2022 at 6:20 AM Thomas Wu @.***> wrote:

I have an example of incorrect behavior in the current code, where I believe that +1 is needed in both locations.

On Tue, Nov 29, 2022 at 6:16 AM Sebastiano Vigna @.***> wrote:

Er... or maybe the +1 is unnecessary. I have to think about this—there's clearly an off-by-one in one of the two locations.

— Reply to this email directly, view it on GitHub https://github.com/vigna/sux/issues/19#issuecomment-1330715600, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUA6S5TTNNRZYAZVHZCDC3WKYF2VANCNFSM6AAAAAASOQKAYY . You are receiving this because you authored the thread.Message ID: @.***>

vigna commented 1 year ago

Can you show me the example?

thomasdwu commented 1 year ago

Here is an example. I am attaching a file containing the position of 851 ones, ranging from 0 to 4456275, where nbits is 4458158. These positions are intended to be stored in an Elias-Fano data structure, and L is computed to be 12. Situation A (current code): If we provide (num_ones + (nbits >> L)) (which equals 1939) in creating SelectSimpleZeroHalf, we get the following inventory:

Inventory 0 2 Inventory 1 125539110932840448 Inventory 2 268530804322075212 Inventory 3 370144812833702929 Inventory 4 494558681006278018 Inventory 5 1845 Inventory 6 0 Inventory 7 0 Inventory 8 0 Inventory 9 0 Inventory 10 1939

Situation B: if we provide (num_ones + (nbits >> L) + 1) (which equals 1940) in creating SelectSimpleZeroHalf, we get the following inventory instead:

Inventory 0 2 Inventory 1 125539110932840448 Inventory 2 268530804322075212 Inventory 3 370144812833702929 Inventory 4 494558681006278018 Inventory 5 1845 Inventory 6 6160384 Inventory 7 0 Inventory 8 0 Inventory 9 0 Inventory 10 1940

Now, suppose we call rank on the Elias-Fano data structure for k = 4456468. which is beyond the last one (4456275). This makes a call to selectz_upper.selectZero(1088), because 4456468 >> 12 is 1088. This yields inventory index 1, inventory rank 1845, subrank 64. Under situation A, we get a start of 1845 and a residual of 0 to return a pos of 1845. This gives a rank of (pos - k_shiftr_L) = 1845 - 1088 = 757, which appears to be incorrect. Under situation B, we get a start of 1939 and a residual of 0 to return a pos of 1939. This gives a rank of (pos - k_shiftr_L) = 1939 - 1088 = 851, which corresponds to the last one in the dataset, as we would expect.

I don't follow all of the code for the inventory, but I hope this information helps in your thinking about the issue. Thank you very much for providing the code, which I am using recently in a bioinformatics program called GSNAP for aligning reads to a genomic sequence.

Tom

On Tue, Nov 29, 2022 at 6:30 AM Sebastiano Vigna @.***> wrote:

Can you show me the example?

— Reply to this email directly, view it on GitHub https://github.com/vigna/sux/issues/19#issuecomment-1330735864, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUA6S2J4HBL62J2PHXDDULWKYHO5ANCNFSM6AAAAAASOQKAYY . You are receiving this because you authored the thread.Message ID: @.***>

0 3300 6802 10122 13043 16545 19843 23186 26688 29773 32869 36168 39288 42408 45528 48719 52221 55539 58687 61986 65285 68576 71875 75174 78673 81993 85313 88256 91199 94142 97085 100025 102968 105911 108854 111797 114740 117683 120626 123569 126512 129455 132398 135341 138284 141227 144170 147488 150450 153391 156333 159271 162212 165155 168097 171600 174672 177615 180687 183631 186930 190433 193724 197227 200314 203635 206732 210053 213174 216514 219635 222935 226438 229757 233076 236225 239525 242825 246327 249830 253333 256841 260141 263643 267143 270646 273590 276534 279478 282422 285366 288310 291254 294198 297142 300086 303030 305974 308918 311862 314806 317750 320694 323638 326582 329526 332475 335433 338376 341320 344264 347765 351268 354406 357350 360294 363238 366182 369685 372629 375576 378649 381722 385021 388524 391898 395979 399302 403288 406611 409651 412691 416014 419337 422297 425620 428943 432266 436117 439971 443822 447673 451524 455376 458088 460799 463511 466223 468935 471647 474359 477071 479783 482495 485207 487919 490631 493343 496055 498767 501479 504191 506903 509615 512593 515305 518017 520729 523441 526153 528865 531577 534289 537001 539713 542425 545137 547849 550561 553270 555982 558695 561418 564135 566847 569559 572271 574983 577695 580407 583120 586103 589426 592137 594849 597827 601140 604463 608543 612623 615945 618984 622023 625345 628667 631989 635311 638270 641592 644914 648236 651558 655408 659261 662583 665294 668005 670911 673622 676333 679044 681755 684466 687788 690499 693210 695921 698632 701343 704052 706764 709470 712185 714896 718778 721488 724800 728112 731434 734756 737454 741534 744245 748563 751637 755955 759308 762662 766016 769934 773288 776642 779996 782971 785946 788921 792341 795386 799494 802849 807167 811350 815533 818887 823070 826424 829778 833132 836486 839518 843836 847190 850544 854862 859178 862532 865886 869241 872595 875949 879303 882657 886971 890325 894243 898161 901515 904869 908223 911577 914931 918867 922785 927103 931421 935739 940022 942997 945972 948946 951921 954896 957871 960846 963821 966796 969771 973705 976680 979655 982630 985605 988580 991555 994530 997505 1000480 1003450 1006425 1009470 1012445 1015420 1018395 1021363 1024338 1027313 1030288 1033263 1036238 1039213 1042188 1045163 1048138 1051113 1054088 1057063 1060029 1063004 1065979 1068954 1071928 1074904 1077869 1080843 1083818 1086793 1089768 1092744 1095719 1098697 1102050 1105478 1108833 1112188 1115163 1119481 1122835 1126189 1130316 1133219 1136573 1139927 1143281 1147559 1150591 1153622 1156649 1159680 1163054 1166407 1170725 1173757 1177111 1180465 1183503 1187420 1190458 1193485 1196839 1201153 1205070 1208424 1212551 1216468 1219802 1222777 1225752 1228727 1231702 1234677 1237652 1240627 1243602 1246577 1249552 1252527 1255502 1258477 1261452 1264427 1267402 1270377 1273352 1276327 1279302 1282277 1285252 1288607 1291582 1294557 1297532 1300507 1303482 1306457 1309432 1312407 1315382 1318357 1321332 1324307 1327281 1330249 1333229 1336205 1339179 1342155 1345130 1348485 1351554 1354909 1358264 1361239 1364214 1367189 1370544 1373899 1377254 1380834 1384188 1387163 1391290 1395417 1399544 1402519 1405494 1408849 1412204 1415558 1419686 1424004 1426979 1431990 1437003 1442013 1447026 1453766 1460506 1467246 1473985 1480725 1487465 1494205 1500945 1504598 1508251 1511904 1516690 1521476 1526262 1531048 1540823 1550598 1560373 1570148 1579905 1589384 1598861 1608339 1613604 1619339 1628816 1634550 1640284 1646018 1655775 1661481 1671238 1681013 1686747 1692481 1698215 1707990 1713724 1718925 1724564 1730204 1735403 1745178 1754953 1764728 1774503 1784278 1793916 1803691 1813466 1823241 1833016 1838655 1844302 1851089 1862615 1874133 1885643 1897165 1908699 1920229 1931755 1943281 1954807 1966333 1977855 1989376 2000886 2012396 2023917 2035439 2046969 2057998 2069027 2080024 2091052 2102081 2113110 2124139 2135168 2146692 2158214 2169740 2181266 2192792 2203874 2214947 2226029 2237558 2249092 2260622 2272148 2283674 2295196 2306728 2318264 2329790 2341316 2352838 2364368 2375894 2387404 2398930 2410452 2421988 2433514 2445040 2456566 2468093 2479623 2491084 2502551 2514012 2525473 2536934 2548401 2559862 2571329 2582796 2594262 2605484 2616951 2628412 2639873 2657707 2675537 2693059 2699543 2706028 2712513 2719005 2725488 2731155 2736824 2743316 2748871 2754398 2759924 2765459 2770997 2777028 2783557 2790086 2795796 2801373 2807227 2813081 2818933 2824126 2829319 2836421 2843523 2850626 2857728 2864180 2871164 2877775 2884877 2891979 2899459 2906401 2913761 2921241 2926909 2932611 2938315 2944019 2949687 2955390 2961093 2966761 2972464 2978168 2983871 2989574 2995277 3000980 3006685 3017765 3029336 3040392 3051448 3062922 3073776 3084623 3095399 3109307 3123176 3136995 3149900 3162805 3175711 3188613 3201414 3214859 3227733 3240402 3253001 3256823 3260608 3263982 3267608 3271126 3274752 3278269 3281895 3285521 3289147 3292773 3296595 3300380 3304118 3307744 3311097 3314763 3318429 3322055 3325680 3329036 3332554 3336062 3339688 3343314 3346940 3350566 3354192 3357818 3361444 3365071 3368697 3372323 3375948 3379574 3383200 3386826 3390452 3394078 3397704 3400985 3404807 3408629 3411953 3415610 3419236 3422862 3426544 3430052 3433678 3437304 3440902 3444528 3448154 3451780 3455406 3459032 3462658 3465974 3469524 3473074 3476617 3480160 3483703 3487246 3490789 3494339 3497883 3501426 3504970 3508514 3511704 3514891 3518081 3521271 3524819 3528367 3531919 3534969 3538019 3541360 3544908 3548451 3551999 3555545 3558683 3561821 3564875 3567627 3570765 3573903 3577041 3580179 3583317 3586455 3589593 3592702 3595733 3598871 3601926 3604980 3608034 3610863 3613727 3616781 3620277 3623766 3626908 3630418 3638379 3646340 3654302 3657846 3661390 3664932 3668476 3672020 3675562 3679083 3682596 3686137 3689689 3693246 3697016 3700786 3704556 3708254 3719800 3731346 3744073 3756801 3768347 3779894 3792201 3804929 3817232 3829538 3841085 3852632 3864179 3875726 3887473 3899243 3911007 3922767 3934519 3946275 3958041 3969805 3982121 3993854 4005625 4017943 4029708 4041465 4053244 4064999 4076761 4088527 4100290 4112045 4123810 4135562 4147333 4159093 4170840 4182606 4194352 4206109 4217874 4229628 4241384 4253151 4264849 4276617 4288389 4300156 4311928 4323680 4335435 4347196 4358949 4370713 4371587 4372307 4373181 4382451 4391721 4400991 4410261 4419531 4430138 4440747 4451356 4451902 4452489 4454392 4456275

vigna commented 1 year ago

Yes, you're right. The point is that if the upper bound u is not a multiple of 2𝓁, then there might be elements whose value of the upper bits is u >> 𝓁, which means you need that position to be part of the upper bits.

If you can give me a real name I'll thank you in the CHANGES section.

thomasdwu commented 1 year ago

Thanks for confirming the change to your code. My name is Thomas Wu, with my signature shown below. Thanks, Tom

Thomas D. Wu, M.D., Ph.D.

Distinguished Scientist, Bioinformatics & Computational Biology

Genentech https://www.gene.com/, A Member of the Roche Group

@.***

Join Genentech on LinkedIn https://www.linkedin.com/company/genentech | Twitter https://twitter.com/genentech?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor | Facebook https://www.facebook.com/Genentech/ | Instagram https://www.instagram.com/genentech/?hl=en | YouTube https://www.youtube.com/genentech

On Sun, Jun 11, 2023 at 7:43 AM Sebastiano Vigna @.***> wrote:

Yes, you're right. The point is that if the upper bound u is not a multiple of 2𝓁, then there might be elements whose value of the upper bits is u >> 𝓁, which means you need that position to be part of the upper bits.

If you can give me a real name I'll thank you in the CHANGES section.

— Reply to this email directly, view it on GitHub https://github.com/vigna/sux/issues/19#issuecomment-1586192650, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUA6S6UFNXXJKHBPCHOFKTXKXKQHANCNFSM6AAAAAASOQKAYY . You are receiving this because you authored the thread.Message ID: @.***>