Call per-sample variants with pandora

mbhall88 commented 4 years ago

Tasks that fall under this issue:

[x] pandora map looking for de novo variants
[x] update MSAs with discovered sequences
[x] turn updated MSAs into PRGs (make_prg)
[x] index new PRG
[x] pandora map not looking for de novo variants

mbhall88 commented 4 years ago

An interesting analysis is to see whether de novo discovery is "more active" when given a sparse PRG, as would be expected. That is, we would expect de novo to find more candidate regions in a sparse PRG than in a dense one.
To that end, I have taken the ratio of the number of candidate regions (that produce valid paths) for the sparse vs. dense PRG (i.e. the ratio is sparse/dense).

[Histogram of the sparse/dense candidate-region ratio; axis from 0.9791 to 1.0056. Tot: 1.50e+02 Avg: 9.93e-01 Std: 4.61e-03]

So, we actually see fewer candidate regions in the sparse PRG...
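For reference, the ratio described above could be computed along these lines. This is only a minimal sketch: it assumes one candidate-region count per sample for each PRG, and the function name, sample names, and counts are all hypothetical, not taken from the pipeline.

```python
from statistics import mean


def sparse_dense_ratios(sparse_counts, dense_counts):
    """Return the per-sample ratio sparse/dense of candidate-region counts.

    Both arguments map a sample name to the number of candidate regions
    (that produced valid paths) found with that PRG. Samples missing from
    either mapping, or with a dense count of zero, are skipped.
    """
    ratios = {}
    for sample, dense in dense_counts.items():
        if sample not in sparse_counts or dense == 0:
            continue  # no usable pair of counts for this sample
        ratios[sample] = sparse_counts[sample] / dense
    return ratios


# Hypothetical counts, purely illustrative (not real results).
sparse = {"s1": 980, "s2": 1010, "s3": 495}
dense = {"s1": 1000, "s2": 1000, "s3": 500}

ratios = sparse_dense_ratios(sparse, dense)
for sample, ratio in sorted(ratios.items()):
    print(f"{sample}\t{ratio:.4f}")
print(f"mean ratio: {mean(ratios.values()):.4f}")
```

A ratio below 1 for a sample means de novo found fewer candidate regions with the sparse PRG than with the dense one.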

iqbal-lab commented 4 years ago

What was the trigger for detecting a candidate region again? More than N bases with depth==0? No, you switched to something looking at the ratio of coverages of adjacent bases?

mbhall88 commented 4 years ago

> What was the trigger for detecting a candidate region again? More than N bases with depth==0? No, you switched to something looking at the ratio of coverages of adjacent bases?

https://github.com/rmcolq/pandora/pull/141

The final method and parameter set is the default method, with a minimum coverage of 2 and a minimum length of 1 bp for a run of per-base coverage below this threshold.
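To illustrate what those two parameters mean, here is a minimal sketch of a threshold-based scan over per-base coverage. It is not pandora's actual implementation (see the linked PR for that); the function name and the example coverage values are hypothetical, and anything pandora does beyond the basic thresholding is not modelled here.

```python
def low_coverage_regions(coverage, min_cov=2, min_len=1):
    """Return half-open (start, end) intervals where per-base coverage
    stays below min_cov for at least min_len consecutive bases."""
    regions = []
    start = None  # start of the current low-coverage run, if any
    for pos, depth in enumerate(coverage):
        if depth < min_cov:
            if start is None:
                start = pos  # a new low-coverage run begins here
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None  # the run has ended
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_len:
        regions.append((start, len(coverage)))
    return regions


# Hypothetical per-base coverage along a local sequence.
cov = [5, 4, 1, 0, 3, 6, 1, 7, 0, 0]
print(low_coverage_regions(cov))             # min_cov=2, min_len=1
print(low_coverage_regions(cov, min_len=2))  # stricter length requirement
```

With a minimum length of 1 bp, even a single base below the coverage threshold is enough to open a candidate region, which makes this a permissive setting.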

mbhall88 commented 4 years ago

The update MSAs rule ran 300 jobs in 40 minutes. The maximum memory used by a single job was 1.5 GB, and the longest-running job took 33 minutes.

mbhall88 commented 4 years ago

Running make_prg on all updated MSAs and combining all local PRGs into a single PRG took 20 minutes across 600 jobs. Max memory was ~1.5 GB.

iqbal-lab commented 4 years ago

On one or many cores?

mbhall88 commented 4 years ago

> On one or many cores?

Each job had 16 cores, so for each sample, 16 make_prg processes were running concurrently.

mbhall88 / head_to_head_pipeline

Call per-sample variants with pandora #46