i don't observe this bug. can you please simulate reads with insert sizes differing by >= 5 and try again?
here is a working example and the model behaves as it should:
pAi = [{'start' : 650, 'end' : 0, 'strand' : '+', 'is_tail' : True}]
f_size_1 = np.array([100])
f_prob_1 = np.array([1])
f_size_2 = np.array([95, 100])
f_prob_2 = np.array([0.1, 0.9])
reads = [592] * 100
Lrange = range(30, 50, 1)
print(estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_1, f_prob_1, False))
print(estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_2, f_prob_2, False))
the model seems to behave nicely if you also change the reads, i.e. reads = [592, 595] * 100
So this suggests a problem in the simulation then.
I'll investigate this further & push potential fixes to the simulation branch.
did you reproduce a nice behavior with the above script? i think it's nice how the predictions change by stacking up more reads...
I'll first finish off #64 but then I'll look into this in more detail.
Running the following script shows no shift between fixed & varied fragment sizes:
from estimate_length import *
import numpy as np
pAi = [{'start' : 650, 'end' : 0, 'strand' : '+', 'is_tail' : True}]
f_size_1 = np.array([100])
f_prob_1 = np.array([1])
f_size_2 = np.array([95, 100])
f_prob_2 = np.array([0.1, 0.9])
reads = [592] * 100
Lrange = tail_length_range(30, 50, 1)
probs = estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_1, f_prob_1, False)
est_pAlen = int(round(sum(Lrange * probs)))  # expected value
print(probs)
print(est_pAlen)
probs = estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_2, f_prob_2, False)
est_pAlen = int(round(sum(Lrange * probs)))  # expected value
print(probs)
print(est_pAlen)
However, it predicts 43:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8990342029304367, 0.09023065785136214, 0.00953609117678319, 0.0010588601503926005, 0.00012326392109555755, 1.5014056117568236e-05, 1.9099138122281647e-06]
43
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1002070435633935e-95, 1.5637274851488367e-96, 1.2434337170168975e-97, 1.0525339191030397e-98, 9.455679264074604e-100, 0.8990342029305111, 0.09023065785128753, 0.009536091176784521, 0.0010588601503916648, 0.00012326392109544861, 1.5014056117554116e-05, 1.9099138122284315e-06]
43
So is it 43, as estimated by estimate_poly_tail_length, or 592+100-650=42 after all? ;)
see https://github.com/rajewsky-lab/polyA/pull/63#issuecomment-226753733
Maybe the remaining deviations simply reflect the sampling noise of how I simulate the fragment lengths:
As of now I simply emit each length with the given probability. Instead, I could make this depend on previous emissions to ensure the bioanalyzer profile is reflected correctly:
For 100 reads with 10% 95nt & 90% 100nt I could simply stop generating 95nt fragments after doing so 10 times. Likewise, having only 8 95nt fragments after 98 rounds of simulation would force the last two fragments to have this length.
I think this will ensure that the profile put into estimate_poly_tail_length actually reflects the length distribution of the observed fragments, as we assume for the model.
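Something like the following minimal sketch could do that (a hypothetical helper assuming numpy, not the actual simulation code): fix the exact per-size counts implied by the profile up front and shuffle, instead of drawing each size independently:
import numpy as np

def simulate_fragment_sizes(f_size, f_prob, n_reads, seed=42):
    # exact count per fragment size implied by the profile
    counts = np.round(f_prob * n_reads).astype(int)
    counts[-1] = n_reads - counts[:-1].sum()  # absorb rounding error in the last bin
    # shuffle the resulting multiset so read order stays random
    return np.random.default_rng(seed).permutation(np.repeat(f_size, counts))

# exactly 10x 95nt & 90x 100nt for 100 reads
sizes = simulate_fragment_sizes(np.array([95, 100]), np.array([0.1, 0.9]), 100)
print(np.unique(sizes, return_counts=True))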
What do you think?
I adjusted the fragment size sampling, forcing the simulation to follow the given profile more closely (also for smaller read counts), with b224a66, but the unit test still fails. For 100 reads I didn't see a clear improvement, but it also didn't get worse afaict. I still like this approach more for the reasons mentioned before.
Simulating 1,000 reads/gene I get 42 in all cases in the 90/10 95nt/100nt setup, but it takes a very long time with the new code. I'll refactor that part to make it a bit faster (+ easier to read).
ok i was absorbed by the presentation. when you're somehow ready just assign me to test things out. i think we need to brainstorm at some point about the assumptions. also pcr duplicates are tricky: assume that 200 reads come from the same molecule and hence from a fixed length, and on top of that 40 + 40 + 40 + 40 + 40 reads from 5 different molecules. Obviously putting them all together and assuming a uniform hybridization site is a mistake... (which is what happens now in the pipeline)
Refactoring didn't improve the runtime but at least it's more readable now. I'll set a random seed to make all unit test runs comparable & increase the number of reads until the unit test passes. We should still think about a better way to assess correctness for the unit tests than expecting all expected values to be 42.
Setting the seed & increasing the number of reads/gene to 450 resulted in the unit test passing for the following cases:
Next, I'll
PS: The deviations in the probs after setting the seed seem to be related to rounding issues in the estimation code, as the simulated reads were identical in each run.
If I plug in the (discretized) bioanalyzer profile (replace f_size_sim/f_prob_sim with f_size/f_prob), the estimation shifts to 40 with p > 0.999.
So far I have no clue why that happens.
Could you maybe pull 20e2518fb5762b43860ebd75dc1c6194abd9fc14 from my branch to your fork & play around a bit, too?
i can test it tomorrow, but i have a good clue: the bioanalyzer profile is discretized in steps of 5 and 40 is the closest value to the real value. I guess it improves if you change the bin width?
Discretization in steps of 1 didn't change the result. Even the probability assigned to 40 is > 0.999 again.
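For reference, a minimal sketch of what such a bin-width discretization could look like (my assumption of how it works; the actual implementation is discretize_bioanalyzer_profile in estimate_length and may differ):
import numpy as np

def discretize(size, intensity, bin_width):
    binned = bin_width * np.round(size / bin_width)  # snap sizes to bin centers
    f_size = np.unique(binned)
    # sum the intensities per bin & renormalize to probabilities
    f_prob = np.array([intensity[binned == s].sum() for s in f_size])
    return f_size, f_prob / f_prob.sum()

print(discretize(np.arange(298, 307), np.ones(9), 5))  # toy profile, steps of 5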
While the following length distributions work:
f_size_sim = np.array([300, 400])
f_prob_sim = np.array([.1, .9])
and
f_size_sim = np.array([300, 400])
f_prob_sim = np.array([.5, .5])
The following distribution breaks the unit test:
f_size_sim = np.array([300, 320, 340, 350, 360, 380, 400])
f_prob_sim = np.array([.02, .12, .2, .3, .2, .13, .03])
I think this is a good example to narrow down the problem, as it works well enough for some genes, gets problematic for others, and completely breaks for yet another one:
[1.9969877340810926e-13, 2.8136545867080593e-10, 0.9509683892092384, 0.042675888079516264, 0.0060855196507041635, 0.0002549226666900081, 1.3050444390499167e-05, 1.3261754179595064e-06, 8.845800706808311e-07, 1.891240686283267e-08]
0.999012654723 4.15278663783e-12
[0.2730612417894287, 0.03820488042421049, 0.6754383724389108, 0.01301824387906198, 0.00025718976678013803, 1.878029496819847e-05, 9.031434771263687e-07, 3.0182635211973986e-07, 8.629115842671406e-08, 1.4565203546605682e-10]
0.922423603513 0.000144172525697
[1.2615019624356758e-12, 6.212140867719817e-13, 0.988341957031698, 0.01147686136138714, 0.0001671437081106384, 1.3969344914829683e-05, 6.70304052278555e-08, 1.4115423396444955e-09, 5.358535264895298e-11, 5.647368673262926e-11]
0.999933491623 8.55950766537e-17
[5.84929559327365e-13, 0.04275430900189669, 0.11257306130217708, 0.35461273382048475, 0.4865700322685153, 0.0034728360091119807, 1.6779905059307196e-05, 2.2997959177700293e-07, 1.176788323955293e-08, 5.9446949269363095e-09]
0.0251807979816 0.944951917747
These are the posterior probabilities for tail_length_range(40, 50, 1), followed by Pearson's r and the corresponding p-value on the next line, for 4 simulated genes with tail length 42 and the above fragment size distribution.
Note how the 1st one looks good, the 2nd is already much more problematic, the 3rd one much better again, & the 4th one is completely off, introducing a shift that causes the unit test to fail.
This was really surprising to me given that all these genes are simulated with identical parameters, so the huge deviations in tail length estimation must be caused by random differences between the samples.
This was done with 450 reads per gene but I'll test bigger sample sizes next.
This is what I get simulating 1000 reads per gene for the above fragments size distribution:
[1.9407078351651253e-17, 6.710958231196733e-12, 0.9591893598014722, 0.039974036912426174, 0.0008338967817822885, 2.705049666523851e-06, 1.4465028651225567e-09, 1.4374977149895275e-12, 1.391907033583953e-15, 8.669026579146442e-20]
0.999139424727 2.39709335636e-12
[7.961801741572024e-06, 6.118239856596728e-05, 0.9892921586687393, 0.010624850470933412, 1.3832604954238043e-05, 1.393837962975556e-08, 1.1653771682288547e-10, 1.3779288676696453e-13, 1.0146790296561322e-14, 1.5247164506130794e-16]
0.999943017078 4.61240276628e-17
[5.725308666394131e-26, 8.255539323724488e-20, 0.9997162811649689, 0.00028274673577458195, 9.71990183227256e-07, 1.0902491896001067e-10, 4.828981436542037e-14, 3.996545797932851e-17, 2.227210888064498e-21, 5.619363994693017e-25]
0.999999960529 1.06189319255e-29
[1.1684775623810784e-05, 0.004993671849427283, 0.9931892000676807, 0.0018054172281961123, 2.6038063470840944e-08, 3.955397798842367e-11, 9.756406872844674e-13, 4.717831882648751e-13, 7.15444480069299e-15, 1.2812345342632306e-18]
0.999987002996 1.24837270461e-19
[1.386596059131873e-06, 0.0048596394717116105, 0.9951366093645703, 2.3581282082888647e-06, 6.438780633767851e-09, 6.691950071390118e-13, 8.846353547232302e-16, 1.1261028760558202e-17, 4.238794139237955e-19, 8.107649111612832e-20]
0.999988213141 8.44431146412e-20
[1.3859905992717268e-11, 0.00011586111122621953, 0.999833665320657, 5.024834221984415e-05, 2.2225599136541803e-07, 2.954964267210693e-09, 1.0810183976410995e-12, 3.8539806121387536e-16, 2.6268003913904463e-19, 1.2689994665759576e-22]
0.999999992845 1.14674280366e-32
[7.8465218062249285e-25, 2.231802888460719e-11, 0.9930469153479121, 0.00688876608736662, 6.399819229265967e-05, 3.2022058454095896e-07, 1.2951497893883887e-10, 1.1026387727034506e-14, 3.945861251813544e-17, 1.7599303805687152e-19]
0.999976253399 1.39114224122e-18
[2.9696832529379845e-13, 0.6799830183725007, 0.3199554019211117, 6.138827404338114e-05, 1.9143061251928764e-07, 1.4255202390388715e-12, 8.226342568964062e-15, 6.638749382697589e-16, 3.1581214960559337e-16, 5.778384470381298e-19]
0.340098284122 0.336279530103
So while increasing the sample size >2-fold improved the robustness of the predictions, we still get clear outliers at this depth, even for this relatively simple distribution.
Also note that while in the previous case (450 reads/gene) we over-estimated the tail length in the case of interest, here (1000 reads/gene) we under-estimate it. So I don't think that there is any directionality in this effect (note that the input distribution is slightly skewed towards longer fragments).
I committed the switch to the above fragment size distribution incl. the necessary increase of the number of simulated reads per gene from 450 to 1500 (https://github.com/rajewsky-lab/polyA/pull/77/commits/02dae4d534f99ba1feaeb94c0c789c095a9ea77d).
Note, however, that with just 50 reads fewer (1450 reads/gene) I got the following:
[2.3843768140684402e-27, 2.830751884868016e-13, 0.9970254271913563, 0.0029676234492711554, 6.949136966396309e-06, 2.2210188216486656e-10, 2.1131606113383888e-14, 2.0351379352713304e-18, 4.264777298204097e-23, 2.1862717601872716e-28]
0.999995624645 1.60335207612e-21
[6.3289646781878336e-15, 0.007491590761959842, 0.9925046409156311, 3.724779700591985e-06, 4.3542531867917975e-08, 1.70258778376353e-13, 1.488638556536358e-17, 5.442559879350971e-21, 7.09733039373871e-22, 1.0984823949740132e-27]
0.999971821735 2.75816580933e-18
[1.8171230830746176e-05, 0.0020144330684920566, 0.9972493664259046, 0.0007180284540947389, 8.206765391738758e-10, 1.3291595564133472e-15, 6.984092003962837e-18, 7.073825978590545e-21, 2.1704556987140163e-24, 2.1098280615172175e-28]
0.999997913287 8.29521542483e-23
[6.109485921774387e-11, 0.006801216396084559, 0.9931987802870931, 3.2557087371127057e-09, 1.8738614701913064e-14, 3.478432042264354e-21, 2.814396108234619e-26, 2.5311332142674436e-29, 2.6928324318057627e-32, 1.5445634834233634e-33]
0.999976808877 1.26547353598e-18
[3.3166196248976037e-22, 1.598691247396427e-19, 0.9997361699155801, 0.00026382585095016653, 4.23283600558462e-09, 6.337155753994707e-13, 2.691749902106613e-18, 4.931287704373248e-20, 2.641348421100822e-25, 5.424738025226171e-31]
0.999999965608 6.12108335758e-30
[5.046734670378583e-18, 2.811779674324627e-13, 0.9999999951039438, 4.8956459512295975e-09, 1.2787374535361536e-13, 1.2697029117045974e-15, 1.6486849359036593e-20, 7.278484175744991e-25, 2.467200950927863e-29, 7.773832442675696e-35]
1.0 0.0
[2.8307212303665613e-30, 3.0717115947291643e-24, 0.9994732077176162, 0.0005267899869363544, 2.294108685575821e-09, 1.3388015995174421e-12, 6.551865550055873e-18, 4.5344247032420313e-23, 2.4098031735777854e-27, 3.0775618682611796e-31]
0.999999862799 1.55027926427e-27
[8.654291975308754e-11, 1.5669480614264665e-10, 0.9999999987327134, 1.0225690819976957e-09, 1.4797219482772247e-12, 4.714705261060162e-17, 2.252461401068581e-20, 1.453118649138929e-26, 3.001703823537954e-30, 3.675818481538593e-37]
1.0 0.0
[1.9945467615726222e-23, 3.615742873767955e-15, 0.9999980887603138, 1.911239668935829e-06, 1.361574456387827e-14, 1.5764101206174318e-20, 4.322518999820765e-25, 9.929587876501691e-26, 6.404383082838758e-32, 4.598567481115981e-34]
0.999999999998 4.63025353724e-47
[2.04148338818621e-14, 4.242774506533167e-12, 0.9999946302146393, 5.3697779023597174e-06, 3.1949359764181505e-12, 1.5770484040145607e-16, 1.9266151486215462e-22, 5.6560285663419295e-24, 8.440142812726995e-27, 2.3929317598600458e-32]
0.999999999986 1.79863601749e-43
[0.0004670136287759296, 0.857130971247841, 0.14240180357241863, 2.1155044843357215e-07, 5.160522216443398e-13, 2.350270232296246e-19, 8.428921779277911e-25, 3.7767139856410786e-27, 9.900522334323823e-32, 1.7763925838511122e-34]
0.0552278759741 0.879556835213
So my feeling is that 1500 reads/gene are sufficient for this exact case simulated with the given seed, but in any real example this could still easily break.
To me this is not satisfactory and while I could plug in the real bioanalyzer profile and increase the number of simulated reads until the test passes, I think it would be more helpful to figure out if
Any ideas/alternative hypotheses?
thanks, very helpful. what was the difference between the coverage for the gene that failed? i feel that with a low number of reads (i.e. ~200+, but i don't know exactly) maybe our precision is within the bioanalyzer values, for this case ±20nt. let's check tomorrow two things:
Switching to
f_size_sim = np.array([300, 310, 320, 330, 340, 350, 360])
f_prob_sim = np.array([.02, .12, .2, .3, .2, .13, .03])
resulted in all tests passing with 1500 reads/gene (as expected), but going down to 1450 reads/gene made the test fail again:
[1.5378522916900327e-29, 3.669525313837876e-14, 0.9999899915065682, 9.958875123648815e-06, 4.950991029928015e-08, 1.0828486583850624e-10, 7.623879973092766e-14, 4.2152120539788375e-18, 2.316091228303865e-19, 5.935874718802714e-23]
0.999999999951 2.50539266229e-41
[1.1350016620526029e-16, 0.028790096361439876, 0.9674989468558134, 0.003708678984166392, 2.2777661237221253e-06, 3.2450028568239055e-11, 6.4936303096293416e-15, 4.624455977273954e-18, 1.502430985215576e-20, 4.23973088597953e-24]
0.999566605823 1.54271059481e-13
[0.00016442186875497013, 0.0017845470107763135, 0.9977812008030099, 0.0002687072506299273, 1.1230475423355605e-06, 1.9272801196925463e-11, 1.0974269024454015e-14, 2.8015933409203463e-15, 2.3656908529118917e-18, 3.3869607228983657e-21]
0.999998472012 2.38482601718e-23
[5.788774822920124e-13, 0.00858015026713982, 0.9914198477029613, 2.0290499907677343e-09, 2.6998454971518767e-13, 2.1670031037034664e-18, 4.962466713522687e-23, 2.3130521922903195e-25, 3.7733326584714716e-29, 7.21663317286263e-28]
0.999962943789 8.24902843167e-18
[2.415464971574151e-21, 4.931623025378152e-19, 0.9965189267192426, 0.0034802663107232658, 8.068012973827896e-07, 1.6873535132512604e-10, 1.4225161483059841e-15, 6.747462612427343e-19, 4.2229789889382285e-24, 1.0555729385167392e-27]
0.999993972504 5.77461035783e-21
[3.719231828697996e-18, 5.165881548701048e-15, 0.99999969990882, 2.9506216698350924e-07, 4.9845638902616535e-09, 4.442956744078232e-11, 1.4295576212244435e-14, 6.681018866872057e-17, 5.852532141740759e-19, 2.0027160110841483e-22]
1.0 1.50641932711e-53
[1.2564969096047024e-29, 2.771843014711505e-26, 0.9999903637237828, 9.636274253457086e-06, 1.962497369252916e-12, 1.2699640246446983e-15, 4.4434522141382307e-20, 2.1859822415813603e-24, 3.241914095893032e-27, 7.404472113802038e-33]
0.999999999954 1.93460199173e-41
[1.2062635833458325e-12, 7.593731894903284e-12, 0.9999999996859834, 3.051694178789472e-10, 4.7107730163674816e-14, 3.334564043460014e-17, 7.882865341674034e-21, 7.518039931887295e-26, 1.6426175865422349e-28, 4.431324464962173e-33]
1.0 0.0
[8.721557708691834e-24, 3.457339882393091e-17, 0.9999976364854274, 2.363513699040845e-06, 8.735847769554552e-13, 5.722558239920193e-18, 5.513898607827274e-20, 3.034019970204752e-22, 8.209845107295205e-26, 7.90245149976581e-28]
0.999999999997 2.53346446146e-46
[7.921572231974481e-16, 2.0345611715775155e-11, 0.9999874202800754, 1.2579622139875168e-05, 7.732006716721334e-11, 1.1825726298588344e-13, 2.4812730023893458e-17, 1.4494684958882704e-18, 4.701456779935223e-19, 5.587329315390636e-23]
0.999999999922 1.63179389996e-40
[8.276924346707598e-07, 0.836860388156785, 0.1631198795893221, 1.8904424802964442e-05, 1.3665201787050077e-10, 3.308325212414529e-15, 6.787152812159497e-22, 2.1375350006197484e-25, 9.396963821562087e-31, 7.917798172126044e-33]
0.0840292985015 0.817478317301
Note that it breaks for the same gene with a similar posterior probability distribution.
Here are the results for the following distribution:
f_size_sim = np.array([300, 301, 302, 303, 304, 305, 306])
f_prob_sim = np.array([.02, .12, .2, .3, .2, .13, .03])
1500 reads/gene worked again, 1450 reads/gene fails:
[1.0206594870242274e-36, 4.990426692267784e-09, 0.9999998819418184, 1.1306775491181624e-07, 8.204090795798328e-19, 1.1494196600629723e-31, 3.818618368338791e-45, 1.3421997957701662e-58, 7.640912851599131e-72, 7.918144073093456e-85]
1.0 7.52195880888e-57
[9.121922062668281e-50, 0.009270834931176938, 0.9907288978384194, 2.672304035641155e-07, 2.9475488745752995e-18, 3.17457821305193e-31, 1.0393719354764074e-44, 3.362373703430295e-58, 1.856717337287089e-71, 1.9240836357101977e-84]
0.999956671341 1.54190091581e-17
[4.2120030864984624e-35, 0.00538622007762105, 0.9946136491912408, 1.3073113813870303e-07, 1.0309673716121863e-18, 8.478630412871737e-32, 2.0817365283866143e-45, 6.336423935092587e-59, 3.499000769816702e-72, 3.625953174092908e-85]
0.999985500766 1.9335292232e-19
[2.766099682902426e-64, 1.943757116306812e-05, 0.999976031721393, 4.530707443565408e-06, 2.875117259526414e-16, 7.069200471150592e-29, 2.928414157672681e-42, 9.766428117703482e-56, 5.393063950915207e-69, 5.588737653231438e-82]
0.999999999814 5.21971772449e-39
[3.1653056287750067e-99, 2.607154379901798e-16, 0.9998068181422705, 0.0001931818575855522, 1.4372948576180979e-13, 1.0702051917485102e-25, 8.177510549475042e-39, 3.45065562240439e-52, 1.9643989646476904e-65, 2.035672218909505e-78]
0.999999981563 5.05538714716e-31
[2.23463706359158e-55, 3.1595808346417124e-21, 0.9999667322838613, 3.3267716135376706e-05, 3.2714345223330387e-15, 1.3104917526683922e-27, 5.89100655583306e-41, 2.3910731648811256e-54, 1.403295747291699e-67, 1.4542107886892996e-80]
0.999999999453 3.90472581211e-37
[3.0285123176383265e-83, 1.142207863102893e-23, 0.9993363403292077, 0.0006636596703005937, 4.916736182253376e-13, 4.4531866862525275e-25, 3.082502663278678e-38, 1.370855304973881e-51, 8.045405920146746e-65, 8.337313150886457e-78]
0.999999782176 9.84926030298e-27
[6.365781544786275e-88, 5.749243218232794e-17, 0.9986252424971576, 0.0013747575015723155, 1.2700275487421477e-12, 1.1271921098629669e-24, 7.521568793255464e-38, 2.986292929675179e-51, 1.7000452612854683e-64, 1.761727109190713e-77]
0.999999063831 3.36043090079e-24
[1.905916785034272e-62, 3.065101708546079e-13, 0.9999911275759303, 8.872423762473502e-06, 6.453072763461235e-16, 1.6692399159435738e-28, 7.173036290420014e-42, 2.679603094474846e-55, 1.5254520069392933e-68, 1.5807991796448723e-81]
0.999999999961 9.99195300969e-42
[1.9943853055312862e-42, 1.1096241935552741e-10, 0.9999990873001466, 9.125888908901918e-07, 1.6666146104261207e-17, 3.309213411060597e-30, 1.1684467992916388e-43, 4.1069541496528735e-57, 2.3380184412094812e-70, 2.42284753439568e-83]
1.0 1.25113129857e-49
[2.452391768506468e-28, 0.7470350085783584, 0.25296498583907384, 5.582567820111438e-09, 2.3785377281381264e-20, 2.107549892670725e-33, 5.499642435611269e-47, 1.6739902234649142e-60, 9.243846593230063e-74, 9.579236216431676e-87]
0.223158610983 0.535432991927
Again it is the same gene, and again the profile is slightly better (as in lower certainty for the wrong answer).
Changing the random seed from 42 to 2 made all tests pass for the above distribution with 1450 reads/gene.
With the new seed I can decrease the coverage down to 1050 reads/gene; with 1000 reads/gene it fails:
[2.3423397125434093e-49, 8.3842653780764e-07, 0.9930133865582756, 0.006985772598874615, 2.416311876962356e-09, 7.142909811719875e-18, 4.249298719014421e-27, 2.128020490573492e-36, 1.529665125994679e-45, 1.6972808555802302e-54]
0.999975523803 1.57014962066e-18
[1.9337275298657052e-13, 0.07511599710141476, 0.9248708412234953, 1.316167468067067e-05, 2.15942271880366e-13, 1.8335154056794553e-22, 6.464940277963904e-32, 2.9548692132656585e-41, 2.1240210831823763e-50, 2.3567644055361504e-59]
0.996699500152 5.17101523477e-10
[5.445906420424138e-36, 0.02757136187327696, 0.9724121552476298, 1.6482878623867444e-05, 4.693166517434825e-13, 5.628028736117595e-22, 2.8751272662183716e-31, 1.439844581444902e-40, 1.0349900543007076e-49, 1.1484008983578323e-58]
0.999600783826 1.11071465112e-13
[5.114366687613752e-27, 0.12226304057112058, 0.877726041780067, 1.0917648569708899e-05, 2.427864535138854e-13, 2.6408041615304474e-22, 1.2044056392971851e-31, 5.850635710496887e-41, 4.2055579121056666e-50, 4.666389270402796e-59]
0.990258959039 3.8932641796e-08
[8.375389879397283e-64, 7.075302713339767e-07, 0.9947185581334966, 0.0052807314836455605, 2.852586511696885e-09, 1.2860165676311315e-17, 1.0201723668924354e-26, 5.429859976318744e-36, 3.903095614729941e-45, 4.330784138177e-54]
0.999986066777 1.64883492031e-19
[9.633196784513316e-82, 1.1603833285547302e-38, 0.7968753347020827, 0.20312402015980108, 6.451381046718425e-07, 1.1477076447406645e-14, 1.890450048930908e-23, 1.735488771503617e-32, 1.40914417980559e-41, 1.5635536160773092e-50]
0.967655921487 4.60468904166e-06
[8.382249820486311e-48, 1.4638318360465356e-14, 0.965785133464737, 0.034214838195987224, 2.8339260909679074e-08, 1.7027616894802297e-16, 1.2236560491630281e-25, 6.864076056473839e-35, 5.086638954579226e-44, 5.644016307975885e-53]
0.99937588889 6.63285370216e-13
[6.2849796963226476e-58, 4.862565208109922e-13, 0.9988097028749195, 0.0011902968134060556, 3.1118809070461324e-10, 1.041421987261419e-18, 8.0634197848745755e-28, 4.807270132733796e-37, 3.562438312332848e-46, 3.952798716499189e-55]
0.999999298489 1.05953636234e-24
[7.11340159570969e-64, 1.716639918101832e-15, 0.9796402406605936, 0.020359736425409575, 2.2913994939366726e-08, 1.2655734168175305e-16, 1.0414451523219336e-25, 6.20891421903753e-35, 4.601129805723843e-44, 5.105306645607496e-53]
0.999785782756 9.2105193862e-15
[8.910644743732078e-50, 2.6750750749879505e-12, 0.9976296398645251, 0.002370359635077837, 4.977219998852762e-10, 2.105502109431252e-18, 1.6439821156505104e-27, 9.507100391814042e-37, 7.045258065360284e-46, 7.817254530907293e-55]
0.999997210719 2.64816886716e-22
[1.4427794928882686e-40, 0.0007630465875072801, 0.99915589382275, 8.105958560408193e-05, 4.138623843588316e-12, 5.437898477731183e-21, 2.7779975755840957e-30, 1.3912026794316578e-39, 1.0000252494501717e-48, 1.1096047639076404e-57]
0.999999716334 2.83275778025e-26
[2.204686925545302e-92, 1.5238860703666414e-15, 0.9898238627751363, 0.010176116790408174, 2.0434453828089074e-08, 1.5689439418836433e-16, 2.0324188551654957e-25, 1.368691389895836e-34, 1.0142718238198925e-43, 1.125412431563754e-52]
0.999947690356 3.27550188599e-17
[1.453907261117179e-34, 1.2726955745451956e-09, 0.9999411992894223, 5.879943555266249e-05, 2.3294591415902308e-12, 4.352944215907435e-21, 2.60566751287584e-30, 1.4616443892475021e-39, 1.0831548524407326e-48, 1.201843438431901e-57]
0.999999998292 3.71946777321e-35
[4.321251740151457e-20, 0.0009715682422631756, 0.9985014930389642, 0.0005269386901857407, 2.858685692938234e-11, 4.0609399419681694e-20, 1.6174079875330156e-29, 7.392533978919659e-39, 5.3139062666015304e-48, 5.896186832906762e-57]
0.999999458138 3.77165252897e-25
[1.274890156463695e-63, 6.671366683422388e-07, 0.9942764999559437, 0.005722829610455318, 3.296932600589564e-09, 1.0903188651831562e-17, 7.893974541810835e-27, 4.2015621878883575e-36, 3.020169769034755e-45, 3.3511101498451015e-54]
0.999983619973 3.14940074231e-19
[6.738637145865925e-41, 0.000831614108024159, 0.9986297656506321, 0.0005386201629445999, 7.839915511395989e-11, 2.2292378807588647e-19, 1.36718339718349e-28, 6.846763374278688e-38, 4.921595072039277e-47, 5.4608874535534965e-56]
0.999999569202 1.50686177938e-25
[1.1152782420863414e-55, 3.863566672452779e-12, 0.9991313072885134, 0.0008686925068164793, 2.0080651850190894e-10, 7.933291543720812e-19, 6.528335622706979e-28, 3.892079749386851e-37, 2.88423442640733e-46, 3.200279454474283e-55]
0.999999626625 8.50276933013e-26
[1.0637639406113212e-95, 5.6499686656525516e-33, 0.3887716585913803, 0.6112212372699878, 7.104138408452432e-06, 2.2345664366529188e-13, 4.354658639629056e-22, 3.679381338625097e-31, 2.897878405917064e-40, 3.2154184968829966e-49]
0.467061842192 0.173525148793
Note that this time it's another gene and we over-estimate by one instead of under-estimating as with the previous seed.
I'll have a closer look at the offending genes for the two cases described here.
Switching eps to 10^(-50) results in 40s & 50s for the fragment size distribution in steps of 30! I'll commit this as a hack, open a new issue to get rid of eps eventually & push towards finally getting the real bioanalyzer profile working.
eps = 10e-50 results in 100% probability for 2 (I'm testing tail_range(2, 103, 10) to include the exact correct answer among the tested ones, to avoid re-writing the unit test for now). This was with the discretized bioanalyzer profile (steps of 5) for simulation. I'll test eps = 10e-200 next.
eps = 10e-200 failed, too. I'll try eps = 10e-500 next. If that doesn't work, I guess we'll have to resolve #79 first.
the last two comments refer to the "real" bioanalyzer profile, i should assume?
update: how many reads do you generate? it shouldn't break with a few reads...
Yes. And with eps = 10e-500 I get estimate_length.py:371: RuntimeWarning: divide by zero encountered in log, but still the estimated probabilities are [1.0, 0.0, 0.0, ..., 0.0].
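That warning makes sense, I think: the literal 10e-500 underflows to 0.0 in double precision, so the code ends up taking np.log(0.0):
import numpy as np

eps = 10e-500       # underflows: far below the smallest positive double (~5e-324)
print(eps == 0.0)   # True
print(np.log(eps))  # -inf, emitting "RuntimeWarning: divide by zero encountered in log"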
I suggest we fix #79 in a separate branch (based on current master) and, when all tests still pass without the usage of eps, we'll merge the fix to master, from there to the simulation branch, & try again with the real bioanalyzer profile.
Do you want me to tackle #79 or do you prefer to do it yourself?
ok, it should anyway be straightforward. If you're in the fixing mood please go ahead, otherwise assign me and i'll try to fix it today.
So now it gets interesting:
After pulling in 7db4854, steps of 1, 10 & 20 work as before: mainly 40s, some 50s, maybe a 30 somewhere. For steps of 30 I don't get 10s anymore, but only 2 40s and the rest are 20s (100%). This was with eps = 10e-12 & 1,500 reads/gene.
Switching to eps = 10e-20 resulted in 40s & 50s everywhere again.
I just double-checked: undoing 7db4854, I get 10s everywhere; re-applying it gives me the 20s + two 40s.
ok so we miss something in the bugfix... i'll try to find out what
I pulled in e5b9a2f, but I still get the same result (20s + two 40s) for steps of 30 with eps = 10e-12.
As expected (I guess), setting eps = 10e-20 results in many 40s and a few 50s.
There's still a single 20, though.
Setting eps = 10e-25 removes the last 20 and increases the number of 50s compared to 40s.
I don't see how the results can still depend on eps after 7db4854 & e5b9a2f.
Running your bug_revealed.py script, I now get 0.0 for 10 and 20 up to 1,000 reads.
At 1,250 reads ('Oops...'), I don't get 100% 10 anymore (still with eps = 10e-12), but I have a small non-zero probability for 20 again.
Maybe that helps to figure out what is going on?
If I start a python3 session on my machine, 10e-324 == 0 is False while 10e-325 == 0 is True (10e-325 = 1e-324 underflows below the smallest positive subnormal double, ~5e-324).
Just collecting random information that might help.
I tried to set all _preal < eps to _pused = eps while flagging them as zero.
Thus, every tail length that has a _preal < eps for a single read will be set to zero when leaving log space.
Still, I get the 20s as before.
This was done with eps = 10e-12.
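In code, the idea was roughly this (a toy sketch with made-up array names & values, not the actual estimation code):
import numpy as np

eps = 10e-12
p = np.array([[0.4, 1e-30],  # per-read probabilities per tail length (toy values)
              [0.6, 0.5]])
log_post = np.log(np.maximum(p, eps)).sum(axis=0)  # clamp to eps, combine reads in log space
post = np.exp(log_post - log_post.max())
post[(p < eps).any(axis=0)] = 0.0  # flag: zero out tail lengths that needed clamping
print(post / post.sum())           # [1. 0.]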
I might have found the offending line but I'll make sure by testing before I tell you. ;)
Embarrassingly enough, with 1a09296 I get only 40s & 50s for steps of 30 with eps = 10e-12.
thanks a lot! frankly speaking, when i saw you writing this i found it strange that python accepts this :) what about the other tests?
I just wasn't thinking Python for a second. It makes perfect sense that this is not valid exponentiation in python. But I would have expected an error or at least a warning.
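For the record (assuming the offending line used ^ for exponentiation, like the 10^(-50) written above): in python, ^ is bitwise XOR on ints, so the expression is perfectly legal and runs silently:
print(10 ^ -12)   # -2: bitwise XOR of the ints 10 and -12 -- no error, no warning
print(10 ** -12)  # 1e-12: actual exponentiation
print(10e-12)     # 1e-11: note that scientific notation means 10 * 10**-12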
Other tests running.
I'll go through the checklist of #81 now.
Switching to the real bioanalyzer profile (discretized in steps of 5) still results in 100% for the shortest tail length.
As I have no clue what else it could be, I guess I'll finish off #81 so we can merge that into master (as it clearly improves the implementation) before we (again) start over to make more & more complex test cases to see where it breaks.
man ey... ok thanks. i'll play locally to see if i can come up w sth else
ok check the following (might want to rename the bioanalyzer profile)
from estimate_length import *
import numpy as np
import os

pAi = [{'start' : 1000, 'end' : 0, 'strand' : '+', 'is_tail' : True}]
bio_size = []
bio_intensity = []
with open(os.path.join('test_data', 'ds_012_50fix_bioanalyzer.txt'), 'r') as f:
    for line in f:
        bio_size.append(int(line.split()[0]))
        bio_intensity.append(float(line.split()[1]))
f_size, f_prob = discretize_bioanalyzer_profile(np.array(bio_size), np.array(bio_intensity), 150)
# simple and naive uniform reads simulation
reads = np.array([])
for offset in range(1000, 1080):
    reads = np.concatenate((reads, np.repeat(offset - f_size, np.round(f_prob * 10).astype(int))))
Lrange = tail_length_range(10, 500, 20)
print(estimate_poly_tail_length(list(reads), Lrange, pAi, 0, f_size, f_prob, False))
with this tail range, the answer is 10
with a range
Lrange = tail_length_range(10, 500, 10)
the answer is 80! (aka the correct one)
with a range
Lrange = tail_length_range(10, 100, 10)
it is also correct. Same with ranges that include 80. For all the others, which don't include it, the answer comes up as 10...
as a quick test, can you check your simulations, even with a few reads, but with tail range range(10, 45, 1)? and with the real profile
7308c7d finally introduced simulating fragment sizes from the (discretized) bioanalyzer profile. I'll increase the number of reads per gene to make it pass the stringent unit test.
Then we'll be able to close this one (finally).
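As a rough sketch of what that simulation step looks like to me (the rng-based draw below is my assumption, see 7308c7d for the actual code):
import numpy as np

f_size = np.array([300, 320, 340, 350, 360, 380, 400])    # discretized profile sizes
f_prob = np.array([.02, .12, .2, .3, .2, .13, .03])       # & probabilities
rng = np.random.default_rng(42)
fragment_sizes = rng.choice(f_size, size=1500, p=f_prob)  # 1500 reads/gene
reads = 650 + 42 - fragment_sizes  # read start = tail start + tail length - fragment size (cf. 592+100-650=42)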
Once #93 is merged & propagated into the master branch, 0d8f3bb will close this issue for good.
Next, I'd like to cross #90 off the list.
see https://github.com/rajewsky-lab/polyA/pull/63#issue-160419099:
Simulating reads for a poly(A) tail of length 42 with a fixed fragment size results in an expected value of 42 for the estimated poly(A) tail length. Simulating with varying fragment sizes results in a shift of the expected value for the estimated poly(A) tail length to 43. Any ideas why that could be?
Note that currently in the simulations all oligo-d(T)s anchor at the 3' end of the poly(A)-tail, while the model assumes a uniform distribution across the whole poly(A)-tail.
Could this be related?