rajewsky-lab / polyA


shift in estimations from simulated data #66

Closed mschilli87 closed 8 years ago

mschilli87 commented 8 years ago

see https://github.com/rajewsky-lab/polyA/pull/63#issue-160419099:

Any ideas why that could be?


Note that currently in the simulations all oligo-d(T)s anchor in the 3' end of the poly(A)-tail while the model assumes a uniform distribution across the whole poly(A)-tail.

Could this be related?

nukappa commented 8 years ago

i don't observe this bug. can you please simulate reads with insert sizes differing by >= 5 and try again?

here is a working example and the model behaves as it should:

pAi = [{'start' : 650, 'end' : 0, 'strand' : '+', 'is_tail' : True}]

f_size_1 = np.array([100])
f_prob_1 = np.array([1])
f_size_2 = np.array([95, 100])
f_prob_2 = np.array([0.1, 0.9])

reads = [592] * 100

Lrange = range(30, 50, 1)

print(estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_1, f_prob_1, False))
print(estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_2, f_prob_2, False))

nukappa commented 8 years ago

the model seems to behave nicely if you change also the reads, i.e. reads = [592, 595] * 100

mschilli87 commented 8 years ago

So this suggests a problem in the simulation then. I'll investigate this further & push potential fixes to the simulation branch.

nukappa commented 8 years ago

did you reproduce a nice behavior with the above script? i think it's nice how the predictions change by stacking up more reads...

mschilli87 commented 8 years ago

I'll first finish off #64 but then I'll look into this in more detail.

mschilli87 commented 8 years ago

Running the following script shows no shift between fixed & varied fragment sizes:

from estimate_length import *
import numpy as np

pAi = [{'start' : 650, 'end' : 0, 'strand' : '+', 'is_tail' : True}]

f_size_1 = np.array([100])
f_prob_1 = np.array([1])
f_size_2 = np.array([95, 100])
f_prob_2 = np.array([0.1, 0.9])

reads = [592] * 100 

Lrange = tail_length_range(30, 50, 1)

probs = estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_1, f_prob_1, False)
est_pAlen = int(round(sum(Lrange * probs)))  # expected value
print(probs)
print(est_pAlen)

probs = estimate_poly_tail_length(reads, Lrange, pAi, 0, f_size_2, f_prob_2, False)
est_pAlen = int(round(sum(Lrange * probs)))  # expected value
print(probs)
print(est_pAlen)

However, it predicts 43:

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8990342029304367, 0.09023065785136214, 0.00953609117678319, 0.0010588601503926005, 0.00012326392109555755, 1.5014056117568236e-05, 1.9099138122281647e-06]
43
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1002070435633935e-95, 1.5637274851488367e-96, 1.2434337170168975e-97, 1.0525339191030397e-98, 9.455679264074604e-100, 0.8990342029305111, 0.09023065785128753, 0.009536091176784521, 0.0010588601503916648, 0.00012326392109544861, 1.5014056117554116e-05, 1.9099138122284315e-06]
43

So is...

  1. ...there an off-by-one in estimate_poly_tail_length, or
  2. ...43 the correct answer for these data?

592+100-650=42 after all. ;)
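For clarity, the arithmetic behind that expected answer, spelled out (variable names are mine, based on the pAi dict above):

```python
# A read starting at 592 with a 100 nt fragment reaches position 692;
# with the poly(A) tail starting at 650, the fragment extends
# 692 - 650 = 42 nt into the tail.
read_start, fragment_size, tail_start = 592, 100, 650
expected_tail_length = read_start + fragment_size - tail_start
print(expected_tail_length)  # 42
```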

mschilli87 commented 8 years ago

see https://github.com/rajewsky-lab/polyA/pull/63#issuecomment-226753733

Maybe the remaining deviations simply reflect the sampling noise of how I simulate the fragment lengths: as of now I simply emit each length with the given probability. Instead, I could make this depend on previous emissions to ensure the bioanalyzer profile is reflected correctly: for 100 reads with 10% 95 nt & 90% 100 nt fragments, I could simply stop generating 95 nt fragments after doing so 10 times. Likewise, having only 8 95 nt fragments after 98 rounds of simulation would force the last two fragments to have this length. I think this will ensure that the profile put into estimate_poly_tail_length actually reflects the length distribution of the observed fragments, as the model assumes. What do you think?
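The quota idea above could be sketched roughly like this (hypothetical `sample_fragment_sizes` helper, not the project's actual simulation code):

```python
import numpy as np

def sample_fragment_sizes(f_size, f_prob, n_reads, seed=None):
    """Emit each fragment size exactly round(p * n_reads) times, so the
    sampled lengths match the given profile even for small read counts."""
    counts = np.round(np.asarray(f_prob) * n_reads).astype(int)
    counts[-1] += n_reads - counts.sum()  # absorb rounding drift in the last bin
    sizes = np.repeat(np.asarray(f_size), counts)
    np.random.default_rng(seed).shuffle(sizes)  # randomize only the order
    return sizes

# the 95/100 nt example from above: exactly 10 fragments of 95 nt
# and 90 of 100 nt, in random order
sizes = sample_fragment_sizes([95, 100], [0.1, 0.9], 100, seed=42)
```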

mschilli87 commented 8 years ago

I adjusted the fragment size sampling to force the simulation to follow the given profile more closely (also for smaller read counts) with b224a66, but the unit test still fails. For 100 reads I didn't see a clear improvement, but it also didn't get worse afaict. I still like this approach more for the reasons mentioned before.

mschilli87 commented 8 years ago

Simulating 1,000 reads/gene I get 42 in all cases in the 90/100-95/100nt setup, but it takes very long with the new code. I'll refactor that part to make it a bit faster (+ easier to read).

nukappa commented 8 years ago

ok i was absorbed by the presentation. when you're somehow ready just assign me to test things out. i think we need to brainstorm at some point about the assumptions. also pcr duplicates are tricky: assume that 200 reads come from the same molecule and hence from a fixed length, and on top of that 40 + 40 + 40 + 40 + 40 reads from 4 different molecules. Obviously putting them all together and assuming uniform hybridization site is a mistake...(which is what happens now in the pipeline)

mschilli87 commented 8 years ago

Refactoring didn't improve the runtime but at least it's more readable now. I'll set a random seed to make all unit test runs comparable & increase the number of reads until the unit test passes. We should still think about a better way to assess correctness for the unit tests than expecting all expected values to be 42.

mschilli87 commented 8 years ago

Setting the seed & increasing the number of reads/gene to 450 resulted in the unit test passing for the following cases:

Next, I'll

PS: The deviations in the probs after setting the seed seem to be related to rounding issues in the estimation code, as the reads simulated were identical in each run.

mschilli87 commented 8 years ago

If I plug in the (discretized) bioanalyzer profile (replace f_size_sim/f_prob_sim with f_size/f_prob), the estimation shifts to 40 with p > 0.999. So far I have no clue why that happens. Could you maybe pull 20e2518fb5762b43860ebd75dc1c6194abd9fc14 from my branch to your fork & play around a bit, too?

nukappa commented 8 years ago

i can test it tomorrow, but i have a good clue: the bioanalyzer profile is discretized in steps of 5 and 40 is the closest value to the real value. I guess it improves if you change the bin width?
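For reference, binning a continuous trace could look roughly like this (a sketch only; the project's actual discretize_bioanalyzer_profile may bin and normalize differently):

```python
import numpy as np

def discretize_profile(sizes, intensities, bin_size):
    """Collapse a continuous (size, intensity) trace into bins of width
    bin_size and renormalize the intensities into probabilities."""
    sizes = np.asarray(sizes)
    intensities = np.asarray(intensities, dtype=float)
    mids = (sizes // bin_size) * bin_size + bin_size // 2  # bin midpoints
    f_size = np.unique(mids)
    f_prob = np.array([intensities[mids == m].sum() for m in f_size])
    return f_size, f_prob / f_prob.sum()

# a flat toy profile over sizes 90..109, binned in steps of 5
f_size, f_prob = discretize_profile(range(90, 110), np.ones(20), 5)
```

With a coarser bin width, the candidate fragment sizes get sparser, which is why the estimate can snap to the nearest representable value.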

mschilli87 commented 8 years ago

Discretization in steps of 1 didn't change the result. Even the probability assigned to 40 is > 0.999 again.

mschilli87 commented 8 years ago

While the following length distributions work:

f_size_sim = np.array([300, 400])
f_prob_sim = np.array([.1, .9])

and

f_size_sim = np.array([300, 400])
f_prob_sim = np.array([.5, .5])

The following distribution breaks the unit test:

f_size_sim = np.array([300, 320, 340, 350, 360, 380, 400])
f_prob_sim = np.array([.02, .12, .2, .3, .2, .13, .03])

I think this is a good example to narrow down the problem, as it works well enough for some genes, gets problematic for others and completely breaks for yet another one:

[1.9969877340810926e-13, 2.8136545867080593e-10, 0.9509683892092384, 0.042675888079516264, 0.0060855196507041635, 0.0002549226666900081, 1.3050444390499167e-05, 1.3261754179595064e-06, 8.845800706808311e-07, 1.891240686283267e-08]
0.999012654723 4.15278663783e-12
[0.2730612417894287, 0.03820488042421049, 0.6754383724389108, 0.01301824387906198, 0.00025718976678013803, 1.878029496819847e-05, 9.031434771263687e-07, 3.0182635211973986e-07, 8.629115842671406e-08, 1.4565203546605682e-10]
0.922423603513 0.000144172525697
[1.2615019624356758e-12, 6.212140867719817e-13, 0.988341957031698, 0.01147686136138714, 0.0001671437081106384, 1.3969344914829683e-05, 6.70304052278555e-08, 1.4115423396444955e-09, 5.358535264895298e-11, 5.647368673262926e-11]
0.999933491623 8.55950766537e-17
[5.84929559327365e-13, 0.04275430900189669, 0.11257306130217708, 0.35461273382048475, 0.4865700322685153, 0.0034728360091119807, 1.6779905059307196e-05, 2.2997959177700293e-07, 1.176788323955293e-08, 5.9446949269363095e-09]
0.0251807979816 0.944951917747

These are posterior probabilities for tail_length_range(40, 50, 1), each followed by Pearson's r and the corresponding p-value on the next line, for 4 simulated genes with tail length 42 and the above fragment size distribution.

Note how the 1st one looks good, the 2nd is already much more problematic, the 3rd one much better again & the 4th one is completely off, introducing a shift that causes the unit test to fail.

This was really surprising to me given that all these genes are simulated with identical parameters, so the huge deviations in tail length estimation must be caused by random differences between the samples.

This was done with 450 reads per gene but I'll test bigger sample sizes next.
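The per-gene check isn't spelled out in the thread; one way such a Pearson-r comparison could look is correlating each posterior against the ideal point mass at the true tail length (hypothetical `check_posterior` helper):

```python
import numpy as np
from scipy.stats import pearsonr

def check_posterior(probs, lengths, true_length=42):
    """Correlate a posterior over candidate tail lengths with the
    ideal distribution putting all mass on the true length."""
    target = (np.asarray(lengths) == true_length).astype(float)
    return pearsonr(np.asarray(probs), target)  # (r, p-value)

lengths = np.arange(40, 50)
sharp = (lengths == 42).astype(float)   # a perfect posterior
r, p = check_posterior(sharp, lengths)  # r close to 1.0 here
```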

mschilli87 commented 8 years ago

This is what I get simulating 1000 reads per gene for the above fragment size distribution:

[1.9407078351651253e-17, 6.710958231196733e-12, 0.9591893598014722, 0.039974036912426174, 0.0008338967817822885, 2.705049666523851e-06, 1.4465028651225567e-09, 1.4374977149895275e-12, 1.391907033583953e-15, 8.669026579146442e-20]
0.999139424727 2.39709335636e-12
[7.961801741572024e-06, 6.118239856596728e-05, 0.9892921586687393, 0.010624850470933412, 1.3832604954238043e-05, 1.393837962975556e-08, 1.1653771682288547e-10, 1.3779288676696453e-13, 1.0146790296561322e-14, 1.5247164506130794e-16]
0.999943017078 4.61240276628e-17
[5.725308666394131e-26, 8.255539323724488e-20, 0.9997162811649689, 0.00028274673577458195, 9.71990183227256e-07, 1.0902491896001067e-10, 4.828981436542037e-14, 3.996545797932851e-17, 2.227210888064498e-21, 5.619363994693017e-25]
0.999999960529 1.06189319255e-29
[1.1684775623810784e-05, 0.004993671849427283, 0.9931892000676807, 0.0018054172281961123, 2.6038063470840944e-08, 3.955397798842367e-11, 9.756406872844674e-13, 4.717831882648751e-13, 7.15444480069299e-15, 1.2812345342632306e-18]
0.999987002996 1.24837270461e-19
[1.386596059131873e-06, 0.0048596394717116105, 0.9951366093645703, 2.3581282082888647e-06, 6.438780633767851e-09, 6.691950071390118e-13, 8.846353547232302e-16, 1.1261028760558202e-17, 4.238794139237955e-19, 8.107649111612832e-20]
0.999988213141 8.44431146412e-20
[1.3859905992717268e-11, 0.00011586111122621953, 0.999833665320657, 5.024834221984415e-05, 2.2225599136541803e-07, 2.954964267210693e-09, 1.0810183976410995e-12, 3.8539806121387536e-16, 2.6268003913904463e-19, 1.2689994665759576e-22]
0.999999992845 1.14674280366e-32
[7.8465218062249285e-25, 2.231802888460719e-11, 0.9930469153479121, 0.00688876608736662, 6.399819229265967e-05, 3.2022058454095896e-07, 1.2951497893883887e-10, 1.1026387727034506e-14, 3.945861251813544e-17, 1.7599303805687152e-19]
0.999976253399 1.39114224122e-18
[2.9696832529379845e-13, 0.6799830183725007, 0.3199554019211117, 6.138827404338114e-05, 1.9143061251928764e-07, 1.4255202390388715e-12, 8.226342568964062e-15, 6.638749382697589e-16, 3.1581214960559337e-16, 5.778384470381298e-19]
0.340098284122 0.336279530103

So while increasing the sample size more than 2-fold improved the robustness of the predictions, even at this depth we still see clear outliers, even for this relatively simple distribution.

Also note that while in the previous case (450 reads/gene) we over-estimated the tail length in the case of interest, here (1000 reads/gene) we under-estimate it. So I don't think that there is any directionality in this effect (note that the input distribution is slightly skewed towards longer fragments).

mschilli87 commented 8 years ago

I committed the switch to the above fragment size distribution incl. the necessary increase of the number of simulated reads per gene from 450 to 1500 (https://github.com/rajewsky-lab/polyA/pull/77/commits/02dae4d534f99ba1feaeb94c0c789c095a9ea77d).

Note however, that with just 50 reads fewer (1450 reads/gene) I got the following:

[2.3843768140684402e-27, 2.830751884868016e-13, 0.9970254271913563, 0.0029676234492711554, 6.949136966396309e-06, 2.2210188216486656e-10, 2.1131606113383888e-14, 2.0351379352713304e-18, 4.264777298204097e-23, 2.1862717601872716e-28]
0.999995624645 1.60335207612e-21
[6.3289646781878336e-15, 0.007491590761959842, 0.9925046409156311, 3.724779700591985e-06, 4.3542531867917975e-08, 1.70258778376353e-13, 1.488638556536358e-17, 5.442559879350971e-21, 7.09733039373871e-22, 1.0984823949740132e-27]
0.999971821735 2.75816580933e-18
[1.8171230830746176e-05, 0.0020144330684920566, 0.9972493664259046, 0.0007180284540947389, 8.206765391738758e-10, 1.3291595564133472e-15, 6.984092003962837e-18, 7.073825978590545e-21, 2.1704556987140163e-24, 2.1098280615172175e-28]
0.999997913287 8.29521542483e-23
[6.109485921774387e-11, 0.006801216396084559, 0.9931987802870931, 3.2557087371127057e-09, 1.8738614701913064e-14, 3.478432042264354e-21, 2.814396108234619e-26, 2.5311332142674436e-29, 2.6928324318057627e-32, 1.5445634834233634e-33]
0.999976808877 1.26547353598e-18
[3.3166196248976037e-22, 1.598691247396427e-19, 0.9997361699155801, 0.00026382585095016653, 4.23283600558462e-09, 6.337155753994707e-13, 2.691749902106613e-18, 4.931287704373248e-20, 2.641348421100822e-25, 5.424738025226171e-31]
0.999999965608 6.12108335758e-30
[5.046734670378583e-18, 2.811779674324627e-13, 0.9999999951039438, 4.8956459512295975e-09, 1.2787374535361536e-13, 1.2697029117045974e-15, 1.6486849359036593e-20, 7.278484175744991e-25, 2.467200950927863e-29, 7.773832442675696e-35]
1.0 0.0
[2.8307212303665613e-30, 3.0717115947291643e-24, 0.9994732077176162, 0.0005267899869363544, 2.294108685575821e-09, 1.3388015995174421e-12, 6.551865550055873e-18, 4.5344247032420313e-23, 2.4098031735777854e-27, 3.0775618682611796e-31]
0.999999862799 1.55027926427e-27
[8.654291975308754e-11, 1.5669480614264665e-10, 0.9999999987327134, 1.0225690819976957e-09, 1.4797219482772247e-12, 4.714705261060162e-17, 2.252461401068581e-20, 1.453118649138929e-26, 3.001703823537954e-30, 3.675818481538593e-37]
1.0 0.0
[1.9945467615726222e-23, 3.615742873767955e-15, 0.9999980887603138, 1.911239668935829e-06, 1.361574456387827e-14, 1.5764101206174318e-20, 4.322518999820765e-25, 9.929587876501691e-26, 6.404383082838758e-32, 4.598567481115981e-34]
0.999999999998 4.63025353724e-47
[2.04148338818621e-14, 4.242774506533167e-12, 0.9999946302146393, 5.3697779023597174e-06, 3.1949359764181505e-12, 1.5770484040145607e-16, 1.9266151486215462e-22, 5.6560285663419295e-24, 8.440142812726995e-27, 2.3929317598600458e-32]
0.999999999986 1.79863601749e-43
[0.0004670136287759296, 0.857130971247841, 0.14240180357241863, 2.1155044843357215e-07, 5.160522216443398e-13, 2.350270232296246e-19, 8.428921779277911e-25, 3.7767139856410786e-27, 9.900522334323823e-32, 1.7763925838511122e-34]
0.0552278759741 0.879556835213

So my feeling is that 1500 reads/gene are sufficient for this exact case simulated with the given seed, but in any real example this could still easily break.

To me this is not satisfactory and while I could plug in the real bioanalyzer profile and increase the number of simulated reads until the test passes, I think it would be more helpful to figure out if

  1. we have a bug (in the estimation),
  2. the model is so bad that it simply needs an enormous amount of data, or
  3. we run into numerical problems, and once we avoid those, we are good

Any ideas/alternative hypotheses?

nukappa commented 8 years ago

thanks, very helpful. what was the difference in coverage for the gene that failed? i feel that with a low number of reads (i.e. ~200+ but i don't know exactly) maybe our precision is within the bioanalyzer values, for this case +-20nts. let's check two things tomorrow:

mschilli87 commented 8 years ago

Switching to

f_size_sim = np.array([300, 310, 320, 330, 340, 350, 360])
f_prob_sim = np.array([.02, .12, .2, .3, .2, .13, .03])

resulted in all tests passing with 1500 reads/gene (as expected), but going down to 1450 reads/gene made the test fail again:

[1.5378522916900327e-29, 3.669525313837876e-14, 0.9999899915065682, 9.958875123648815e-06, 4.950991029928015e-08, 1.0828486583850624e-10, 7.623879973092766e-14, 4.2152120539788375e-18, 2.316091228303865e-19, 5.935874718802714e-23]
0.999999999951 2.50539266229e-41
[1.1350016620526029e-16, 0.028790096361439876, 0.9674989468558134, 0.003708678984166392, 2.2777661237221253e-06, 3.2450028568239055e-11, 6.4936303096293416e-15, 4.624455977273954e-18, 1.502430985215576e-20, 4.23973088597953e-24]
0.999566605823 1.54271059481e-13
[0.00016442186875497013, 0.0017845470107763135, 0.9977812008030099, 0.0002687072506299273, 1.1230475423355605e-06, 1.9272801196925463e-11, 1.0974269024454015e-14, 2.8015933409203463e-15, 2.3656908529118917e-18, 3.3869607228983657e-21]
0.999998472012 2.38482601718e-23
[5.788774822920124e-13, 0.00858015026713982, 0.9914198477029613, 2.0290499907677343e-09, 2.6998454971518767e-13, 2.1670031037034664e-18, 4.962466713522687e-23, 2.3130521922903195e-25, 3.7733326584714716e-29, 7.21663317286263e-28]
0.999962943789 8.24902843167e-18
[2.415464971574151e-21, 4.931623025378152e-19, 0.9965189267192426, 0.0034802663107232658, 8.068012973827896e-07, 1.6873535132512604e-10, 1.4225161483059841e-15, 6.747462612427343e-19, 4.2229789889382285e-24, 1.0555729385167392e-27]
0.999993972504 5.77461035783e-21
[3.719231828697996e-18, 5.165881548701048e-15, 0.99999969990882, 2.9506216698350924e-07, 4.9845638902616535e-09, 4.442956744078232e-11, 1.4295576212244435e-14, 6.681018866872057e-17, 5.852532141740759e-19, 2.0027160110841483e-22]
1.0 1.50641932711e-53
[1.2564969096047024e-29, 2.771843014711505e-26, 0.9999903637237828, 9.636274253457086e-06, 1.962497369252916e-12, 1.2699640246446983e-15, 4.4434522141382307e-20, 2.1859822415813603e-24, 3.241914095893032e-27, 7.404472113802038e-33]
0.999999999954 1.93460199173e-41
[1.2062635833458325e-12, 7.593731894903284e-12, 0.9999999996859834, 3.051694178789472e-10, 4.7107730163674816e-14, 3.334564043460014e-17, 7.882865341674034e-21, 7.518039931887295e-26, 1.6426175865422349e-28, 4.431324464962173e-33]
1.0 0.0
[8.721557708691834e-24, 3.457339882393091e-17, 0.9999976364854274, 2.363513699040845e-06, 8.735847769554552e-13, 5.722558239920193e-18, 5.513898607827274e-20, 3.034019970204752e-22, 8.209845107295205e-26, 7.90245149976581e-28]
0.999999999997 2.53346446146e-46
[7.921572231974481e-16, 2.0345611715775155e-11, 0.9999874202800754, 1.2579622139875168e-05, 7.732006716721334e-11, 1.1825726298588344e-13, 2.4812730023893458e-17, 1.4494684958882704e-18, 4.701456779935223e-19, 5.587329315390636e-23]
0.999999999922 1.63179389996e-40
[8.276924346707598e-07, 0.836860388156785, 0.1631198795893221, 1.8904424802964442e-05, 1.3665201787050077e-10, 3.308325212414529e-15, 6.787152812159497e-22, 2.1375350006197484e-25, 9.396963821562087e-31, 7.917798172126044e-33]
0.0840292985015 0.817478317301

Note that it breaks for the same gene with a similar posterior probability distribution.

mschilli87 commented 8 years ago

Here are the results for the following distribution:

f_size_sim = np.array([300, 301, 302, 303, 304, 305, 306])
f_prob_sim = np.array([.02, .12, .2, .3, .2, .13, .03])

1500 reads/gene worked again, 1450 reads/gene fails:

[1.0206594870242274e-36, 4.990426692267784e-09, 0.9999998819418184, 1.1306775491181624e-07, 8.204090795798328e-19, 1.1494196600629723e-31, 3.818618368338791e-45, 1.3421997957701662e-58, 7.640912851599131e-72, 7.918144073093456e-85]
1.0 7.52195880888e-57
[9.121922062668281e-50, 0.009270834931176938, 0.9907288978384194, 2.672304035641155e-07, 2.9475488745752995e-18, 3.17457821305193e-31, 1.0393719354764074e-44, 3.362373703430295e-58, 1.856717337287089e-71, 1.9240836357101977e-84]
0.999956671341 1.54190091581e-17
[4.2120030864984624e-35, 0.00538622007762105, 0.9946136491912408, 1.3073113813870303e-07, 1.0309673716121863e-18, 8.478630412871737e-32, 2.0817365283866143e-45, 6.336423935092587e-59, 3.499000769816702e-72, 3.625953174092908e-85]
0.999985500766 1.9335292232e-19
[2.766099682902426e-64, 1.943757116306812e-05, 0.999976031721393, 4.530707443565408e-06, 2.875117259526414e-16, 7.069200471150592e-29, 2.928414157672681e-42, 9.766428117703482e-56, 5.393063950915207e-69, 5.588737653231438e-82]
0.999999999814 5.21971772449e-39
[3.1653056287750067e-99, 2.607154379901798e-16, 0.9998068181422705, 0.0001931818575855522, 1.4372948576180979e-13, 1.0702051917485102e-25, 8.177510549475042e-39, 3.45065562240439e-52, 1.9643989646476904e-65, 2.035672218909505e-78]
0.999999981563 5.05538714716e-31
[2.23463706359158e-55, 3.1595808346417124e-21, 0.9999667322838613, 3.3267716135376706e-05, 3.2714345223330387e-15, 1.3104917526683922e-27, 5.89100655583306e-41, 2.3910731648811256e-54, 1.403295747291699e-67, 1.4542107886892996e-80]
0.999999999453 3.90472581211e-37
[3.0285123176383265e-83, 1.142207863102893e-23, 0.9993363403292077, 0.0006636596703005937, 4.916736182253376e-13, 4.4531866862525275e-25, 3.082502663278678e-38, 1.370855304973881e-51, 8.045405920146746e-65, 8.337313150886457e-78]
0.999999782176 9.84926030298e-27
[6.365781544786275e-88, 5.749243218232794e-17, 0.9986252424971576, 0.0013747575015723155, 1.2700275487421477e-12, 1.1271921098629669e-24, 7.521568793255464e-38, 2.986292929675179e-51, 1.7000452612854683e-64, 1.761727109190713e-77]
0.999999063831 3.36043090079e-24
[1.905916785034272e-62, 3.065101708546079e-13, 0.9999911275759303, 8.872423762473502e-06, 6.453072763461235e-16, 1.6692399159435738e-28, 7.173036290420014e-42, 2.679603094474846e-55, 1.5254520069392933e-68, 1.5807991796448723e-81]
0.999999999961 9.99195300969e-42
[1.9943853055312862e-42, 1.1096241935552741e-10, 0.9999990873001466, 9.125888908901918e-07, 1.6666146104261207e-17, 3.309213411060597e-30, 1.1684467992916388e-43, 4.1069541496528735e-57, 2.3380184412094812e-70, 2.42284753439568e-83]
1.0 1.25113129857e-49
[2.452391768506468e-28, 0.7470350085783584, 0.25296498583907384, 5.582567820111438e-09, 2.3785377281381264e-20, 2.107549892670725e-33, 5.499642435611269e-47, 1.6739902234649142e-60, 9.243846593230063e-74, 9.579236216431676e-87]
0.223158610983 0.535432991927

Again it is the same gene, and again the profile is slightly better (as in lower certainty for the wrong answer).

mschilli87 commented 8 years ago

Changing the random seed from 42 to 2 made all tests pass for the above distribution with 1450 reads/gene.

mschilli87 commented 8 years ago

With the new seed I can decrease the coverage down to 1050 reads/gene, with 1000 reads/gene it fails:

[2.3423397125434093e-49, 8.3842653780764e-07, 0.9930133865582756, 0.006985772598874615, 2.416311876962356e-09, 7.142909811719875e-18, 4.249298719014421e-27, 2.128020490573492e-36, 1.529665125994679e-45, 1.6972808555802302e-54]
0.999975523803 1.57014962066e-18
[1.9337275298657052e-13, 0.07511599710141476, 0.9248708412234953, 1.316167468067067e-05, 2.15942271880366e-13, 1.8335154056794553e-22, 6.464940277963904e-32, 2.9548692132656585e-41, 2.1240210831823763e-50, 2.3567644055361504e-59]
0.996699500152 5.17101523477e-10
[5.445906420424138e-36, 0.02757136187327696, 0.9724121552476298, 1.6482878623867444e-05, 4.693166517434825e-13, 5.628028736117595e-22, 2.8751272662183716e-31, 1.439844581444902e-40, 1.0349900543007076e-49, 1.1484008983578323e-58]
0.999600783826 1.11071465112e-13
[5.114366687613752e-27, 0.12226304057112058, 0.877726041780067, 1.0917648569708899e-05, 2.427864535138854e-13, 2.6408041615304474e-22, 1.2044056392971851e-31, 5.850635710496887e-41, 4.2055579121056666e-50, 4.666389270402796e-59]
0.990258959039 3.8932641796e-08
[8.375389879397283e-64, 7.075302713339767e-07, 0.9947185581334966, 0.0052807314836455605, 2.852586511696885e-09, 1.2860165676311315e-17, 1.0201723668924354e-26, 5.429859976318744e-36, 3.903095614729941e-45, 4.330784138177e-54]
0.999986066777 1.64883492031e-19
[9.633196784513316e-82, 1.1603833285547302e-38, 0.7968753347020827, 0.20312402015980108, 6.451381046718425e-07, 1.1477076447406645e-14, 1.890450048930908e-23, 1.735488771503617e-32, 1.40914417980559e-41, 1.5635536160773092e-50]
0.967655921487 4.60468904166e-06
[8.382249820486311e-48, 1.4638318360465356e-14, 0.965785133464737, 0.034214838195987224, 2.8339260909679074e-08, 1.7027616894802297e-16, 1.2236560491630281e-25, 6.864076056473839e-35, 5.086638954579226e-44, 5.644016307975885e-53]
0.99937588889 6.63285370216e-13
[6.2849796963226476e-58, 4.862565208109922e-13, 0.9988097028749195, 0.0011902968134060556, 3.1118809070461324e-10, 1.041421987261419e-18, 8.0634197848745755e-28, 4.807270132733796e-37, 3.562438312332848e-46, 3.952798716499189e-55]
0.999999298489 1.05953636234e-24
[7.11340159570969e-64, 1.716639918101832e-15, 0.9796402406605936, 0.020359736425409575, 2.2913994939366726e-08, 1.2655734168175305e-16, 1.0414451523219336e-25, 6.20891421903753e-35, 4.601129805723843e-44, 5.105306645607496e-53]
0.999785782756 9.2105193862e-15
[8.910644743732078e-50, 2.6750750749879505e-12, 0.9976296398645251, 0.002370359635077837, 4.977219998852762e-10, 2.105502109431252e-18, 1.6439821156505104e-27, 9.507100391814042e-37, 7.045258065360284e-46, 7.817254530907293e-55]
0.999997210719 2.64816886716e-22
[1.4427794928882686e-40, 0.0007630465875072801, 0.99915589382275, 8.105958560408193e-05, 4.138623843588316e-12, 5.437898477731183e-21, 2.7779975755840957e-30, 1.3912026794316578e-39, 1.0000252494501717e-48, 1.1096047639076404e-57]
0.999999716334 2.83275778025e-26
[2.204686925545302e-92, 1.5238860703666414e-15, 0.9898238627751363, 0.010176116790408174, 2.0434453828089074e-08, 1.5689439418836433e-16, 2.0324188551654957e-25, 1.368691389895836e-34, 1.0142718238198925e-43, 1.125412431563754e-52]
0.999947690356 3.27550188599e-17
[1.453907261117179e-34, 1.2726955745451956e-09, 0.9999411992894223, 5.879943555266249e-05, 2.3294591415902308e-12, 4.352944215907435e-21, 2.60566751287584e-30, 1.4616443892475021e-39, 1.0831548524407326e-48, 1.201843438431901e-57]
0.999999998292 3.71946777321e-35
[4.321251740151457e-20, 0.0009715682422631756, 0.9985014930389642, 0.0005269386901857407, 2.858685692938234e-11, 4.0609399419681694e-20, 1.6174079875330156e-29, 7.392533978919659e-39, 5.3139062666015304e-48, 5.896186832906762e-57]
0.999999458138 3.77165252897e-25
[1.274890156463695e-63, 6.671366683422388e-07, 0.9942764999559437, 0.005722829610455318, 3.296932600589564e-09, 1.0903188651831562e-17, 7.893974541810835e-27, 4.2015621878883575e-36, 3.020169769034755e-45, 3.3511101498451015e-54]
0.999983619973 3.14940074231e-19
[6.738637145865925e-41, 0.000831614108024159, 0.9986297656506321, 0.0005386201629445999, 7.839915511395989e-11, 2.2292378807588647e-19, 1.36718339718349e-28, 6.846763374278688e-38, 4.921595072039277e-47, 5.4608874535534965e-56]
0.999999569202 1.50686177938e-25
[1.1152782420863414e-55, 3.863566672452779e-12, 0.9991313072885134, 0.0008686925068164793, 2.0080651850190894e-10, 7.933291543720812e-19, 6.528335622706979e-28, 3.892079749386851e-37, 2.88423442640733e-46, 3.200279454474283e-55]
0.999999626625 8.50276933013e-26
[1.0637639406113212e-95, 5.6499686656525516e-33, 0.3887716585913803, 0.6112212372699878, 7.104138408452432e-06, 2.2345664366529188e-13, 4.354658639629056e-22, 3.679381338625097e-31, 2.897878405917064e-40, 3.2154184968829966e-49]
0.467061842192 0.173525148793

Note that this time it's another gene and we over-estimate by one instead of under-estimating as with the previous seed.

I'll have a closer look at the offending genes for the two cases described here.

mschilli87 commented 8 years ago

Switching eps to 10^(-50) results in 40s & 50s for the fragment size distribution in steps of 30! I'll commit this as a hack, open a new issue to get rid of eps eventually & push towards finally getting the real bioanalyzer profile working.

mschilli87 commented 8 years ago

eps = 10e-50 results in 100% probability for 2 (I'm testing tail_range(2, 103, 10) to include the exact correct answer among the tested ones, to avoid re-writing the unit test for now). This was with the discretized bioanalyzer profile (steps of 5) for simulation. I'll test eps = 10e-200 next.

mschilli87 commented 8 years ago

eps = 10e-200 failed, too. I'll try eps = 10e-500 next. If that doesn't work, I guess we'll have to resolve #79 first.

nukappa commented 8 years ago

the last two comments refer to the "real" bioanalyzer profile should i assume?

update: how many reads do you generate? it shouldn't break with a few reads...

mschilli87 commented 8 years ago

Yes. And with eps = 10e-500 I get estimate_length.py:371: RuntimeWarning: divide by zero encountered in log, but still the estimated probabilities are [1.0, 0.0, 0.0, ..., 0.0]. I suggest we fix #79 in a separate branch (based on current master) and, once all tests still pass without the usage of eps, merge the fix to master, from there to the simulation branch, & try again with the real bioanalyzer profile. Do you want me to tackle #79 or would you prefer to do this yourself?
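For what it's worth, one general way to sidestep an eps floor entirely is to stay in log space per read and normalize with a log-sum-exp at the end (a sketch of the general technique under my own naming, not the actual fix for #79):

```python
import numpy as np
from scipy.special import logsumexp

def posterior_from_log_likelihoods(log_lik):
    """log_lik: (n_lengths, n_reads) array of per-read log-likelihoods.
    Summing logs avoids the underflow that multiplying ~1000 small
    probabilities (and flooring them with eps) runs into."""
    joint = log_lik.sum(axis=1)              # joint log-likelihood per length
    return np.exp(joint - logsumexp(joint))  # normalized posterior

# 1000 reads: the raw per-length products would be ~1e-3000 and underflow
# to 0.0, but the log-space posterior still comes out normalized.
log_lik = np.tile(np.log([[1e-3], [2e-3]]), (1, 1000))
posterior = posterior_from_log_likelihoods(log_lik)
```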

nukappa commented 8 years ago

ok, it should anyway be straightforward. If you're in the fixing mood please go ahead, otherwise assign me and i'll try to fix it today.

mschilli87 commented 8 years ago

So now it gets interesting:

After pulling in 7db4854, steps of 1, 10 & 20 work as before: mainly 40s, some 50s, maybe a 30 somewhere. For steps of 30 I don't get 10s anymore, but only 2 40s and the rest are 20s (100%). This was with eps = 10e-12 & 1,500 reads/gene.

mschilli87 commented 8 years ago

Switching to eps = 10e-20 resulted in 40s & 50s everywhere again.

mschilli87 commented 8 years ago

I just double-checked: undoing 7db4854, I get 10s everywhere; re-applying it gives me the 20s + two 40s.

nukappa commented 8 years ago

ok so we miss something in the bugfix... i'll try to find out what

mschilli87 commented 8 years ago

see https://github.com/rajewsky-lab/polyA/pull/81#issuecomment-232639678

mschilli87 commented 8 years ago

I pulled in e5b9a2f, but I still get the same result (20s + two 40s) for steps of 30 with eps = 10e-12.

mschilli87 commented 8 years ago

As expected (I guess), setting eps = 10e-20 results in many 40s and a few 50s. There's still a single 20, though.

mschilli87 commented 8 years ago

Setting eps = 10e-25 removes the last 20 and increases the number of 50s compared to 40s.

I don't see how the results can still depend on eps after 7db4854 & e5b9a2f.

mschilli87 commented 8 years ago

Running your bug_revealed.py script, I now get 0.0 for 10 and 20 up to 1,000 reads. At 1,250 reads ('Oops...'), I don't get 100% for 10 anymore (still with eps = 10e-12), but I have a small non-zero probability for 20 again.

Maybe that helps to figure out what is going on..?

mschilli87 commented 8 years ago

If I start a python3 session on my machine, 10e-324 == 0 is False while 10e-325 == 0 is True.

Just collecting random information that might help.
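That threshold is the edge of the IEEE 754 double subnormal range, which can be checked directly:

```python
import sys

print(sys.float_info.min)  # smallest *normal* double, ~2.2e-308
print(5e-324)              # smallest representable subnormal
print(10e-324 == 0.0)      # False: 1e-323 still has a subnormal representation
print(10e-325 == 0.0)      # True: 1e-324 rounds down to 0.0
```

Any per-read probability below roughly 5e-324 therefore becomes exactly 0.0, and its log becomes -inf.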

mschilli87 commented 8 years ago

I tried to set all p_real < eps to p_used = eps while flagging them as zero. Thus, every tail length that has a p_real < eps for a single read will be set to zero when leaving log space.

Still, I get the 20s as before.

This was done with eps = 10e-12.
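The flooring-with-flags idea could be sketched like this (hypothetical `floor_and_flag` helper and names; not the actual patch):

```python
import numpy as np

def floor_and_flag(p, eps=10e-12):
    """Floor probabilities below eps, but remember which entries were
    effectively zero so they can be forced back to zero after leaving
    log space."""
    flagged = p < eps
    return np.where(flagged, eps, p), flagged

p = np.array([1e-20, 0.5])
floored, flags = floor_and_flag(p)
```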

mschilli87 commented 8 years ago

I might have found the offending line but I'll make sure by testing before I tell you. ;)

mschilli87 commented 8 years ago

Embarrassingly enough, with 1a09296 I get 40s & 50s only for steps of 30 with eps = 10e-12.

nukappa commented 8 years ago

thanks a lot! frankly speaking, when i saw you writing this i found it strange that python accepts this :) what about the other tests?

mschilli87 commented 8 years ago

I just didn't think about Python for a second. It makes perfect sense that this is not valid in Python. I would have expected an error or at least a warning, though.

Other tests running.

I'll go through the checklist of #81 now.

mschilli87 commented 8 years ago

Switching to the real bioanalyzer profile (discretized in steps of 5) still results in 100% for the shortest tail length.

As I have no clue what else it could be, I guess I'll finish off #81 so we can merge that into master (as it clearly improves the implementation) before we (again) start over to make more & more complex test cases to see where it breaks.

nukappa commented 8 years ago

man ey... ok thanks. i'll play locally to see if i can come up w sth else

nukappa commented 8 years ago

ok check the following (you might want to rename the bioanalyzer profile)

import os
import numpy as np
from estimate_length import *

pAi = [{'start' : 1000, 'end' : 0, 'strand' : '+', 'is_tail' : True}]

bio_size = []
bio_intensity = []

with open(os.path.join('test_data', 'ds_012_50fix_bioanalyzer.txt'), 'r') as f:
    for line in f:
        bio_size.append(int(line.split()[0]))
        bio_intensity.append(float(line.split()[1]))

f_size, f_prob = discretize_bioanalyzer_profile(np.array(bio_size), np.array(bio_intensity), 150)

# simple and naive uniform reads simulation
reads = np.array([])
for offset in range(1000, 1080):
    reads = np.concatenate((reads, np.repeat(offset - f_size, np.round(f_prob * 10).astype(int))))

Lrange = tail_length_range(10, 500, 20)

print (estimate_poly_tail_length(list(reads), Lrange, pAi, 0, f_size, f_prob, False))

with this tail range, the answer is 10

with a range

Lrange = tail_length_range(10, 500, 10)

the answer is 80! (aka the correct one)

with a range

Lrange = tail_length_range(10, 100, 10)

also correct. Same with other ranges that include 80. For all the ones that don't include it, the answer comes up as 10.....

nukappa commented 8 years ago

as a quick test, can you check your simulations, even with a few reads, but with tail range range(10, 45, 1) ? and with the real profile

mschilli87 commented 8 years ago

7308c7d finally introduced simulating fragment sizes from the (discretized) bioanalyzer profile. I'll increase the number of reads per gene to make it pass the stringent unit test.

Then we'll finally be able to close this one (finally).

mschilli87 commented 8 years ago

Once #93 is merged & propagated into the master branch, 0d8f3bb will close this issue for good.

Next, I'd like to cross #90 off the list.