Closed: NKruszewska closed this issue 2 years ago
I think the second approach is better. We will soon send you an explanation of how to calculate the persistence length from atom positions.
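In case it helps to align on the method before that explanation arrives: a common way to estimate the persistence length from atom positions is to fit the exponential decay of the bond-vector correlation, `<cos theta(s)> ~ exp(-s * l_b / l_p)`. A minimal sketch, not necessarily the procedure that will be sent; the function name and fitting choices are mine:

```python
import numpy as np

def persistence_length(positions):
    """Estimate the persistence length l_p from backbone atom positions
    by fitting <cos theta(s)> ~ exp(-s * l_b / l_p), where l_b is the
    mean bond length and s the separation along the chain in bonds."""
    bonds = np.diff(np.asarray(positions, dtype=float), axis=0)
    lengths = np.linalg.norm(bonds, axis=1)
    units = bonds / lengths[:, None]          # unit bond vectors
    n = len(units)
    mean_cos = np.array([(units[:n - s] * units[s:]).sum(axis=1).mean()
                         for s in range(1, n)])
    # fit only the initial decay, up to the first non-positive correlation
    cut = int(np.argmax(mean_cos <= 0)) if np.any(mean_cos <= 0) else len(mean_cos)
    s_vals = np.arange(1, cut + 1)
    slope = np.polyfit(s_vals, np.log(mean_cos[:cut]), 1)[0]
    return -lengths.mean() / slope
```

For a stiff chain the correlation decays slowly and l_p comes out large; for a random coil it decays within a few bonds.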
So we shall calculate entropy from the aggregated histogram over subsequent pairs of mers. I would not recommend plotting entropy over realizations, as we would lose our error calculus and would not be able to compare this quantity with others. But what about plotting entropy (with quartile envelopes) over pairs of angles and ions? Then we would have \Phi_14 vs. \Psi_14 and \Phi_13 vs. \Psi_13 for Na^+, Mg^2+ and Ca^2+, i.e. 6 points on the plot for HA and 6 points on the plot for CS. So we have 2 plots with envelopes. @NKruszewska @Piotr8706 what do you think?
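A minimal sketch of the entropy calculation from an aggregated 2D angle histogram, as I read the proposal (the function name and bin count of 100 are assumptions):

```python
import numpy as np

def angle_entropy(phi, psi, bins=100):
    """Shannon entropy (in nats) of the (phi, psi) distribution,
    estimated from a 2D histogram over [-180, 180] x [-180, 180].
    phi and psi hold the angles aggregated over all mer pairs."""
    counts, _, _ = np.histogram2d(
        phi, psi, bins=bins, range=[[-180, 180], [-180, 180]])
    p = counts / counts.sum()      # empirical bin probabilities
    p = p[p > 0]                   # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))
```

One value per (angle pair, ion) combination would then give the 6 points per plot mentioned above.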
Raised PR #18 with code and placed histograms with all mers combined in the same place in gdrive for Albumin+CS6
> I will not recommend plotting entropy over realizations
I subscribe to this point of view. Plotting anything against realisation number is, in my opinion, wrong, as it may lead our readers to assume that our realisations were done in a particular, meaningful order, while as far as I understand they are independent.
@soycapitan if you can show us the plot I suggested for the entropy of CS6, it would be helpful, thanks.
@soycapitan Our realizations are in order: 001 is the best docked (the greatest binding energy), 010 the worst docked. The order is by binding energy after docking. That's why I think that a chart Entropy(realizations) could be ok. We can see whether one parameter is somehow a function of the other. But the idea of @kdomino is also very good for me. @Piotr8706 what do you think?
@NKruszewska if there is some difference between realizations other than the probabilistic one, the situation is different. However, I was told that they differ only by the seed of the RNG.
I guess they are reshuffled with respect to binding energy. If so, we also need to plot the binding energy values and discuss them. Still, we lose the probabilistic analysis there, and we cannot compare two values since we do not know the error of estimation.
I believe that they are sorted via binding energy. @Piotr8706 are they?
@kdomino I can't understand why we can't compute errors? Could you explain it? I thought it is the same situation as before, only that we do not distinguish whether a given Psi1-4 is inside the 1st mer or inside the 5th mer. We treat all Psi1-4 as the same angle because it is between the same kind of atoms.
@NKruszewska right, but we did 10 or 12 realizations in order to have the quantile envelope of entropy. Hence if we plot entropy over realizations we cannot have this envelope, which was used to estimate the error. Am I right?
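The envelope mentioned here could be computed per condition from the per-realization entropies, treating the realizations as exchangeable samples so their order never enters the estimate (a sketch; the function name is mine):

```python
import numpy as np

def quartile_envelope(values):
    """Median and interquartile envelope over realizations.

    `values` has shape (n_realizations,): one entropy estimate per
    realization. Returns (median, lower quartile, upper quartile)."""
    return (np.median(values),
            np.percentile(values, 25),
            np.percentile(values, 75))
```

Plotting the median with the (q25, q75) band per angle pair and ion gives the "points with envelopes" described above.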
But we can do the plot of entropy over realizations, as there is the suggestion that the scatter plots for HA may differ between realizations while those for SG6 do not.
In the case of docking ranking vs. MD-based ranking, we should stick to the more physically stable complexes, i.e., time stability. For HA, the order is as follows: 2, 7, 10, 1, 5, 3, 9, 6, 8, 4 from best to worst. Mind that the complex numbers vary with the position (and conformation) of HA at the HSA surface, and this matters from the physical point of view.
Do we have 10 or 12 realizations for HA?
We had 12 for HA, but I don't know how many have been included.
12, I think.
The order, MD(docking): 1(2), 2(7), 3(10), 4(1), 5(5), 6(3), 7(9), 8(11), 9(12), 10(6), 11(8), 12(4)
It is getting complicated and time is passing. So my proposition is as follows. For now, let us plot entropy for the various angles and ions on one plot with envelopes. Then we can discuss that the realizations were meant to be independent, but that they can be ordered by some number, e.g. the docking order. Then, e.g. for HA, we can plot entropy vs. docking numbers and see whether it is monotonic or not. @NKruszewska @Piotr8706, what do you think?
We can do it like that.
@Piotr8706 ok, but the MD order also depends on the ions, so I suppose you have 3 different MD orders? The docking order does not depend on ion addition, so I was talking only about your docking order.
@kdomino Yes! Thank you.
You are right; let's keep it by docking ranking
@Piotr8706 so for the docking ranking, shall we use some energy, as @NKruszewska suggested, or will we have a map realization -> order for each ion and polymer?
I think we will just explain that the order presented is due to affinity after docking, without showing the differences between MD and docking.
But we need to know in which order the realizations should be plotted. So we need the measure of affinity after docking.
Yes, isn't it what I have said?
@soycapitan Does the file named e.g. "realisation10_Na_hist2D_mainchain_ϕ₁₄_ψ₁₄_b100.png" contain the histogram over all angles Phi1-4, Psi1-4 of a single realization? I'm asking because the values in the legend are small: the greatest number is 16. I could be wrong, but I thought that the sum of all numbers should be 24000 (also, would it be possible not to write the number 16 as 1.6e+01?).
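On the second question: matplotlib's colorbar accepts a format string, so plain integers can be shown instead of scientific notation. A sketch with synthetic stand-in data (the file name and data are hypothetical, not the project's):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, safe on a server
import matplotlib.pyplot as plt

# hypothetical stand-in for 24000 (phi, psi) pairs of one realization
rng = np.random.default_rng(0)
phi = rng.uniform(-180, 180, 24000)
psi = rng.uniform(-180, 180, 24000)

fig, ax = plt.subplots()
h = ax.hist2d(phi, psi, bins=100,
              range=[[-180, 180], [-180, 180]], cmap=plt.cm.YlOrRd)
# '%d' prints plain integers on the colorbar (e.g. 16, not 1.6e+01)
fig.colorbar(h[3], ax=ax, format='%d')
fig.savefig("hist2d_plain_ticks.png")
```

The returned `h[0]` holds the raw bin counts, so `h[0].sum()` is a quick check of how many angles actually entered the histogram.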
According to the code it should be ok, but I had the same concerns, hence I looked into the code many times.
@soycapitan please add an assertion test on the number of data points in the histograms.
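The assertion asked for here could look like the following sketch (a wrapper I made up; variable names follow the thread but are assumptions):

```python
import numpy as np

def checked_hist2d(x_data, y_data, numbins=100):
    """Build the 2D angle histogram and assert that no angles were lost.

    np.histogram2d silently drops values outside `range`, so comparing
    the bin-count total against the input length catches lost angles."""
    assert len(x_data) == len(y_data), "phi/psi columns differ in length"
    counts, xedges, yedges = np.histogram2d(
        x_data, y_data, bins=numbins, range=[[-180, 180], [-180, 180]])
    assert counts.sum() == len(x_data), (
        f"histogram holds {int(counts.sum())} of {len(x_data)} angles")
    return counts, xedges, yedges
```

If any angle falls outside [-180, 180] (or a wrong column is read), the second assertion fires instead of producing a silently sparse plot.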
@soycapitan was it done? Maybe you could produce an Excel or txt document besides this graphical form of the histograms? We could then see the values inside the matrix and sum them.
@soycapitan @kdomino I'm pretty sure that the histograms do not contain the angles from all mers inside one realisation. I tested it in Excel: I took CS_1 with NaCl and plotted the first two Psi1-4 angles (column 8 vs. 9 and column 12 vs. 13 on one plot) and I got: On the right is what we got using your program. Maybe we didn't understand each other about what we want to obtain? I wanted to get col8 + col12 + col16 + ... on the X axis and col9 + col13 + col17 + ... on the Y axis, of course one as a function of the other.
@soycapitan Thank you! Can I ask you to dump txt files of the histograms? I still can't understand what we have on the histograms. In the link above you present the histogram for CS6_4 with Ca. I took the source file (.tab) and plotted only 5 of the 24 columns (so 5000 points): As you can see, many angles are lost. I can't understand what happened to them.
@NKruszewska @soycapitan I think I understand what is wrong. 1 - on the plot we have a density, so the number of samples is equal to the area of a bin times what is on the color scale. As we have 100 x 100 bins, each is of size 3.6 x 3.6 \approx 13. So the bin with 25 has approx 300 elements. The things in the circles appear rarely, so on the color scale they may correspond to e.g. 0.1, which is not distinguishable from the background. If you plot the scatter plot, you see the rare points but you do not see the frequency of the frequent points. @NKruszewska we had this issue before, about the frequency plot, remember? We ended up with a strange renormalisation of the histograms.
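For what it's worth, the density/count relation used here can be checked numerically (on synthetic stand-in data, not the project's files): with `density=True` the bins hold a probability density, so the raw count in a bin is density × bin area × N, and the density integrates to one over the whole plane.

```python
import numpy as np

# hypothetical stand-in for 24000 (phi, psi) pairs of one realization
rng = np.random.default_rng(1)
phi = rng.uniform(-180, 180, 24000)
psi = rng.uniform(-180, 180, 24000)

counts, xe, ye = np.histogram2d(phi, psi, bins=100,
                                range=[[-180, 180], [-180, 180]])
dens, _, _ = np.histogram2d(phi, psi, bins=100,
                            range=[[-180, 180], [-180, 180]], density=True)
bin_area = (xe[1] - xe[0]) * (ye[1] - ye[0])  # 3.6 * 3.6 = 12.96 deg^2
# counts recovered bin by bin from the density
recovered = dens * bin_area * 24000
```

So a color-scale value is only a raw count divided by (bin area × N); whether 25 on the scale means ~300 samples depends on which normalization the plotting code actually applied, which is exactly what the thread goes on to resolve.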
What I propose:
@kdomino Now I understand everything! Thank you.
@soycapitan we can try first with the log scale
I changed to the log scale; the plots follow, etc. Pull request #30 created. @NKruszewska if you think something is missing, there may be a problem with reading the columns; we then need to go through the code together.
@kdomino The plots look similar to the previous ones. I still can't see most of the angles (I was looking at the last case, CS6_4 Ca). For example, there are a lot of positive Psi1-4 and all of them are missing from the plot.
@kdomino can you open the common .txt file and write to it the array from which you make the histogram? Maybe it will help to see what is inside. @soycapitan can you send @NKruszewska some exemplary file? It looks like we are taking the wrong columns.
@soycapitan @kdomino I think I found our angles:
I changed:

```python
plt.hist2d(x, y, bins=numbins, range=[[-180, 180], [-180, 180]],
           norm=mpl.colors.LogNorm(vmin=0.1, vmax=100), cmap=plt.cm.YlOrRd)
```

to

```python
plt.hist2d(x_data, y_data, bins=numbins, range=[[-180, 180], [-180, 180]],
           norm=mpl.colors.LogNorm(vmin=0.1, vmax=100), cmap=plt.cm.YlOrRd)
```

Am I right?
I think that we don't need this LogNorm(). Without normalization plots look like this:
And we were wrong. It is not a density - it is the number of angles in each bin. Now the sum over all bins is 24k/23k.
Maybe let's normalize the plot but:
em.. well... we then need... you know... the numbers written in scientific notation, xE-x, as it was at the beginning ;)
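Since normalized densities are small numbers, the scientific tick labels can be brought back with a colorbar format string. A sketch on synthetic stand-in data (the data and file name are hypothetical):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, safe on a server
import matplotlib.pyplot as plt

# hypothetical stand-in for 24000 (phi, psi) pairs of one realization
rng = np.random.default_rng(2)
phi = rng.uniform(-180, 180, 24000)
psi = rng.uniform(-180, 180, 24000)

fig, ax = plt.subplots()
res = ax.hist2d(phi, psi, bins=100, range=[[-180, 180], [-180, 180]],
                density=True, cmap=plt.cm.YlOrRd)
# '%.1e' renders the small density values in scientific notation
fig.colorbar(res[3], ax=ax, format='%.1e')
fig.savefig("hist2d_density_sci_ticks.png")
```

With `density=True`, `res[0].sum()` times the bin area equals one, which is one way to sanity-check the normalization.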
Yes, you are!
Great finding, now the histograms look much more populated. I converted your change recommendation into #34
@soycapitan Thank you. My final line is:

```python
res = plt.hist2d(x_data, y_data, bins=numbins, range=[[-180, 180], [-180, 180]],
                 density=True, cmap=plt.cm.YlOrRd)
```

and I used `f.write("Sum: " + str(np.sum(res[0])))` to test the results.
I'm not sure which numbers look better - normalized or not.
@NKruszewska Thank you. updated #34 - now with your final line and reintroduced scientific notation.
It looks great now. Thank you. @soycapitan Your previous colors were better than the current ones (now they are yellow). Do you remember what you were using?
@NKruszewska so we can close this issue
Yes. Closing.
As we are planning what to present in the manuscript, I want to share two propositions here and ask for advice (mainly @Piotr8706 and @kdomino) on which one is the most valuable to use. It turned out that there are two approaches, and we didn't notice that earlier:

a) as it is done now by @soycapitan: one histogram per 2 columns (so 24 histograms for the 1-4 angles and 23 histograms for the 1-3 angles). Then entropy is presented as a function of the mer number (column number). We then have 60 such entropy charts (2 chains x 10 realisations x 3 ions). I have a problem analyzing this, because readers could ask why the entropies of similar angles differ - they are the same angles between the same chemical groups. Maybe this is because of the proximity to some amino acids of albumin, etc. We are short on time, so such an analysis is hard to do. But we can see a lot from such charts - it is absolutely worth studying these separated angles.

b) one histogram per realization; then the entropy chart can be drawn as a function of the realization number. Then we would have only 6 entropy charts, and we could compare them with the binding energy between albumin and the chain (HA or CS). In this situation I can compare the results e.g. with Hexaire et al. (2000). In this case we lose some information, because we are looking at the chain as a whole - some "coarse graining". But we are also computing the persistence length of the chain, and there we also look at the chain as a whole. There is a little risk, though: we don't know what we will get and whether we will really be able to compare the results.

@Piotr8706 what did you plan to present at the start, when you planned this publication?