Closed: NKruszewska closed this issue 2 years ago
I think the second approach is better. We will soon send you an explanation of how to calculate the persistence length from atom positions.
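In case it helps to align on the method before that explanation arrives: a common way to estimate the persistence length from atom positions is to fit the exponential decay of the bond-vector correlation, `<cos theta(s)> ~ exp(-s * l_b / l_p)`. A minimal sketch, not necessarily the procedure that will be sent; the function name and fitting choices are mine:

```python
import numpy as np

def persistence_length(positions):
    """Estimate the persistence length l_p from backbone atom positions
    by fitting <cos theta(s)> ~ exp(-s * l_b / l_p), where l_b is the
    mean bond length and s the separation along the chain in bonds."""
    bonds = np.diff(np.asarray(positions, dtype=float), axis=0)
    lengths = np.linalg.norm(bonds, axis=1)
    units = bonds / lengths[:, None]          # unit bond vectors
    n = len(units)
    mean_cos = np.array([(units[:n - s] * units[s:]).sum(axis=1).mean()
                         for s in range(1, n)])
    # fit only the initial decay, up to the first non-positive correlation
    cut = int(np.argmax(mean_cos <= 0)) if np.any(mean_cos <= 0) else len(mean_cos)
    s_vals = np.arange(1, cut + 1)
    slope = np.polyfit(s_vals, np.log(mean_cos[:cut]), 1)[0]
    return -lengths.mean() / slope
```

For a stiff chain the correlation decays slowly and l_p comes out large; for a random coil it decays within a few bonds.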
So we shall calculate entropy from the aggregated histogram over subsequent pairs of mers. I would not recommend plotting entropy over realizations, as we would lose our error calculus and would not be able to compare this quantity with others. But what about plotting entropy (with quartile envelopes) over pairs of angles and ions? Then we would have \Phi_14 vs. \Psi_14 and \Phi_13 vs. \Psi_13 for Na^+, Mg^2+ and Ca^2+, i.e. 6 points on the plot for HA and 6 points on the plot for CS. So we have 2 plots with envelopes. @NKruszewska @Piotr8706 what do you think?
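A minimal sketch of the entropy calculation from an aggregated 2D angle histogram, as I read the proposal (the function name and bin count of 100 are assumptions):

```python
import numpy as np

def angle_entropy(phi, psi, bins=100):
    """Shannon entropy (in nats) of the (phi, psi) distribution,
    estimated from a 2D histogram over [-180, 180] x [-180, 180].
    phi and psi hold the angles aggregated over all mer pairs."""
    counts, _, _ = np.histogram2d(
        phi, psi, bins=bins, range=[[-180, 180], [-180, 180]])
    p = counts / counts.sum()      # empirical bin probabilities
    p = p[p > 0]                   # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))
```

One value per (angle pair, ion) combination would then give the 6 points per plot mentioned above.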
Raised PR #18 with code and placed histograms with all mers combined in the same place in gdrive for Albumin+CS6
> I will not recommend plotting entropy over realizations
I subscribe to this point of view. Plotting anything against realisation number is, in my opinion, wrong, as it may lead our readers to assume that our realisations were done in a particular, meaningful order, while as far as I understand they are independent.
@soycapitan if you can show us the plot I suggested for the entropy of CS6, it would be helpful, thanks.
@soycapitan Our realizations are in order: 001 is the best docked (the greatest binding energy), 010 the worst docked. The order is by binding energy after docking. That's why I think that a chart Entropy(realizations) could be ok. We can see whether one parameter is somehow a function of the other. But the idea of @kdomino is also very good for me. @Piotr8706 what do you think?
@NKruszewska if there is some difference between realizations other than the probabilistic one, the situation is different. However, I was told that they differ only by the seed of the RNG.
I guess they are reshuffled with respect to binding energy. If so, we also need to plot the binding energy values and discuss them. Still, we lose the probabilistic analysis there, and we cannot compare two values since we do not know the error of estimation.
I believe that they are sorted via binding energy. @Piotr8706 are they?
@kdomino I can't understand why we can't compute errors? Could you explain it? I thought it is the same situation as before, only that we do not distinguish whether a given Psi1-4 is inside the 1st mer or inside the 5th mer. We treat all Psi1-4 as the same angle because it is between the same kind of atoms.
@NKruszewska right, but we did 10 or 12 realizations in order to have the quantile envelope of entropy. Hence if we plot entropy over realizations we cannot have this envelope, which was used to estimate the error. Am I right?
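The envelope mentioned here could be computed per condition from the per-realization entropies, treating the realizations as exchangeable samples so their order never enters the estimate (a sketch; the function name is mine):

```python
import numpy as np

def quartile_envelope(values):
    """Median and interquartile envelope over realizations.

    `values` has shape (n_realizations,): one entropy estimate per
    realization. Returns (median, lower quartile, upper quartile)."""
    return (np.median(values),
            np.percentile(values, 25),
            np.percentile(values, 75))
```

Plotting the median with the (q25, q75) band per angle pair and ion gives the "points with envelopes" described above.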
But we can do the plot of entropy over realizations, as there is the suggestion that the scatter plots for HA may differ between realizations while those for SG6 do not.
In the case of docking ranking vs. MD-based ranking, we should stick to the more physically stable complexes, i.e., time stability. For HA, the order is as follows: 2, 7, 10, 1, 5, 3, 9, 6, 8, 4 from best to worst. Mind that the complex numbers vary with the position (and conformation) of HA at the HSA surface, and this matters from the physical point of view.
Do we have 10 or 12 realizations for HA?
We had 12 for HA, but I don't know how many have been included.
12, I think.
The order, MD(docking): 1(2), 2(7), 3(10), 4(1), 5(5), 6(3), 7(9), 8(11), 9(12), 10(6), 11(8), 12(4)
It is getting complicated and time is passing. So my proposition is as follows. For now, let us plot entropy for the various angles and ions on one plot with envelopes. Then we can discuss that the realizations were meant to be independent, but that they can be ordered by some number, e.g. the docking order. Then, e.g. for HA, we can plot entropy vs. docking numbers and see whether it is monotonic or not. @NKruszewska @Piotr8706, what do you think?
We can do it like that.
@Piotr8706 ok, but the MD order also depends on the ions, so I suppose you have 3 different MD orders? The docking order does not depend on ion addition, so I was talking only about your docking order.
@kdomino Yes! Thank you.
You are right; let's keep it by docking ranking
@Piotr8706 so for the docking ranking, shall we use some energy, as @NKruszewska suggested, or will we have a map realization -> order for each ion and polymer?
I think we will just explain that the order presented is due to affinity after docking, without showing the differences between MD and docking.
But we need to know in which order the realizations should be plotted. So we need the measure of affinity after docking.
Yes, isn't it what I have said?
@soycapitan Does the file named e.g. "realisation10_Na_hist2D_mainchain_ϕ₁₄_ψ₁₄_b100.png" contain the histogram over all angles Phi1-4, Psi1-4 of a single realization? I'm asking because the values in the legend are small: the greatest number is 16. I could be wrong, but I thought that the sum of all numbers should be 24000 (also, would it be possible not to write the number 16 as 1.6e+01?).
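On the second question: matplotlib's colorbar accepts a format string, so plain integers can be shown instead of scientific notation. A sketch with synthetic stand-in data (the file name and data are hypothetical, not the project's):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, safe on a server
import matplotlib.pyplot as plt

# hypothetical stand-in for 24000 (phi, psi) pairs of one realization
rng = np.random.default_rng(0)
phi = rng.uniform(-180, 180, 24000)
psi = rng.uniform(-180, 180, 24000)

fig, ax = plt.subplots()
h = ax.hist2d(phi, psi, bins=100,
              range=[[-180, 180], [-180, 180]], cmap=plt.cm.YlOrRd)
# '%d' prints plain integers on the colorbar (e.g. 16, not 1.6e+01)
fig.colorbar(h[3], ax=ax, format='%d')
fig.savefig("hist2d_plain_ticks.png")
```

The returned `h[0]` holds the raw bin counts, so `h[0].sum()` is a quick check of how many angles actually entered the histogram.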
According to the code it should be ok, but I had the same concerns, hence I looked into the code many times.
@soycapitan please add an assertion test on the number of data points in the histograms.
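The assertion asked for here could look like the following sketch (a wrapper I made up; variable names follow the thread but are assumptions):

```python
import numpy as np

def checked_hist2d(x_data, y_data, numbins=100):
    """Build the 2D angle histogram and assert that no angles were lost.

    np.histogram2d silently drops values outside `range`, so comparing
    the bin-count total against the input length catches lost angles."""
    assert len(x_data) == len(y_data), "phi/psi columns differ in length"
    counts, xedges, yedges = np.histogram2d(
        x_data, y_data, bins=numbins, range=[[-180, 180], [-180, 180]])
    assert counts.sum() == len(x_data), (
        f"histogram holds {int(counts.sum())} of {len(x_data)} angles")
    return counts, xedges, yedges
```

If any angle falls outside [-180, 180] (or a wrong column is read), the second assertion fires instead of producing a silently sparse plot.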
@soycapitan was it done? Maybe you could produce an Excel or txt document besides this graphical form of the histograms? We could then see the values inside the matrix and sum them.
@soycapitan @kdomino I'm pretty sure that the histograms do not contain the angles from all mers inside one realisation. I tested it in Excel: I took CS_1 with NaCl and plotted the first two Psi1-4 angles (column 8 vs. 9 and column 12 vs. 13 on one plot) and I got: On the right is what we got using your program. Maybe we didn't understand each other about what we want to obtain? I wanted to get col8 + col12 + col16 + ... on the X axis and col9 + col13 + col17 + ... on the Y axis, of course one as a function of the other.
@soycapitan Thank you! Can I ask you to dump txt files of the histograms? I still can't understand what we have on the histograms. In the link above you present the histogram for CS6_4 with Ca. I took the source file (.tab) and plotted only 5 of the 24 columns (so 5000 points): As you can see, many angles are lost. I can't understand what happened to them.
@NKruszewska @soycapitan I think I understand what is wrong. 1 - on the plot we have a density, so the number of samples is equal to the area of a bin times what is on the color scale. As we have 100 x 100 bins, each is of size 3.6 x 3.6 \approx 13. So the bin with 25 has approx 300 elements. The things in the circles appear rarely, so on the color scale they may correspond to e.g. 0.1, which is not distinguishable from the background. If you plot the scatter plot, you see the rare points but you do not see the frequency of the frequent points. @NKruszewska we had this issue before, about the frequency plot, remember? We ended up with a strange renormalisation of the histograms.
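For what it's worth, the density/count relation used here can be checked numerically (on synthetic stand-in data, not the project's files): with `density=True` the bins hold a probability density, so the raw count in a bin is density × bin area × N, and the density integrates to one over the whole plane.

```python
import numpy as np

# hypothetical stand-in for 24000 (phi, psi) pairs of one realization
rng = np.random.default_rng(1)
phi = rng.uniform(-180, 180, 24000)
psi = rng.uniform(-180, 180, 24000)

counts, xe, ye = np.histogram2d(phi, psi, bins=100,
                                range=[[-180, 180], [-180, 180]])
dens, _, _ = np.histogram2d(phi, psi, bins=100,
                            range=[[-180, 180], [-180, 180]], density=True)
bin_area = (xe[1] - xe[0]) * (ye[1] - ye[0])  # 3.6 * 3.6 = 12.96 deg^2
# counts recovered bin by bin from the density
recovered = dens * bin_area * 24000
```

So a color-scale value is only a raw count divided by (bin area × N); whether 25 on the scale means ~300 samples depends on which normalization the plotting code actually applied, which is exactly what the thread goes on to resolve.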
What I propose:
@kdomino Now I understand everything! Thank you.
@soycapitan we can try first with the log scale
I changed to the log scale; the plots follow, etc. Pull request #30 created. @NKruszewska if you think something is missing, there may be a problem with reading the columns; we then need to go through the code together.
@kdomino The plots look similar to the previous ones. I still can't see most of the angles (I was looking at the last case, CS6_4 Ca). For example, there are a lot of positive Psi1-4 and all of them are missing from the plot.
@kdomino can you open the common .txt file and write to it the array from which you make the histogram? Maybe it will help to see what is inside. @soycapitan can you send @NKruszewska some exemplary file? It looks like we are taking the wrong columns.
@soycapitan @kdomino I think I found our angles:
I changed:

```python
plt.hist2d(x, y, bins=numbins, range=[[-180, 180], [-180, 180]],
           norm=mpl.colors.LogNorm(vmin=0.1, vmax=100), cmap=plt.cm.YlOrRd)
```

to

```python
plt.hist2d(x_data, y_data, bins=numbins, range=[[-180, 180], [-180, 180]],
           norm=mpl.colors.LogNorm(vmin=0.1, vmax=100), cmap=plt.cm.YlOrRd)
```

Am I right?
I think that we don't need this LogNorm(). Without normalization plots look like this:
And we were wrong. It is not a density - it is the number of angles in each bin. Now the sum over all bins is 24k/23k.
Maybe let's normalize the plot but:
em.. well... we then need... you know... the numbers written in scientific notation, xE-x, as it was at the beginning ;)
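Since normalized densities are small numbers, the scientific tick labels can be brought back with a colorbar format string. A sketch on synthetic stand-in data (the data and file name are hypothetical):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, safe on a server
import matplotlib.pyplot as plt

# hypothetical stand-in for 24000 (phi, psi) pairs of one realization
rng = np.random.default_rng(2)
phi = rng.uniform(-180, 180, 24000)
psi = rng.uniform(-180, 180, 24000)

fig, ax = plt.subplots()
res = ax.hist2d(phi, psi, bins=100, range=[[-180, 180], [-180, 180]],
                density=True, cmap=plt.cm.YlOrRd)
# '%.1e' renders the small density values in scientific notation
fig.colorbar(res[3], ax=ax, format='%.1e')
fig.savefig("hist2d_density_sci_ticks.png")
```

With `density=True`, `res[0].sum()` times the bin area equals one, which is one way to sanity-check the normalization.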
Yes, you are!
Great finding, now the histograms look much more populated. I converted your change recommendation into #34
@soycapitan Thank you. My final line is:

```python
res = plt.hist2d(x_data, y_data, bins=numbins, range=[[-180, 180], [-180, 180]],
                 density=True, cmap=plt.cm.YlOrRd)
```

and I used `f.write("Sum: " + str(np.sum(res[0])))` to test the results.
I'm not sure which numbers look better - normalized or not.
@NKruszewska Thank you. updated #34 - now with your final line and reintroduced scientific notation.
It looks great now. Thank you. @soycapitan Your previous colors were better than the current ones (now they are yellow). Do you remember what you were using?
@NKruszewska so we can close this issue
Yes. Closing.
As we are planning what to present in the manuscript, I want to share two propositions here and ask for advice (mainly @Piotr8706 and @kdomino) on which one is the most valuable to use. It turned out that there are two approaches, and we didn't notice that earlier:

a) as it is done now by @soycapitan: one histogram per 2 columns (so 24 histograms for the 1-4 angles and 23 histograms for the 1-3 angles). Then entropy is presented as a function of the mer number (column number). We then have 60 such entropy charts (2 chains x 10 realisations x 3 ions). I have a problem analyzing this, because readers could ask why the entropies of similar angles differ - they are the same angles between the same chemical groups. Maybe this is because of the proximity to some amino acids of albumin, etc. We are short on time, so such an analysis is hard to do. But we can see a lot from such charts - it is absolutely worth studying these separated angles.

b) one histogram per realization; then the entropy chart can be drawn as a function of the realization number. Then we would have only 6 entropy charts, and we could compare them with the binding energy between albumin and the chain (HA or CS). In this situation I can compare the results e.g. with Hexaire et al. (2000). In this case we lose some information, because we are looking at the chain as a whole - some "coarse graining". But we are also computing the persistence length of the chain, and there we also look at the chain as a whole. There is a little risk, though: we don't know what we will get and whether we will really be able to compare the results.

@Piotr8706 what did you plan to present at the start, when you planned this publication?