Closed anushka255 closed 2 weeks ago
I went through the code for this CSV/plot and it mostly makes sense. It's trying to show that if we focus on the most frequently sampled molecules (towards the origin on the x-axis), the enrichment factor (observed samples / what we would expect from a distribution where known molecules had a uniform chance of showing up irrespective of sampling frequency) goes up, and that it goes down as we start looking at all sampled molecules (i.e. moving right along the x-axis).
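To pin down what I mean by enrichment factor, here is a minimal sketch on toy data. The function name `enrichment_factor` and the 0/1 encoding of `known` are my assumptions for illustration, not necessarily what the pipeline does; the list is assumed to be sorted by decreasing sampling frequency already.

```python
def enrichment_factor(known, rank):
    """Observed known molecules in the top `rank` rows over the uniform expectation.

    `known` is assumed to be a list of 0/1 flags, already sorted by decreasing
    sampling frequency, containing at least one known molecule.
    """
    observed = sum(known[:rank])
    # Under a uniform null, each row is known with probability sum(known) / len(known),
    # so the top `rank` rows are expected to hold rank * that fraction.
    expected = rank * sum(known) / len(known)
    return observed / expected


# Toy data: 3 known molecules out of 10, concentrated near the top of the ranking.
known = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for rank in (2, 5, 10):
    print(rank, enrichment_factor(known, rank))
```

With this definition the value at the last rank (all rows) is 1 by construction, which matches the curve dropping as we move right along the x-axis.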
What I'm a little confused about is:
y = deepmet["known"].tolist()
...
obs = sum(y[0:rank])
For the above claim to be true, it seems like we're assuming that the y values are already sorted by decreasing frequency (?). This may well be true based on the previous steps in the pipeline, but if so, perhaps a programmatic check, or at least a comment can be added in the code to make it explicit.
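Something along these lines could serve as the programmatic check. This is only a sketch: the helper name `assert_sorted_descending` is hypothetical, and I'm assuming `size` is the column that holds the sampling frequency.

```python
def assert_sorted_descending(frequencies):
    """Raise ValueError if the values are not in non-increasing order."""
    for i in range(1, len(frequencies)):
        if frequencies[i] > frequencies[i - 1]:
            raise ValueError(
                f"row {i} has frequency {frequencies[i]}, which is greater than "
                f"{frequencies[i - 1]} in the row before it"
            )


# In the pipeline this would be called on something like deepmet["size"].tolist()
# (assumption: "size" is the sampling-frequency column) before sum(y[0:rank]).
assert_sorted_descending([950, 400, 400, 12, 1])  # passes silently, ties are allowed
```

If the data is still in a DataFrame at that point, I believe `deepmet["size"].is_monotonic_decreasing` would give the same answer in one line.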
The other thing is the presence of x_rnd (the random mode, as opposed to the true mode, in the generated CSV file).
x_rnd = deepmet.sample(frac=1)["size"].tolist()
This seems to be some sort of control to show that the number of test molecules we're able to recall is indeed a function of sampling frequency and not some other aspect of the trained model. For it to be useful though, it seems that the true and random modes are best plotted on the same graph, which is what the latest commit does.
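To make the control idea concrete, here is a toy sketch. It is not the code from that commit. I shuffle the known flags directly as a stand-in for whatever x_rnd does in the real script, and the `mode` / `rank` / `enrichment_factor` row layout is my assumption about a shape that would let both modes go on one plot.

```python
import random


def enrichment_curve(known, ranks):
    """Enrichment factor at each rank, relative to a uniform expectation."""
    known_rate = sum(known) / len(known)
    return [sum(known[:r]) / (r * known_rate) for r in ranks]


rng = random.Random(0)  # fixed seed so the toy control is reproducible

# Toy "true" ordering: all 20 known molecules sit at the top of 100 rows.
known_true = [1] * 20 + [0] * 80
# Toy "random" ordering: same flags, with the link to sampling frequency destroyed.
known_rnd = known_true[:]
rng.shuffle(known_rnd)

ranks = [10, 20, 50, 100]
rows = []
for mode, known in (("true", known_true), ("random", known_rnd)):
    for rank, ef in zip(ranks, enrichment_curve(known, ranks)):
        rows.append({"mode": mode, "rank": rank, "enrichment_factor": ef})

for row in rows:
    print(row)
```

If the recall really is driven by sampling frequency, the true rows should sit well above 1 at small ranks while the random rows hover around 1, and both should meet at 1 at the final rank.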
As for the scatterplot above not following the shapes of the ones in the pdf exactly (looking too linear), it may just be an effect of how we're binning the ranks:
ranks = (
    list(range(10, 100, 10))
    + list(range(100, 1000, 100))
    + list(range(1000, 10000, 1000))
    + list(range(10000, 100000, 10000))
    + list(range(100000, 1000000, 100000))
    + list(range(1000000, deepmet.shape[0], 1000000))
    + [deepmet.shape[0]]
)
Perhaps a different binning strategy was used to generate the graph in the pdf? Or the enrichment factor was averaged for all the folds?
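For reference while comparing binning strategies, the current grid can be written as a small function. This is a sketch with a hypothetical name (`rank_grid`); as far as I can tell it reproduces the list above when the table has between 1 and 10 million rows, and unlike the hard-coded ranges it never emits a rank larger than the table, which would matter on smaller inputs.

```python
def rank_grid(n_rows):
    """Nine ranks per decade (10, 20, ..., 90, 100, 200, ...), ending at n_rows."""
    ranks = []
    step = 10
    while step < n_rows:
        # One decade: step, 2 * step, ..., 9 * step, stopping before n_rows.
        for rank in range(step, 10 * step, step):
            if rank >= n_rows:
                break
            ranks.append(rank)
        step *= 10
    # Always finish with the full table so the last point covers every molecule.
    ranks.append(n_rows)
    return ranks


print(rank_grid(2500))
```

With only nine points per decade, any curvature within a decade is sampled fairly coarsely on a log axis, so a denser grid might be worth trying before concluding that the shape really differs from the pdf.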
All of these points are best confirmed with @skinnider so we're sure we're not misunderstanding anything.
Hi Vineet, in order:
Thanks @skinnider. @anushka255, bullet point 1 might be good to check, since it may not be happening. I also know for sure that 3 is not being done: the dots in this scatterplot are one per fold.
Currently this code is generating the following scatterplot of enrichment_factor and # of top ranked molecules, on a log scale for both axes. (The figure is the result of combining all the forecast files from each fold.)