skinniderlab / CLM

Plots for forecast #202

Closed - anushka255 closed this issue 2 weeks ago

anushka255 commented 1 month ago

Currently this code generates the following scatterplot of enrichment_factor against the number of top-ranked molecules, with both axes on a log scale. (The figure combines the forecast files from all folds.)

fold_enrichment

vineetbansal commented 1 month ago

I went through the code for this csv/plot and it mostly makes sense - it's trying to show that if we focus our attention on the most frequently sampled molecules (towards the origin on the x-axis), the enrichment factor (observed known molecules divided by what we would expect if known molecules had a uniform chance of showing up, irrespective of sampling frequency) goes up, and that it goes down as we start looking at all sampled molecules (i.e. moving right along the x-axis).
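
To make my reading concrete, this is roughly how I understand the enrichment factor at a given rank cutoff (a sketch of my understanding, not necessarily the exact code in the repo):

    # Sketch: enrichment factor at a given rank cutoff.
    def enrichment_factor(y, rank):
        """y: 0/1 list of "known" flags, already ordered by descending sampling
        frequency; rank: how many top-ranked molecules to consider."""
        obs = sum(y[:rank])               # observed known molecules among the top `rank`
        exp = sum(y) * rank / len(y)      # expected count if "known" were spread uniformly
        return obs / exp if exp > 0 else float("nan")

    # e.g. with the combined forecast table loaded as `deepmet`:
    # y = deepmet["known"].tolist()
    # enrichment_factor(y, rank=100)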

What I'm a little confused about is:

y = deepmet["known"].tolist()
...
obs = sum(y[0:rank])

For the above claim to be true, it seems like we're assuming that the y values are already sorted by decreasing frequency (?). This may well be true based on the previous steps in the pipeline, but if so, perhaps a programmatic check, or at least a comment, could be added to the code to make it explicit.
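
Something along these lines would work as the programmatic check (a sketch; it assumes the sampling frequency lives in the "size" column):

    # Fail fast if the rows are not ordered by sampling frequency, largest first.
    assert deepmet["size"].is_monotonic_decreasing, \
        "deepmet must be sorted by descending sampling frequency before computing obs"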

The other thing is the presence of x_rnd (the random mode as opposed to the true mode in the generated csv file).

x_rnd = deepmet.sample(frac=1)["size"].tolist()

This seems to be a control to show that the number of test molecules we're able to recall really is a function of sampling frequency and not some other aspect of the trained model. For it to be useful, though, the true and random modes are best plotted on the same graph, which is what the latest commit does:

forecast_roc_3
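
For reference, an overlay could look roughly like this (a minimal sketch using the enrichment_factor helper above, not the code from the commit, which appears to plot ROC curves; the shuffled series plays the role of x_rnd, and the cutoffs are illustrative):

    import matplotlib.pyplot as plt

    y_true = deepmet["known"].tolist()
    # Control: shuffle the rows so "known" status is decoupled from sampling frequency.
    y_rnd = deepmet.sample(frac=1, random_state=0)["known"].tolist()

    cutoffs = [10, 100, 1000, 10000, len(y_true)]   # illustrative rank cutoffs
    fig, ax = plt.subplots()
    ax.plot(cutoffs, [enrichment_factor(y_true, r) for r in cutoffs], marker="o", label="true")
    ax.plot(cutoffs, [enrichment_factor(y_rnd, r) for r in cutoffs], marker="o", label="shuffled")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("# of top-ranked molecules")
    ax.set_ylabel("enrichment factor")
    ax.legend()
    fig.savefig("fold_enrichment_overlay.png")   # hypothetical output name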

As for the scatterplot above not exactly following the shapes of the ones in the pdf (it looks too linear), it may just be an effect of how we're binning the ranks:

    # rank cutoffs for the x-axis: steps of 10 up to 100, steps of 100 up to 1,000,
    # and so on, ending at the total number of molecules in the forecast table
    ranks = (
        list(range(10, 100, 10))
        + list(range(100, 1000, 100))
        + list(range(1000, 10000, 1000))
        + list(range(10000, 100000, 10000))
        + list(range(100000, 1000000, 100000))
        + list(range(1000000, deepmet.shape[0], 1000000))
        + [deepmet.shape[0]]
    )

Perhaps a different binning strategy was used to generate the graph in the pdf? Or was the enrichment factor averaged across all the folds?
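
If it helps, one alternative to try (just a guess at what the pdf might have used): generate the cutoffs geometrically so the points land evenly spaced on the log axis, e.g.:

    import numpy as np

    # ~50 roughly log-spaced rank cutoffs from 10 up to the total number of molecules,
    # with duplicates removed after rounding down to integers.
    ranks = np.unique(np.geomspace(10, deepmet.shape[0], num=50).astype(int)).tolist()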

All of these points are best confirmed with @skinnider so we're sure we're not misunderstanding anything.

skinnider commented 1 month ago

Hi Vineet, in order:

  1. yes, the first iteration of the code assumes the y values are sorted by frequency in descending order. I agree it would be a good idea to double-check this or perhaps just re-sort anyway (see the sketch after this list).
  2. I agree that the observed and randomized ROC/PR curves should be visualized on the same plot.
  3. re: the scatterplot looking “too linear”: I agree that this is bizarre and my first thought is that it points to some kind of bug. In the code that was used to make the plot in the slides, frequencies were drawn from the ‘freq_avg’ file that aggregates over CV folds (excluding the fold in which the held-out molecule was part of the training set). Is it possible that this scatterplot reflects some train-test set leakage? The ranks are the same ones that were used to generate the plot in the slides, so I don't think this is the issue.
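
For point 1, the re-sort could be as simple as the following (a sketch, assuming the sampling frequency lives in the "size" column):

    # Re-sort by sampling frequency, highest first, regardless of upstream ordering.
    deepmet = deepmet.sort_values("size", ascending=False).reset_index(drop=True)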

vineetbansal commented 1 month ago

Thanks @skinnider - @anushka255, bullet point 1 would be good to check (it may not be happening), and I know for sure that 3 is not being done: the dots in this scatterplot are for each fold separately, not aggregated across folds.