Improved the plotting. Each inchikey is now weighted equally in the plots, by sampling 1 spectrum per inchikey multiple times.
Also updated the pipeline for training and plotting.
The tanimoto scores and ms2deepscores for the validation spectra are now calculated during a training run, while the plotting can also easily be run separately. This is helpful because creating the plots does not take long and is the part expected to change most (formatting, new ways of visualization, etc.). The validation reference score calculation, in contrast, can take more than an hour, so it should be run directly after training.
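As an illustration of the sampling idea, here is a minimal sketch and not the code in this PR. The function name, the `(inchikey, spectrum_id)` tuples and the `seed` parameter are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_one_spectrum_per_inchikey(spectra, nr_of_sampling_times, seed=42):
    """Return one dict per sampling round, mapping each inchikey to one spectrum id.

    `spectra` is assumed to be a list of (inchikey, spectrum_id) tuples.
    """
    rng = random.Random(seed)

    # Group the spectrum ids by inchikey.
    spectra_per_inchikey = defaultdict(list)
    for inchikey, spectrum_id in spectra:
        spectra_per_inchikey[inchikey].append(spectrum_id)

    # In every round each inchikey contributes exactly one spectrum,
    # so inchikeys with many spectra do not dominate the plots.
    return [
        {inchikey: rng.choice(ids) for inchikey, ids in spectra_per_inchikey.items()}
        for _ in range(nr_of_sampling_times)
    ]


example_spectra = [("AAAA", "s1"), ("AAAA", "s2"), ("AAAA", "s3"), ("BBBB", "s4")]
samples = sample_one_spectrum_per_inchikey(example_spectra, nr_of_sampling_times=5)
```

Repeating the sampling several times means that inchikeys with many spectra still have most of their spectra used across rounds, while every single round weights the inchikeys equally.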
To do:
[x] Make it a true histogram (with bars, instead of continuous)
[x] Make it easy to plot more bins by automatically scaling the figure. Using 100 bins still gives nice results.
[x] Make the number of bins depend on the number of pairs. This prevents the histograms with a low number of pairs from having too much noise, while still giving an almost continuous representation for the histograms with many pairs (especially relevant when going for more than 10 bins).
[x] Automatically use fewer ms2deepscore bins if a peak would be so high that it overlaps another histogram. With fewer, wider bins the total bar area stays the same, but the height decreases, so the bars fit on the screen again.
[x] Made sure the scaling is stable independent of the data.
[x] Improve clarity of variable names
[x] Added a bargraph showing percentage of total pairs on the side
[x] Added a separate function for the reverse plot, with different predefined bins.
[x] Test and integrate the functions for selecting one spectrum per inchikey into the wrapper pipelines
[x] Add and integrate the plot of matching spectra within the inchikey.
[x] Clean and upload the notebooks and delete the outdated or uninformative ones.
[x] Adjust nr of starting bins to nr of sampling times
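
The two bin-scaling items above could look roughly like the sketch below. It is not the implementation in this PR: the function names, the `pairs_per_bin` value, the 10 and 100 bin limits, the halving step and the assumption that scores lie in [0, 1] are all chosen for illustration only.

```python
def select_nr_of_bins(nr_of_pairs, pairs_per_bin=100, min_bins=10, max_bins=100):
    """Scale the number of bins with the number of pairs (hypothetical rule)."""
    return max(min_bins, min(max_bins, nr_of_pairs // pairs_per_bin))


def histogram_densities(scores, nr_of_bins):
    """Density histogram over [0, 1]; the bar areas sum to 1.

    Assumes `scores` is non-empty.
    """
    counts = [0] * nr_of_bins
    for score in scores:
        # Clamp the index so that scores of exactly 0 or 1 fall in the outer bins.
        index = max(0, min(int(score * nr_of_bins), nr_of_bins - 1))
        counts[index] += 1
    bin_width = 1 / nr_of_bins
    return [count / (len(scores) * bin_width) for count in counts]


def reduce_bins_until_fits(scores, nr_of_bins, max_height, min_bins=1):
    """Halve the number of bins until the highest bar fits below max_height."""
    while nr_of_bins > min_bins and max(histogram_densities(scores, nr_of_bins)) > max_height:
        # Wider bins keep the total area at 1 but lower the peak.
        nr_of_bins //= 2
    return max(nr_of_bins, min_bins)


# A sharply peaked histogram: all 100 scores are identical.
peaked_scores = [0.5] * 100
fitted_nr_of_bins = reduce_bins_until_fits(peaked_scores, nr_of_bins=100, max_height=20)
```

Because the histograms are normalised to a fixed area, lowering the number of bins is enough to bring a sharp peak back within the space reserved for one histogram.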