kusterlab / prosit

Prosit offers high-quality predicted MS2 spectra for any organism and protease, as well as iRT prediction. If Prosit is helpful for your research, please cite Gessulat, Schmidt et al. 2019, DOI 10.1038/s41592-019-0426-7
https://www.proteomicsdb.org/prosit/
Apache License 2.0

Problems of training iRT prediction model from scratch using released data #22

Closed wj-zhang closed 5 years ago

wj-zhang commented 5 years ago

Hi! Prosit is a fascinating method to predict both spectra and iRT. I am really interested in it.

In order to do further research based on the Prosit model, I tried to train the iRT prediction model from scratch using the released code and data. However, I ran into the following difficulties:

  1. I downloaded the iRT model file (model_irt_prediction.zip) from the given link https://figshare.com/projects/Prosit/35582. The loss function in config.yml is masked_spectral_distance, but the paper says "the mean squared error was used as loss function".

  2. I downloaded the iRT prediction data file (irt_PROSIT.hdf5) from https://figshare.com/projects/Prosit/35582. I found that "X_train" holds iRT values whereas "Y_train" holds peptide sequences. Most importantly, the numbers of training, validation, and holdout samples are 349136, 87455, and 169339 respectively, i.e. ratios of 57.6%, 14.4%, and 27.9%, but the paper says "The remaining data were split into 64% training data, 16% test data and 20% holdout data". Am I misunderstanding something?

  3. Although the trained weights for iRT prediction are released, I want to reproduce comparable results by training the model on the released data from scratch. I fixed the aforementioned problems by changing "masked_spectral_distance" to "mean_squared_error" in config.yml and swapping "X_train" and "Y_train" in irt_PROSIT.hdf5, and then trained the iRT prediction model on the released data. However, with my trained weights the loss values (mean squared error) on the validation and holdout datasets are 0.0229 and 0.0126, while with the released weights they are 0.0071 and 0.0054.
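For what it's worth, the split ratios in point 2 can be recomputed directly from the reported sample counts with a small Python sketch:

```python
# Recompute the split ratios from the sample counts reported above for
# irt_PROSIT.hdf5 (train / validation / holdout = 349136 / 87455 / 169339).
counts = {"train": 349136, "validation": 87455, "holdout": 169339}
total = sum(counts.values())
ratios = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(ratios)
# {'train': 57.6, 'validation': 14.4, 'holdout': 27.9}
# i.e. not the 64% / 16% / 20% split described in the paper.
```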

Can you give me some suggestions about the training settings for the iRT model and the data used?

tkschmidt commented 5 years ago

Hi, sorry for my late response, but I was occupied last week. 1-3) This is my fault. iRT and intensity were trained separately at the beginning, and I didn't use the same Prosit framework as my co-author Sigi, so I just converted my files to his style at the end. 1) The loss was indeed mean squared error. 2-3) Let me have a closer look at the scripts that generated the dataset file and I will come back to this issue.
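Since the confirmed loss is the mean squared error, here is a minimal NumPy sketch of that loss for reference (not Prosit's actual implementation, just the standard definition):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Plain mean squared error: the average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

print(mean_squared_error([0.0, 1.0, 2.0], [0.0, 0.0, 0.0]))  # (0+1+4)/3 ≈ 1.667
```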

tkschmidt commented 5 years ago

Hey, sorry for the late response. You can download the original data here: https://syncandshare.lrz.de/dl/fiTRXfyD3m3KunKbVpWZrDN2, but I will also update the figshare folder. This should give you similar results. P.S. For an upcoming paper I compared two models, one with the small error and a bigger one (~0.011), on an additional extra holdout dataset from ProteomeTools, and the differences are marginal.

wj-zhang commented 5 years ago

Hi,

Thank you so much for your help! I really appreciate it!

Best regards

mantouRobot commented 5 years ago

Hi, @tkschmidt,

As mentioned in the paper, you use two iRT datasets: the training dataset released on figshare and the refinement dataset. Is the trained model released on figshare based on the released dataset alone, or on both?

For the figshare-released dataset, the iRT value distribution is as follows: [image: iRT value distribution] Obviously, the iRT values are scaled to z-scores. But the paper says that this scaling is applied to the refinement dataset, not to the figshare-released dataset. I'm confused.
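For context, z-score scaling of iRT values (as the paper describes for the refinement dataset) would look roughly like this sketch, using toy values rather than the real data:

```python
import numpy as np

def zscore(values):
    """Scale values to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

irt = [12.5, 34.0, 56.5, 78.0]   # toy iRT values, not the real dataset
scaled = zscore(irt)
print(scaled.mean(), scaled.std())  # mean ≈ 0, std ≈ 1 after scaling
```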

For the refinement dataset, the iRT value distribution is as follows: [image: iRT value distribution] We can see that a large proportion of the values are around zero. Is that reasonable?

In addition, neither of the two datasets contains the amino acid C (cysteine). Could you tell me why?

Thanks.