usnistgov / UniSpec

Implementation of UniSpec, a deep learning model for predicting full fragment ion peptide spectra.

Problems encountered in UniSpec #2

Open 494686678 opened 3 weeks ago

494686678 commented 3 weeks ago

After obtaining the UniSpec source code and Python scripts from this webpage, and the generated datasets and analysis results from Zenodo (https://zenodo.org/records/10052268), I found that the input files specified by the predict and train modules of UniSpec were missing when I ran them. Could you please provide the sample input files at your earliest convenience?

daima2023 commented 3 weeks ago

We appreciate your interest in trying out our model, UniSpec.

To help you solve the problem you are experiencing, we would first like to get more details about your run, for example, whether you ran predict.py to generate a predicted msp from (1) the provided validation/test datasets or (2) peptide label files. If you agree, please share your predict.yaml and predict.py files. Also, please provide your original error message.

494686678 commented 3 weeks ago

UniSpec_files.zip

Thank you for your response.

I have included the predict.yaml, predict.py files, the error message, and the UniSpecPred_Validation-Test files downloaded from Zenodo (https://zenodo.org/records/10052268) in the zip archive. However, when running the code, I found that the msp files specified in the predict.yaml were not fully provided in the UniSpecPred_Validation-Test files, which caused the run to fail. I hope you can provide me with a solution.

daima2023 commented 2 weeks ago

Thank you for the detailed information.

I have prepared a modified example file for you, predict-dongdi-new.yaml, that can help you predict spectra from the ValidUniq dataset. The generated prediction file, _pred_validUniq2022418202333.msp, can be found in the temp subdirectory.

I added some comments in the yaml file, hoping to provide a better understanding of the prediction tasks.

The answer to your error message is provided below.

  1. The error (FileNotFoundError: [Errno 2] No such file or directory: 'UniSpec/UniSpecPred_Validation-Test/UniSpecPred_valuation.msp') means that the file could not be found at the relative path given in your yaml file, which seems incorrect. Judging from your png file, the actual path may be "E:/DongDi/UniSpec/UniSpecPred_Validation-Test/UniSpecPred_valuation.msp". If your msp is not in a subdirectory of your current working directory, you should use the full path.

  2. The dsets key is for experimental datasets. These UniSpec datasets are available from the UniSpec Zenodo download folder "UniSpec-Datasets"; see readme4UniSpecDatasets.txt.
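To illustrate point 1, here is a minimal sketch of how a relative path resolves against the current working directory. The helper name `describe_path` is hypothetical and is not part of UniSpec; it only uses the standard `pathlib` module.

```python
from pathlib import Path


def describe_path(path_str, base=None):
    """Resolve path_str against base (default: current working directory).

    Returns the resolved path and whether something exists there.
    Hypothetical helper for illustration; not part of UniSpec.
    """
    path = Path(path_str)
    base_dir = Path(base) if base is not None else Path.cwd()
    # An absolute path is used as-is; a relative one is joined to base_dir.
    resolved = path if path.is_absolute() else base_dir / path
    return resolved, resolved.exists()


# Relative path taken from the error message in this thread.
resolved, exists = describe_path(
    "UniSpec/UniSpecPred_Validation-Test/UniSpecPred_valuation.msp"
)
print("resolved to:", resolved)
print("exists:", exists)
```

If `exists` is False, either start the script from the directory that contains the `UniSpec` folder or put the full path in the yaml file.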

In summary, predict.yaml supports any of the following three tasks:

  a. Predict based on the provided experimental dataset and write to an msp file.
  b. Predict based on a txt file containing spectral labels and write to an msp file.
  c. Calculate the cosine score between the experimental dataset and the predicted dataset.
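For readers unfamiliar with the cosine score mentioned in the last task, the following is a generic sketch of cosine similarity between two intensity vectors. It assumes the peaks of both spectra have already been matched into vectors of equal length; how UniSpec itself matches peaks or transforms intensities before scoring is not shown in this thread, so treat this as an illustration only.

```python
import math


def cosine_score(experimental, predicted):
    """Cosine similarity between two equal-length intensity vectors.

    Generic formula for illustration; UniSpec's own scoring may
    preprocess the spectra differently.
    """
    if len(experimental) != len(predicted):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(experimental, predicted))
    norm_exp = math.sqrt(sum(x * x for x in experimental))
    norm_pred = math.sqrt(sum(y * y for y in predicted))
    if norm_exp == 0 or norm_pred == 0:
        return 0.0  # an empty spectrum has no defined direction
    return dot / (norm_exp * norm_pred)


# Proportional intensities give a score close to 1; disjoint peaks give 0.
similar = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
disjoint = cosine_score([1.0, 0.0], [0.0, 1.0])
```

With non-negative intensities the score lies between 0 and 1, and a value close to 1 means the predicted relative intensities closely track the experimental ones.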

Hope this helps, and let me know how it goes.

predict-dongdi-new.yaml.txt

494686678 commented 2 weeks ago

Thank you for your prompt response. After importing predict-dongdi-new.yaml into the source code, the predict module now runs successfully and generates MSP files. However, the Train module fails because a txt file specified in Train.yaml is missing. I have compressed the Train module source code, the Train.yaml file, and the error screenshot into a zip file and attached it here. I hope you can provide a solution based on the error information.

UniSpec_files_message.zip

daima2023 commented 2 weeks ago

Sorry for the missing files. Please check the zip for possible missing files and the modified train_new.yaml.

The number of epochs in train_new.yaml has been changed back from 33 to 20.

missing_files.zip

494686678 commented 1 week ago

Thank you for your response. With your patient explanations and the provided files, the UniSpec training and prediction modules are now running successfully. However, after importing the predict-dongdi-new.yaml.txt file into the prediction module, the Cosine_score values in the generated MSP files are mostly around 0.15. I have attached a portion of the MSP file generated by the prediction module. Could you please let me know whether these Cosine_score values are within a normal range?

pred_msp_file.zip

494686678 commented 1 week ago

In my research, I would like to use the UniSpec Train module code to train a new model with my own training dataset and then use the predict module code to make predictions on my test dataset with the new model. However, my dataset contains PSM identification result data, which is not in MSP format. Could you please let me know if you are aware of any tools for converting data into MSP format?

I would appreciate it very much if you could help me. Thanks a lot!

daima2023 commented 6 days ago

The MSP you uploaded shows that there is a problem with the predicted spectra, so the low cosine scores are to be expected. If you look at UniSpecPred_valuniq.msp in the UniSpec download folder UniSpecPred_Validation-Test, you will see the correct predictions we provide. From this MSP file alone, I cannot tell why the predictions are completely wrong. If you need further assistance, please upload all of the files you used to generate this file: predict.py, predict.yaml, the dsets file, and the png files showing the relevant file directory.

daima2023 commented 6 days ago

A related question about your prediction issue: did you use a model you trained yourself, or the original model we provided?

494686678 commented 6 days ago

After checking the predict.yaml file, I found that the cause of the error was that the model imported in the file was not the original model you provided. After changing the model in the predict.yaml file to the original model you provided, the MSP file generated by the prediction is roughly the same as the UniSpecPred_valuniq.msp you gave. I am very grateful for your help.

494686678 commented 6 days ago

In my research, I would like to use the Train module code of UniSpec, combined with my own dataset, to train a new model. To do this, I need to convert my dataset into the three types of txt formats required by the Train module. However, I encountered two problems during the conversion process:

  1. My dataset only provides NCE values, not eV values. Could you please provide me with a method to convert NCE values to eV values?

  2. In the fpostrain file, a series of numeric indices is provided. I would like to know how these numeric indices are obtained and what they specifically represent.

I would appreciate it very much if you could help me. Thanks a lot!

daima2023 commented 5 days ago

Based on your questions, I have the following suggestions:

  1. Our CE conversion is currently limited to Orbitrap instruments (QE, Fusion Lumos, Elite, and Velos). If your data is from one of these instruments, you might consider calling the function NCE2eV in utils.py.

  2. Regarding the fpostrain file, Joel kindly provided an explanation for an example in fpostest.txt, "0 2596 4865 5959 7656 8378...", as follows: the first number is 0. In a python script (or any other computing language), assume that f is a file pointer object. If I say f.seek(0), then the file will be read starting at byte position 0. That particular position is where the first spectrum starts. After entering f.seek(0), if you then enter f.readline(), you get the first line of the spectrum, i.e. the labels, and then if you enter it again, you get the first peak, i.e. (mz, intensity, ion, etc.). The next number 2596 is the starting byte position of the next spectrum in the file. And so on.

  3. I am not aware of any available tools that can generate NIST MSP files with detailed fragment ion annotations. Note that while OrgMassSpecR can be used to write msp, it does not appear to include fragment annotations.

  4. I recommend that you become familiar with utils.py, a very useful utility script for developing UniSpec.

Hope these can be helpful to you. Please let me know if you have further questions.
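The seek/readline idea in point 2 can be sketched as follows. This is a minimal illustration on made-up data, not UniSpec code: the function name `read_spectrum` and the toy content are hypothetical, and the file is handled as bytes because the stored positions are byte positions.

```python
import io


def read_spectrum(handle, offsets, index):
    """Return the raw bytes of spectrum number `index`.

    `offsets` holds the starting byte position of each spectrum,
    in the same spirit as the numbers in fpostest.txt.
    Hypothetical helper for illustration; not part of UniSpec.
    """
    handle.seek(offsets[index])
    if index + 1 < len(offsets):
        # Read up to the start of the next spectrum.
        return handle.read(offsets[index + 1] - offsets[index])
    return handle.read()  # last spectrum: read to end of file


# Toy "file": each spectrum is one label line followed by peak lines.
toy = io.BytesIO(b"label-1\npeak-a\npeak-b\nlabel-2\npeak-c\n")
toy_offsets = [0, 22]  # "label-2" starts at byte 22

first = read_spectrum(toy, toy_offsets, 0)
second = read_spectrum(toy, toy_offsets, 1)
```

Storing the offsets this way lets a reader jump straight to any spectrum without scanning the whole file first.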

jlapin1 commented 4 days ago

Hey there, this is Joel, one of the authors of UniSpec. Originally we wrote this repository just with the project in mind, and not so much as a general tool for model training, but maybe we can bridge the gap between the data you have and the format that our code expects.

To train a model you need 3 txt files in addition to the config files:

  1. ion_stats_train.txt: This is a list of ions found in your data, in the format {ion_name}/t{occurrence count}/t{mean_intensity}. The code that reads this file in and creates your prediction dictionary is in utils.py:make_dictionary, line 201. You can name this file whatever you want; just make the corresponding changes in the dic.yaml configuration settings.

  2. criteria.txt: This is a list of python statements that must be true for ions to be accepted into your dictionary. If you want all ions in your ion_stats_train.txt file to be predicted, then just put the line occurs>0 at the top of this file.

  3. modifications.txt: This is simply a newline separated list of modifications, which is important for when your model reads in peptide labels and converts the modification string to the corresponding row number for its one-hot representation in the input.
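As a rough illustration of item 3, the sketch below turns a newline-separated list of modifications into a name-to-row lookup and builds a one-hot vector from it. The function names and the three modification names are hypothetical examples; how UniSpec itself parses modifications.txt and lays out its input is not shown in this thread.

```python
import io


def load_modification_index(handle):
    """Map each non-empty line to its row number, in file order.

    Hypothetical helper for illustration; not part of UniSpec.
    """
    names = [line.strip() for line in handle if line.strip()]
    return {name: row for row, name in enumerate(names)}


def one_hot(row, size):
    """Return a list of `size` zeros with a single 1 at position `row`."""
    vector = [0] * size
    vector[row] = 1
    return vector


# Stand-in for modifications.txt; these names are made-up examples.
example_file = io.StringIO("Acetyl\nCarbamidomethyl\nOxidation\n")
mod_index = load_modification_index(example_file)
oxidation_vector = one_hot(mod_index["Oxidation"], len(mod_index))
```

Because the row number comes from line order, a model trained with one modifications file would presumably need the same file, in the same order, at prediction time.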

If you create these files directly from your data, you can train right away. If you would rather use our tools from create_dataset_aidata.py to create all these input files for model training, then perhaps you could convert your list of PSMs into an msp format with a little bit of programming. It wouldn't be painless, but certainly not impossible.

494686678 commented 1 day ago

Thank you for patiently answering the questions I raised earlier. Using the methods you provided, I have successfully solved the CE conversion issue and also understood the specific meaning of the numbers in the fpostrain file. However, my current difficulty with the fpostrain file is understanding how the byte positions like "0 2596 4865 5959 7656 8378..." in fpostest.txt are obtained. In the test.txt file, I see that the starting byte position of the first spectrum is 0, and the second spectrum starts at byte position 87, which does not match the numbers in fpostest.txt. I would like to ask if the starting byte position information in the fpostrain file is read from the pre-processed MSP file?

jlapin1 commented 1 day ago

I'm not sure why your fpostest.txt has the numbers that you listed, but I just calculated them using the FPs function in utils.py with a command that looks like this:

pos, labels = E.FPs("input_data/datasets/test.txt", None)

where E is an instance of EvalObj and pos is a NumPy array of integers corresponding to the file positions. I obtained the EvalObj E by setting "mode" in predict.yaml to "interactive" and running predict.py in ipython with

exec(open("predict.py").read())

The correct fpostest.txt is attached.

To prove to yourself that it works, you could open the file

f = open("test.txt")

and then jump to the first spectrum

f.seek(pos[0])

then say f.readline(), and you should see the label for the first spectrum. You could do it again for the second spectrum

f.seek(pos[1])
f.readline()

and again for the 100th spectrum

f.seek(pos[99])
f.readline()

etc.
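If you want to cross-check a positions file without going through EvalObj, a standalone sketch along these lines may help. It is not the FPs implementation: the function name `spectrum_offsets` is hypothetical, and the assumption that every spectrum begins with a line starting with "NAME:" is made only for this toy example, so adapt the test to whatever your label lines look like.

```python
import os
import tempfile


def spectrum_offsets(path, is_label_line):
    """Return the byte position of every line accepted by is_label_line.

    The file is opened in binary mode so that positions are true byte
    offsets. Hypothetical helper; not the FPs function from utils.py.
    """
    offsets = []
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if is_label_line(line):
                offsets.append(position)
    return offsets


# Toy dataset; the "NAME:" prefix is an assumption for illustration.
toy_bytes = (
    b"NAME: AAA/2\n100.0 1.0 b1\n200.0 0.5 y1\n"
    b"NAME: CCC/3\n150.0 1.0 b2\n"
)

with tempfile.TemporaryDirectory() as folder:
    toy_path = os.path.join(folder, "toy.txt")
    with open(toy_path, "wb") as handle:
        handle.write(toy_bytes)

    toy_offsets = spectrum_offsets(
        toy_path, lambda line: line.startswith(b"NAME:")
    )

    # Jump to the second spectrum and read its label line.
    with open(toy_path, "rb") as handle:
        handle.seek(toy_offsets[1])
        second_label = handle.readline()
```

Byte positions depend on the exact file contents, including line endings, so a positions file is only valid for the dataset file it was computed from.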

fpostest.txt