Create Library Files python script - missing step

justinjjvanderhooft commented 2 years ago

"7_create_library_files.py"

There is a small bug: the folder path_library needs to be created, e.g. by including: if not os.path.isdir(path_library): os.makedirs(path_library)
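A minimal sketch of that guard, assuming `path_library` is meant to be a directory path. The location used here is a hypothetical placeholder and `ensure_directory` is an illustrative helper, not part of the script. `os.makedirs(..., exist_ok=True)` combines the check and the creation in one call:

```python
import os
import tempfile


def ensure_directory(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical location, used only for illustration.
path_library = os.path.join(tempfile.gettempdir(), "ms2query_library_example")
ensure_directory(path_library)
```

Calling it a second time is harmless, since `exist_ok=True` suppresses the error for an existing directory.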

niekdejonge commented 2 years ago

Sorry, I missed this issue at the time. Thanks for pointing this out.

The issue is actually slightly different. The directory is created (if needed), but the function does not expect a directory; it expects a base file name, something like:

C:\Niek\ms2query_library\gnps_12_15_21

It will then create three files named:

C:\Niek\ms2query_library\gnps_12_15_21_ms2ds_embeddings.pickle
C:\Niek\ms2query_library\gnps_12_15_21_s2v_embeddings.pickle
C:\Niek\ms2query_library\gnps_12_15_21.sqlite

So instead of a directory, we expect the base of the file name, to which the different suffixes and extensions are added.
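To illustrate the naming scheme described above, here is a small sketch. The helper `expected_library_files` is hypothetical and not part of ms2query; only the three suffixes are taken from the file names listed in this comment:

```python
def expected_library_files(base_file_name):
    """Return the three file names derived from a base file name.

    Hypothetical helper for illustration only; the suffixes follow the
    example file names given in this thread.
    """
    suffixes = ("_ms2ds_embeddings.pickle", "_s2v_embeddings.pickle", ".sqlite")
    return [base_file_name + suffix for suffix in suffixes]


# Raw string so the backslashes in the Windows path are kept literally.
for file_name in expected_library_files(r"C:\Niek\ms2query_library\gnps_12_15_21"):
    print(file_name)
```

Note that the suffixes are appended to the base name as plain strings, so a base name ending in a path separator or underscore produces file names that start with that suffix inside the given folder.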

However, I agree that this is not intuitive for the user, so I will change this to specifying the directory and creating the expected files there.

Additionally, with #146 it has become a lot easier to create new library files for your own data: this is now possible with just a few lines of code, without needing to run all the notebooks.

guikool commented 2 years ago

Hi, I am posting my experience in this issue since my problem seems related. When I try to use my own library spectra, the workflow runs fine, but no files are created at the final step, and no error or traceback is shown.

library_creator = LibraryFilesCreator(library_spectra,
                                      output_directory="/content/drive/MyDrive/neg_librarylrsv092022/Alllrsv_neg_",  # For instance "data/library_data/all_GNPS_positive_mode_"
                                      ion_mode="negative",
                                      ms2ds_model_file_name="/content/drive/MyDrive/neg_librarylrsv092022/ms2ds_model_GNPS_15_12_2021.hdf5",  # The file location of the ms2ds model
                                      s2v_model_file_name="/content/drive/MyDrive/neg_librarylrsv092022/spec2vec_model_GNPS_15_12_2021.model", )  # The file location of the s2v model
library_creator.clean_up_smiles_inchi_and_inchikeys(do_pubchem_lookup=False)
library_creator.clean_peaks_and_normalise_intensities_spectra()
library_creator.remove_not_fully_annotated_spectra()
library_creator.calculate_tanimoto_scores()
library_creator.create_all_library_files()
Applying default filters to spectra:  98%|█████████▊| 502041/512922 [03:02<00:04, 2410.99it/s]

2022-09-27 10:51:24,681:WARNING:matchms:add_precursor_mz:197,9867 can't be converted to float.

WARNING:matchms:197,9867 can't be converted to float.

2022-09-27 10:51:24,686:WARNING:matchms:add_precursor_mz:No precursor_mz found in metadata.

WARNING:matchms:No precursor_mz found in metadata.
Applying default filters to spectra: 100%|██████████| 512922/512922 [03:08<00:00, 2720.55it/s]
Selecting negative mode spectra: 100%|██████████| 512922/512922 [00:00<00:00, 692109.85it/s]

From 512922 spectra, 373 are removed since they are not in negative mode

Cleaning metadata library spectra:  98%|█████████▊| 502143/512549 [21:06<00:18, 552.89it/s]

2022-09-27 11:12:37,212:WARNING:matchms:add_parent_mass:Missing precursor m/z to derive parent mass.

WARNING:matchms:Missing precursor m/z to derive parent mass.
Cleaning metadata library spectra: 100%|██████████| 512549/512549 [21:22<00:00, 399.50it/s]
Cleaning and filtering peaks library spectra: 100%|██████████| 512549/512549 [03:01<00:00, 2818.53it/s]

From 494673 spectra, 0 are removed since they are not fully annotated

Calculating fingerprints for tanimoto scores: 100%|██████████| 168039/168039 [06:13<00:00, 450.08it/s]

guikool commented 2 years ago

The result was an empty directory: /content/drive/MyDrive/neg_librarylrsv092022/Alllrsv_neg_

niekdejonge commented 2 years ago

Hi, thanks for the clear overview of the problem. Did the program finish by itself, or did you stop it before it finished completely? After the fingerprints are determined, the scores are calculated for 400,000 spectra, but this step does not print a progress bar, which might make it seem like the program has finished while it is still calculating. I will add a progress bar in the next release, so it is clear that the program is still running.

The loading bars I see (for a small test set) are:

Cleaning metadata library spectra: 100%|██████████| 100/100 [00:00<00:00, 417.27it/s]
Cleaning and filtering peaks library spectra: 100%|██████████| 100/100 [00:00<00:00, 3533.38it/s]
Calculating fingerprints for tanimoto scores: 0%| | 0/61 [00:00<?, ?it/s]
From 100 spectra, 0 are removed since they are not fully annotated
Calculating fingerprints for tanimoto scores: 100%|██████████| 61/61 [00:00<00:00, 201.11it/s]
Adding spectra to sqlite table: 100it [00:00, ?it/s]
Adding inchikey14s to sqlite table: 100%|██████████| 61/61 [00:00<00:00, 3908.00it/s]
Converting Spectrum to Spectrum_document: 100%|██████████| 100/100 [00:00<00:00, 3194.88it/s]
Calculating embeddings: 100it [00:00, 2133.04it/s]
Spectrum binning: 100%|██████████| 100/100 [00:00<00:00, 6381.50it/s]
Create BinnedSpectrum instances: 100%|██████████| 100/100 [00:00<?, ?it/s]
Calculating vectors of reference spectrums: 0%| | 0/100 [00:00<?, ?it/s]
Calculating vectors of reference spectrums: 100%|██████████| 100/100 [00:02<00:00, 39.87it/s]

Does waiting longer solve the issue for you?

niekdejonge commented 2 years ago

I added a printed message, "Calculating Tanimoto scores". Showing a progress bar for this step as well would be better, but that is complex to implement with the current implementation of matchms. An issue was created in matchms to address this, so it might be implemented in the future.

guikool commented 2 years ago

Thanks for your prompt reply. I used a Google Colab notebook. It stops just after the Tanimoto fingerprint and score calculation, and I don't see the other progress bars you show in your response (Adding spectra to sqlite table, ...). Are there any spectral metadata requirements to complete the process? I can share the notebook with you if you want to test it.

Here is a sample of my msp file:

NAME: Actinorhodin
PRECURSORMZ: 629.083
spectrumid: CHMPS387
PRECURSORTYPE: M-H
INCHIKEY: MGFJRQUGYNFFDQ-WYUUTHIRSA-N
SMILES: C[C@H]1OC@HCC2=C(O)C3=C(O)C=C(C(O)=C3C(O)=C12)C1=CC(=O)C2=C(C1=O)C(=O)C1=C(CC@@HO[C@@H]1C)C2=O
INCHI: InChI=1S/C32H26O14/c1-9-21-15(3-11(45-9)5-19(35)36)29(41)23-17(33)7-13(27(39)25(23)31(21)43)14-8-18(34)24-26(28(14)40)32(44)22-10(2)46-12(6-20(37)38)4-16(22)30(24)42/h7-12,33,39,41,43H,3-6H2,1-2H3,(H,35,36)(H,37,38)/t9-,10-,11+,12+/m1/s1
RETENTIONTIME: CCS
IONMODE: Negative
INSTRUMENT: qTof
INSTRUMENTTYPE: DI-ESI-QTOF
COMPOUNDCLASS:
ADDUCTIONNAME:
LINKS:
SOURCEDB: ALL_GNPS.msp
ORIGIN: GNPS
COLLISIONENERGY:
Molecular Formula: C32H26O14
Molar Mass: 634.5416772826582
Num Peaks: 149
197.930786 17.0
197.931046 17.0
197.931305 17.0
197.931564 17.0
...

niekdejonge commented 2 years ago

I now notice that you have quite a lot of unique inchikeys: 168039. Is this an in-house library, and does that number of unique inchikeys match your expectations?

This increase in unique inchikeys might result in memory issues in Google Colab. I tested it for up to about 20,000 unique inchikeys. 168039 is substantially more, and since the Tanimoto score is calculated between each pair of inchikeys, the size of the score matrix grows with the square of the number of inchikeys.
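A rough back-of-the-envelope sketch of that quadratic growth, assuming the scores are held as a dense n × n matrix of 8-byte floats. That storage layout is an assumption made for illustration; ms2query may store the scores differently, and `pairwise_matrix_gigabytes` is a hypothetical helper:

```python
def pairwise_matrix_gigabytes(n_items, bytes_per_value=8):
    """Estimate the size in GB (10**9 bytes) of a dense n x n score matrix.

    Assumes one value of `bytes_per_value` bytes per pair (8 = float64).
    """
    return n_items ** 2 * bytes_per_value / 1e9


for n in (20_000, 168_039):
    print(f"{n} unique inchikeys: ~{pairwise_matrix_gigabytes(n):.1f} GB")
```

Under these assumptions, about 20,000 inchikeys need roughly 3.2 GB, while 168039 inchikeys need roughly 226 GB, which is far beyond what a standard Colab session offers.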

I think this makes Google Colab crash. It might still be possible to run this on a server with more memory available than Google Colab.

Could you maybe try the workflow with a smaller spectrum file (e.g. 100 spectra), to make sure the workflow works well in Google Colab?
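One way to build such a small test set is to take only the first spectra from the full collection. This is a sketch with stand-in data; `take_first` is a hypothetical helper, and with matchms the input could be the generator returned by `matchms.importing.load_from_msp` for your own file:

```python
from itertools import islice


def take_first(spectra, n=100):
    """Return the first n items of any iterable (list or generator)."""
    return list(islice(spectra, n))


# Stand-in data for illustration. With matchms, one could instead pass
# load_from_msp("my_library.msp"), where the file name is hypothetical.
all_spectra = (f"spectrum_{i}" for i in range(512_922))
small_test_set = take_first(all_spectra, 100)
print(len(small_test_set))
```

Because `islice` stops after `n` items, the rest of a large file is never read when the input is a generator.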

If this is indeed the issue, I could have a look at some improvements to reduce the memory footprint of the generation of the Tanimoto scores.

guikool commented 2 years ago

You're probably right. I'm waiting for access to a server to test this and will post feedback asap.

guikool commented 2 years ago

I confirm, it works on Colab for 10000 spectra!! I'll use a bigger server to process my entire library. Thanks for your help.

niekdejonge commented 2 years ago

Great! I will also make a less memory-intensive implementation. This can be further discussed in #150.

iomega / ms2query

Create Library Files python script - missing step #140