Memory footprint creating large tanimoto score files

iomega / ms2query

MS2Query - machine learning assisted library querying of MS/MS spectra

Apache License 2.0

Memory footprint creating large tanimoto score files #150

Closed niekdejonge closed 2 years ago

niekdejonge commented 2 years ago

Currently a full matrix of Tanimoto scores is generated. However, MS2Query only needs the 10 highest Tanimoto scores.

Suggested change: do not store the entire matrix of Tanimoto scores; store only the 10 highest Tanimoto scores and pass these to the SQLite file generator.
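The suggestion above can be illustrated with a minimal sketch. It is not the MS2Query implementation: it assumes that "top 10" means the 10 best-scoring other compounds per compound, that fingerprints are stored as Python integers used as bitsets, and that self-comparisons are skipped. The names `tanimoto`, `top_k_tanimoto` and `toy_fingerprints` are hypothetical.

```python
import heapq


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints stored as integer bitsets."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(fp_a & fp_b).count("1") / union


def top_k_tanimoto(fingerprints, k=10):
    """Return, per key, the k highest (score, other_key) pairs.

    Only k pairs per key are kept in memory; the full n x n matrix is never built.
    """
    top_scores = {}
    for key_a, fp_a in fingerprints.items():
        scored = (
            (tanimoto(fp_a, fp_b), key_b)
            for key_b, fp_b in fingerprints.items()
            if key_b != key_a  # assumption: self-comparisons are skipped
        )
        top_scores[key_a] = heapq.nlargest(k, scored)
    return top_scores


# Hypothetical toy fingerprints, written as bit patterns.
toy_fingerprints = {"A": 0b1111, "B": 0b0111, "C": 0b0001, "D": 0b1000}
top_2 = top_k_tanimoto(toy_fingerprints, k=2)
print(top_2["A"])
```

This keeps memory proportional to n × k instead of n², but the number of pairwise comparisons (and therefore the run time) is still quadratic in the number of fingerprints.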

niekdejonge commented 2 years ago

Will be solved with #151

niekdejonge commented 2 years ago

@guikool I released a new version that should use much less memory during the Tanimoto score calculation. It is probably best to use this version; please let me know if it still gives issues.

guikool commented 2 years ago

Run started on 500K spectra in Colab...

guikool commented 2 years ago

Oops, crash. I definitely need a cluster.

AttributeError                            Traceback (most recent call last)

<ipython-input-7-a084f5e74829> in <module>
      7 library_creator.clean_peaks_and_normalise_intensities_spectra()
      8 library_creator.remove_not_fully_annotated_spectra()
----> 9 library_creator.calculate_tanimoto_scores()
     10 library_creator.create_all_library_files()

AttributeError: 'LibraryFilesCreator' object has no attribute 'calculate_tanimoto_scores'

guikool commented 2 years ago

One way to limit the size of the square Tanimoto matrix is perhaps to keep only Tanimoto scores above a given threshold (0.7?).
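As a rough sketch of this threshold idea (not something MS2Query is known to do), the pairs could be stored sparsely, keeping only scores at or above the cutoff. The function names, the integer-bitset fingerprints and the toy data are assumptions for illustration only.

```python
from itertools import combinations


def tanimoto_bits(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integer bitsets."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(fp_a & fp_b).count("1") / union


def scores_above_threshold(fingerprints, threshold=0.7):
    """Keep only the unordered pairs whose Tanimoto score is >= threshold."""
    kept = {}
    for (key_a, fp_a), (key_b, fp_b) in combinations(fingerprints.items(), 2):
        score = tanimoto_bits(fp_a, fp_b)
        if score >= threshold:
            kept[(key_a, key_b)] = score
    return kept


# Hypothetical toy fingerprints, written as bit patterns.
example_fingerprints = {"A": 0b1111, "B": 0b0111, "C": 0b0001, "D": 0b1000}
sparse_scores = scores_above_threshold(example_fingerprints, threshold=0.7)
print(sparse_scores)
```

How much memory this saves depends on how many pairs pass the cutoff, so unlike a fixed top-k per compound it gives no upper bound on the stored size.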

niekdejonge commented 2 years ago

You can remove the step library_creator.calculate_tanimoto_scores(). This was changed in the new version: the Tanimoto scores are now calculated automatically in create_all_library_files().

We actually only need a fraction of the Tanimoto scores, so the memory footprint in this version should be reduced a lot (even more than by keeping only scores above 0.7).

guikool commented 2 years ago

I didn't notice the script change. It seems to work, but it is not feasible on Google Colab because of the very long estimated calculation time: Calculating Tanimoto scores: 1%| | 846/168039 [37:47<124:30:01, 2.68s/it]. Still, I'll run it on a more powerful machine and let you know the results.

guikool commented 1 year ago

Dear Niek, I just benchmarked the latest version of MS2Query for library creation. On a computer with 32 GB of memory, it terminates with the following error:

tanimoto_scores = jaccard_similarity_matrix(fingerprints_1, fingerprints_2)
MemoryError: Allocation failed (probably too large).

I have access to a 256 GB workstation and will give it a try, but perhaps there is something to optimize in this part. Best regards, G.

niekdejonge commented 1 year ago

Thanks for letting us know. This step does indeed create a large matrix (number of unique InChIKeys squared), which can cause memory issues. However, I never had issues with this before. How many unique InChIKeys did you have in your training spectra?

It is hard for me to change this, since this step is not needed for MS2Query but is needed for training MS2Deepscore. I had a quick look at whether it could be changed easily, but it is not straightforward. I will open an issue in MS2Deepscore about this, so it might be changed in the future.

I hope it works on the 256 GB workstation.
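To make the "number of unique InChIKeys squared" scaling concrete, here is a back-of-the-envelope estimate. It assumes a dense matrix with 8 bytes per score (float64), which the thread does not confirm. The value 168,039 is the total shown in the progress bar quoted earlier and is only assumed to be the number of unique InChIKeys; 50,000 is a round illustrative size. The function name is hypothetical.

```python
def dense_matrix_bytes(n_items: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_items x n_items score matrix."""
    return n_items * n_items * bytes_per_value


# Assumed sizes: a round illustrative value and the progress-bar total from this thread.
for n_inchikeys in (50_000, 168_039):
    gigabytes = dense_matrix_bytes(n_inchikeys) / 1e9
    print(f"{n_inchikeys} InChIKeys -> {gigabytes:.1f} GB")
```

Under these assumptions, 50,000 InChIKeys need about 20 GB and 168,039 need about 226 GB, before counting any temporary copies made during the calculation.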

guikool commented 1 year ago

I've encountered another issue on the workstation, but it is related to the Python installation. For the record, I'm trying the library creation without a model on 500K spectra... I'll give it a try on the university cluster and let you know.

guikool commented 1 year ago

I finally removed all in-silico spectra from my in-house library and now work with fewer than 50K unique InChIKeys. No problems so far; library creation works really well and fast. In results.csv, although the score from the model is important, it could also be useful to have the dot product of the cropped spectra between the library analog and the experimental MS/MS query.

niekdejonge commented 1 year ago

Great to hear that it works well now! Thanks for the suggestion to add the dot product. This might indeed be a useful addition. However, my concern is that it might confuse some users about which score they should trust. I will open a separate issue to discuss whether we want to add this to the results.