Closed: niekdejonge closed this issue 2 years ago
Will be solved with #151
@guikool I released a new version which should be a lot less memory intensive during the Tanimoto score calculation. It is probably best to use this; please let me know if it still gives issues.
Launch started on 500K spectra in Colab...
Oops... crash. I definitely need a cluster.
AttributeError Traceback (most recent call last)
[<ipython-input-7-a084f5e74829>](https://localhost:8080/#) in <module>
7 library_creator.clean_peaks_and_normalise_intensities_spectra()
8 library_creator.remove_not_fully_annotated_spectra()
----> 9 library_creator.calculate_tanimoto_scores()
10 library_creator.create_all_library_files()
AttributeError: 'LibraryFilesCreator' object has no attribute 'calculate_tanimoto_scores'
One way to limit the size of the square Tanimoto matrix would perhaps be to keep only scores above a given threshold (0.7?).
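To illustrate the thresholding idea, here is a minimal sketch (not MS2Query's actual code; the function names and the binary-fingerprint representation are assumptions). Computing scores one row at a time means the dense n x n matrix is never allocated, and only the above-threshold pairs are kept:

```python
import numpy as np

def tanimoto_row(fp: np.ndarray, fingerprints: np.ndarray) -> np.ndarray:
    """Tanimoto (Jaccard) similarity of one binary fingerprint vs. all others."""
    intersection = (fingerprints & fp).sum(axis=1)
    union = (fingerprints | fp).sum(axis=1)
    return intersection / union

def sparse_tanimoto(fingerprints: np.ndarray, threshold: float = 0.7):
    """Yield (i, j, score) only for pairs scoring >= threshold.

    Only one row of the score matrix exists in memory at a time,
    so the full square matrix is never materialized.
    """
    for i, fp in enumerate(fingerprints):
        row = tanimoto_row(fp, fingerprints)
        for j in np.nonzero(row >= threshold)[0]:
            if j > i:  # skip the diagonal and duplicate (j, i) pairs
                yield i, int(j), float(row[j])

# Toy example with three 4-bit fingerprints:
fps = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 0, 1, 1]], dtype=bool)
pairs = list(sparse_tanimoto(fps, threshold=0.5))  # only pair (0, 1) passes
```

The trade-off is that row-by-row Python iteration is slower than one vectorized matrix call, so in practice one would process chunks of rows at a time.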
You can remove the step library_creator.calculate_tanimoto_scores(). This was changed in the new version: the Tanimoto scores are now calculated automatically in create_all_library_files().
We actually only need a fraction of the Tanimoto scores, so the memory footprint in this version should be reduced a lot (even more than keeping only scores above 0.7).
I didn't notice the script change.
It seems to work, but it is not feasible on Google Colab given the estimated calculation time:
Calculating Tanimoto scores: 1%| | 846/168039 [37:47<124:30:01, 2.68s/it]
Still, I'll run it on a strong config and let you know the results.
Dear Niek, I just benchmarked the latest version of MS2Query for library creation. On a computer with 32 GB of memory, it terminates with the following error:
tanimoto_scores = jaccard_similarity_matrix(fingerprints_1, fingerprints_2)
MemoryError: Allocation failed (probably too large).
I have access to a 256 GB workstation and will give it a try, but perhaps there is something to optimize in this part. Best regards, G.
Thanks for letting us know. This step indeed creates a large matrix (number of unique InChIKeys squared), which might therefore give memory issues. However, I never had issues with this before. How many unique InChIKeys do you have in your training spectra?
It is hard for me to change this, since this step is not needed for MS2Query itself but for training MS2Deepscore. I had a quick look at whether this could easily be changed, but it is not straightforward. I will open an issue in MS2Deepscore about this, so it might be changed in the future.
I hope it works on the 256 GB workstation.
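A quick back-of-envelope check shows why the InChIKey count matters so much: a dense n x n matrix of float64 scores needs 8 * n^2 bytes. (Treating the ~168k iterations from the progress bar above as the number of unique InChIKeys is an assumption for illustration.)

```python
def dense_matrix_gib(n_inchikeys: int, bytes_per_score: int = 8) -> float:
    """Memory (GiB) needed for a dense n x n float64 score matrix."""
    return n_inchikeys ** 2 * bytes_per_score / 2 ** 30

# ~168k unique keys would need roughly 210 GiB, far beyond 32 GB
# and close to the limit of even a 256 GB workstation:
print(f"{dense_matrix_gib(168_039):.0f} GiB")
# ~50k unique keys fits comfortably in ordinary workstation memory:
print(f"{dense_matrix_gib(50_000):.0f} GiB")
```

This is consistent with the allocation failure on 32 GB and with the later report that the run succeeds after cutting the library down to under 50K unique InChIKeys.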
I've encountered another issue on the workstation, but it is related to the Python install. For the record, I'm trying the library creation without a model on 500K spectra... I'll give it a try on the university cluster and let you know.
I finally removed all in-silico spectra from my in-house library and now work with fewer than 50K unique InChIKeys. No problems so far; library creation works really well and fast. In the results.csv, although the score from the model is important, it could also be useful to have the dot product between the cropped spectra of the library analog and the experimental MS/MS query.
Great to hear that it works well now! Thanks for the suggestion to add the dot product; this might indeed be a useful addition. However, my concern is that it might confuse some users about which score they should trust. I will open a separate issue to discuss whether we want to add this to the results.
Currently a full matrix of Tanimoto scores is generated, even though only the 10 highest Tanimoto scores per compound are needed for MS2Query.
Suggested change: do not store the entire Tanimoto score matrix; instead store only the 10 highest Tanimoto scores per compound and pass these to the SQLite file generator.
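The suggested change could be sketched as follows (an illustration under assumed binary fingerprints, not the actual MS2Query implementation): compute the scores in row chunks, keep only each row's top k, and discard the chunk before moving on, so at most a (chunk x n) slice of the matrix exists at a time.

```python
import numpy as np

def top_k_tanimoto(fingerprints: np.ndarray, k: int = 10, chunk: int = 1000):
    """Top-k Tanimoto scores per fingerprint without the full n x n matrix.

    Returns (indices, scores): for each row, the column indices and scores
    of its k best matches, sorted from highest to lowest. Assumes every
    fingerprint has at least one bit set (so unions are never zero).
    """
    fps = fingerprints.astype(np.int64)
    counts = fps.sum(axis=1)                 # bits set per fingerprint
    n = len(fps)
    k = min(k, n)
    top_idx = np.empty((n, k), dtype=np.int64)
    top_scores = np.empty((n, k))
    for start in range(0, n, chunk):
        block = fps[start:start + chunk]
        inter = block @ fps.T                # pairwise intersection counts
        union = counts[start:start + chunk, None] + counts[None, :] - inter
        scores = inter / union               # only a (chunk x n) slice in memory
        # Select the k largest per row, then sort them in descending order:
        part = np.argpartition(scores, -k, axis=1)[:, -k:]
        part_scores = np.take_along_axis(scores, part, axis=1)
        order = np.argsort(-part_scores, axis=1)
        top_idx[start:start + chunk] = np.take_along_axis(part, order, axis=1)
        top_scores[start:start + chunk] = np.take_along_axis(part_scores, order, axis=1)
    return top_idx, top_scores
```

Peak memory then scales with chunk * n instead of n * n, and the (n x k) result is small enough to hand straight to the SQLite file generator. Note that each row's best match is itself (score 1.0), so the diagonal may need to be excluded depending on what the downstream step expects.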