gtluu / timsconvert

https://gtluu.github.io/timsconvert/
Apache License 2.0

slow? #62

Open animesh opened 1 month ago

animesh commented 1 month ago

I am trying to convert some raw data generated from a HeLa sample, and on average it takes about a couple of hours per file, for example:

timsconvert --chunk_size 10000000 --imzml_mode continuous --verbose --input $i 
/cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d
2024-07-25T23:34:52.358358:Initialize Bruker .dll file...
2024-07-25T23:34:53.368646:Loading input data...
2024-07-25T23:34:53.660046:Reading file: /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d
2024-07-25T23:38:29.646452:input: /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d
2024-07-25T23:38:29.646670:outdir: /cluster/work/users/ash022/veronica
2024-07-25T23:38:29.646751:outfile:
2024-07-25T23:38:29.646823:mode: centroid
2024-07-25T23:38:29.646893:compression: zlib
2024-07-25T23:38:29.646979:ms2_only: False
2024-07-25T23:38:29.647058:exclude_mobility: False
2024-07-25T23:38:29.647120:encoding: 64
2024-07-25T23:38:29.647181:barebones_metadata: False
2024-07-25T23:38:29.647241:profile_bins: 0
2024-07-25T23:38:29.647301:maldi_output_file: combined
2024-07-25T23:38:29.647360:maldi_plate_map:
2024-07-25T23:38:29.647426:imzml_mode: continuous
2024-07-25T23:38:29.647485:chunk_size: 10000000
2024-07-25T23:38:29.647546:verbose: True
2024-07-25T23:38:29.647604:version: 1.6.5
2024-07-25T23:38:29.647664:infile: /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d
2024-07-25T23:38:29.647728:.tdf file detected...
2024-07-25T23:38:29.647804:Processing LC-TIMS-MS data...
2024-07-25T23:38:29.647872:Initializing mzML Writer...
2024-07-25T23:38:30.981298:Initializing controlled vocabularies...
2024-07-25T23:38:31.433716:Writing mzML metadata...
2024-07-25T23:38:31.444200:Writing data to .mzML file /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.mzML...
2024-07-25T23:38:31.444476:Calculating number of spectra...
2024-07-25T23:38:31.501230:Parsing and writing Frame 1...
2024-07-26T01:40:50.358078:Renaming mzML file...
2024-07-26T01:40:50.361829:Finished writing to .mzML file /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.mzML...

Is this normal? If so, is there a way to speed it up? I have tried several values for --chunk_size, but going above 10e6 doesn't make much difference. The folder size is about 2 GB:

du -kh /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d
282K    /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d/7832.m/backup-2024-06-24.m
565K    /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d/7832.m
30M     /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d/2024-06-24_14-13-19_One-column-separation
1.8G    /cluster/work/users/ash022/veronica/240624_200ngHelaQC_DDAlong_Slot1-54_1_7832.d
gtluu commented 1 month ago

Hi @animesh, that speed is pretty typical for larger datasets such as yours (I'm assuming this is proteomics data?). Unfortunately, for single files I have limited ability to implement multithreading; multithreading is currently implemented so that multiple files can be converted at once, but each individual file is only accessed by a single thread. This helps with larger datasets containing multiple files, but not with single-file conversion. At this point, if you do not already have one, upgrading to faster storage would be the best solution, as the speed ultimately depends on HDD/SSD read/write speeds.
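For anyone who wants to exploit that file-level parallelism from the command line, here is a minimal sketch (not TIMSCONVERT's internal code) that launches one conversion process per .d directory; the data directory path, glob pattern, and worker count are placeholders to adapt, and only CLI flags already shown in this thread are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert_one(d_dir: Path) -> int:
    # One timsconvert process per .d directory; each file is still handled by
    # a single worker, parallelism only happens across files.
    cmd = ["timsconvert", "--chunk_size", "10000000", "--verbose",
           "--input", str(d_dir)]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    # Placeholder data directory; this only helps when several .d directories exist.
    data_dirs = sorted(Path("/cluster/work/users/ash022/veronica").glob("*.d"))
    with ThreadPoolExecutor(max_workers=4) as pool:
        for d_dir, rc in zip(data_dirs, pool.map(convert_one, data_dirs)):
            print(f"{d_dir.name}: exit code {rc}")
```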

As for the --chunk_size parameter, it controls how many scans are read into memory for conversion at once. If the value is large enough, it effectively reads in all of the scans in the current file, so the useful maximum for --chunk_size is the number of scans in the raw data.
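To illustrate the chunking behavior described above, here is a minimal sketch (hypothetical helper and illustrative numbers, not TIMSCONVERT's actual internals): scans are pulled into memory chunk_size at a time, so any chunk_size at or above the total scan count behaves the same as a single full read, which is why raising it past 10e6 made no difference for this dataset.

```python
def iter_scan_chunks(total_scans: int, chunk_size: int):
    # Yield contiguous blocks of scan indices; each block represents what
    # would be held in memory at once.
    for start in range(0, total_scans, chunk_size):
        yield range(start, min(start + chunk_size, total_scans))

# Illustrative numbers only: with ~120,000 scans, chunk_size=10,000,000 already
# produces a single chunk, so increasing it further changes nothing.
for chunk in iter_scan_chunks(total_scans=120_000, chunk_size=10_000_000):
    print(f"reading scans {chunk.start}..{chunk.stop - 1} into memory")
```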

animesh commented 1 month ago

Yes @gtluu, it is the proteome of HeLa cells 👍🏽 The SSD is not making much difference, it seems, but somehow FragPipe converts much faster, roughly in about 20 minutes? I am guessing their code is not open source, though?

gtluu commented 3 weeks ago

This may be a limitation due to the fact that TIMSCONVERT is written in Python. Other platforms such as ProteoWizard MSConvert are written in C++, and while I'm not overly familiar with FragPipe, I believe it is written in Java. I would imagine the speed difference comes from these other tools using compiled languages. You can find the source code for FragPipe here. I am not fluent in Java or C++, so I can't directly compare our implementations, but I will continue making performance improvements where I can within the capabilities of Python.