odea-project / qAlgorithms

GNU General Public License v3.0

Minimal reproducible example #6

Closed LeonSaal closed 3 months ago

LeonSaal commented 4 months ago

Hello there,

a very promising concept your work is based on - I would really like to try the algorithms myself! However I am not experienced in C++.

Could you please supply a minimal reproducible example on how to use qCentroids, qBins and qPeaks with e.g. only the path to an .mzML-file as an input?

Kind regards,

Leon

GeRe87 commented 4 months ago

Hello Leon,

Thank you for your interest in our work and for reaching out! We're pleased to hear that you find the concept promising. We're currently in the process of preparing a minimal reproducible example that demonstrates how to use qCentroids, qBins, and qPeaks with an .mzML file as input, even for those who are not experienced in C++. We plan to have this example ready within this month and will keep you updated on our progress.

Kind regards,

Gerrit

YUANMENG-1 commented 4 months ago

We would also especially appreciate a tutorial on qBinning and qPeaks with mzML input.

LeonSaal commented 4 months ago

Hello,

@GeRe87 thanks, it's great that it's already in the making!

@YUANMENG-1 if I understand your question correctly, I personally wouldn't need separate tutorials, as the algorithms build on one another, i.e. qCentroids $\rightarrow$ qBins $\rightarrow$ qPeaks. However, it would be nice if intermediate results could be easily inspected and/or exported.

Kind regards,

Leon

YUANMENG-1 commented 4 months ago

Yes, I would like to be able to check the output at each step🙋‍♂️

GeRe87 commented 4 months ago

@YUANMENG-1 This is already on our roadmap. The plan is for the algorithms to be usable in combination or individually, with independent input and output.

dahoehn commented 3 months ago

I have uploaded compiled binaries for 64 bit Windows / x86 architecture from the qBinning_beta branch. The program will produce separate output files for all intermediate steps, which will only be written to disk if explicitly specified.

YUANMENG-1 commented 3 months ago

Sorry, but where can I find the qBinning_beta branch? I can only find https://github.com/GeRe87/qBinningJl

dahoehn commented 3 months ago

https://github.com/odea-project/qAlgorithms/tree/qBinning_beta

You can switch branches using the button under the project title on its GitHub page.

YUANMENG-1 commented 3 months ago

I have tried for half a day but failed to find a reasonable way to compile qBinning_beta. There are always some errors after cmake and make. Maybe that is because I am not experienced in C++. Could you please give me some instructions for installation and use?

dahoehn commented 3 months ago

Without knowing more about your specific problem, I would assume you have not set your compiler path in CMake. To do so, run "cmake -DCMAKE_CXX_COMPILER='<path to g++>'" (refer to this article: https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER.html#variable:CMAKE_%3CLANG%3E_COMPILER). You should then be able to use the command "cmake -B <directory where you want the build files> -S ." in the root directory of the project. Then, run "cmake --build <your build directory>".

There may have been a misunderstanding, however: I have already uploaded compiled binaries, so you do not need to compile it yourself to begin with. You can download qAlgorithms.exe under "releases" on our GitHub page (https://github.com/odea-project/qAlgorithms/releases).

If you want to compile yourself, please provide the error messages you are getting and which commands you are entering.

LeonSaal commented 3 months ago

Hi @dahoehn,

I tried running the compiled binary from the releases, but without success: [Screenshot 2024-08-29 120444]

I'm using Microsoft Windows 11 Pro (10.0.22631 N/A Build 22631, x64)

Is there anything I can try to get it to run?

Kind regards,

Leon

dahoehn commented 3 months ago

Hello Leon,

I have uploaded a statically compiled version of the qBinning beta branch and the three required libraries from my msys64 installation. You should be able to run the program if all four files are in the same directory. I tested it on the Windows 11 Home laptop of a colleague who did not have any C++ development environment installed, so it should work on your system as well. Please let me know if it works.

Kind regards, Daniel

LeonSaal commented 3 months ago

Hi Daniel,

Thanks for the new release! Now I can execute the binary! However, I get an error when specifying the output directory (with -o and in the guided execution):

...>qAlgorithms.exe
Enter "-h" for a complete list of options.
Enter a filename to process that file. You must select a .mzML file:    pos_ILIS_500_µgpl-r002.mzML
pos_ILIS_500_µgpl-r002.mzML

Enter the output directory or "#" to use the input directory:  #
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: Cannot convert character sequence: Illegal byte sequence

With command-line arguments (./output exists):

...>qAlgorithms.exe -i pos_ILIS_500_µgpl-r002.mzML -o output
Error: the output directory cannot be a file.

...>qAlgorithms.exe -i pos_ILIS_500_µgpl-r002.mzML -o ./output
Error: the output directory cannot be a file.

...>qAlgorithms.exe -i pos_ILIS_500_µgpl-r002.mzML -o ./output/
Warning: no output files will be written.
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: Cannot convert character sequence: Illegal byte sequence

I also tried quoting the paths and used backslashes, which did not help. Am I specifying the path wrong or are some characters not allowed? In any case I think -o output should be understood as a directory without the slashes and . in front.

Kind regards,

Leon

dahoehn commented 3 months ago

Hello Leon,

you could try using .\qAlgorithms.exe -i .\pos_ILIS_500_µgpl-r002.mzML -o .\output\ -pp

I agree that ending the directory with a slash should not be necessary; this suggestion will be integrated into the program once I find the time for it. If none of that works, try using the absolute path. It could be that I did not use the filesystem library correctly here.

If you want output files to be written, make sure to add the respective flag to your command, in this case -pp (or -printpeaks) for the final peak table to be written to a file. Also note that -i always searches your directory recursively if you give it a folder.

All characters in your filename should be fine, as long as it doesn't start with a "-". While I haven't tested other non-ASCII characters, the µ is not a problem, at least on Linux. Since Windows handles filenames as UTF-16 instead of UTF-8, this might also be a problem for you. If the command above doesn't work even with absolute paths, try removing the µ from your filenames.

I hope this helps, Daniel

YUANMENG-1 commented 3 months ago

Thank you very much for releasing the latest version and providing all the necessary DLL files—I was able to run everything successfully!

I used the following command:

.\qAlgorithms.exe -i ..\SZ22_Dataset -o .\output\ -ps -pb -pp -e

This generated three output files with the suffixes _bins, _e_peaks, and _summary.

In the peak detection results from the _e_peaks file, is dqsBin an indicator of the bin quality, and is dqspeaks an assessment of peak detection quality from qPeaks? [image]

dahoehn commented 3 months ago

I'm happy to hear everything worked as intended! Regarding the scores: dqsCen gives you information on how well the points in the profile data match a Gaussian shape. Every centroid has its own score, and the dqsCen of a peak is the mean of the scores of all centroids that were used to construct the peak. A high score means that the centroids a peak was constructed from are likely to be real signals and not noise artefacts.

dqsBin describes how well separated a mass trace is, using a modified silhouette score. For every point in the bin, we consider the mean distance in m/z to all points that are within the same bin and within three scans, as well as the distance to the closest point not in the bin which is within three scans. We use three scans since our peak finding does not interpolate larger gaps, and every point four scans or more apart is fully separated anyway. The higher the dqsBin, the more dissimilar the mass trace from which a given peak was constructed is from its direct environment. We calculate the dqsBin per centroid and take the mean for a peak. A high dqsBin correlates with reliable peaks, but we are still working on more model-specific quality parameters here. It only tells you how well separated a bin is from the rest of the dataset, not whether it contains a lot of noise or multiple mass traces. Naturally, both correlate with a worse score, but it isn't a complete description of bin quality just yet.
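As a rough illustration of the separation score described above, here is a minimal Python sketch. It uses the textbook silhouette form (b - a) / max(a, b), not the modified score used in qBinning, and the function name, the tuple layout and the handling of edge cases are assumptions made for this example only.

```python
def silhouette_like(points, bin_id, max_scan_gap=3):
    """Mean silhouette-style score of one bin.

    points: list of (mz, scan, bin) tuples (hypothetical layout).
    Only neighbours within max_scan_gap scans are considered.
    """
    members = [(mz, scan) for mz, scan, b in points if b == bin_id]
    others = [(mz, scan) for mz, scan, b in points if b != bin_id]
    scores = []
    for i, (mz, scan) in enumerate(members):
        # distances to points of the same bin within the scan window
        inner = [abs(mz - m) for j, (m, s) in enumerate(members)
                 if j != i and abs(s - scan) <= max_scan_gap]
        # distances to points outside the bin within the scan window
        outer = [abs(mz - m) for m, s in others
                 if abs(s - scan) <= max_scan_gap]
        if not inner:
            continue  # no in-bin neighbour nearby: left undefined in this sketch
        a = sum(inner) / len(inner)  # mean inner distance
        if not outer:
            # assumption of this sketch: nothing nearby outside the bin
            # is treated as fully separated
            scores.append(1.0)
            continue
        b = min(outer)  # distance to the closest point not in the bin
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores) if scores else None


# Three points of bin 0 and one foreign point 0.1 m/z away in a nearby scan
demo = [(100.000, 1, 0), (100.001, 2, 0), (100.002, 3, 0), (100.100, 2, 1)]
score = silhouette_like(demo, bin_id=0)
```

A score close to 1 means the bin is far from its nearest neighbours relative to its own spread; values near 0 or below indicate overlap with the surroundings.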

dqsPeak is determined in the same way as dqsCen, except that now the centroids within a bin are checked for a Gaussian shape. We use a model that accounts for tailing and fronting. A higher score indicates a greater reliability of the peak, which does imply the detected signal has an overall high quality.

The short answer is yes to both, but it isn't an absolute assessment. I hope this clears things up. For more information on how exactly our scores can be used to assess your result quality, refer to:

qCentroids - https://link.springer.com/article/10.1007/s00216-022-04224-y
qBinning - https://pubs.acs.org/doi/10.1021/acs.analchem.3c01079
qPeaks - https://pubs.acs.org/doi/10.1021/acs.analchem.4c00494

I am surprised to see that your dqsCen scores are 0. Could you post the output written to log_qBinning_beta.csv here? It should be in the same directory as qAlgorithms.exe.

dahoehn commented 3 months ago

@LeonSaal I tried using a µ in the filename on Windows and it does not work. This seems to be a problem with the library, so I sadly cannot offer a quick fix. If there are any non-ASCII characters in the path, try replacing them. Spaces in the full path also seem to be a problem on Windows, although neither replicates your error message.
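To check a path for the two problems mentioned above (non-ASCII characters and spaces) before passing it to qAlgorithms.exe, a small helper along these lines could be used. This is only a sketch and not part of qAlgorithms; the function name `path_problems` is made up for this example.

```python
def path_problems(path: str) -> list:
    """Return a list of warnings for a path string (empty if none found)."""
    problems = []
    # collect every distinct character outside the ASCII range
    non_ascii = sorted({ch for ch in path if ord(ch) > 127})
    if non_ascii:
        problems.append("non-ASCII characters: " + " ".join(non_ascii))
    if " " in path:
        problems.append("contains spaces")
    return problems
```

Calling it with the filename from this thread, pos_ILIS_500_µgpl-r002.mzML, reports the µ; a plain name such as sample.mzML returns an empty list.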

YUANMENG-1 commented 3 months ago

If dqsCen is 0 and the reason is not easy to find, can I use dqsPeak directly to filter out low-quality peaks? What would be a good threshold?

[image]

dahoehn commented 3 months ago

Thanks for providing the log - with the exception of the DQSC it looks normal. Which instrument did you use, and how did you convert to mzML?

YUANMENG-1 commented 3 months ago

I am using a ground-truth dataset from a QE, converted to mzML with msconvert. Oh, and the data is not originally profile data.

dahoehn commented 3 months ago

That's the issue, then: we currently need profile spectra, since our binning depends on the uncertainty measure generated during centroiding. If you supply centroids, it is set to 5 ppm. I recommend against using pre-centroided data, since the uncertainty estimated by qCentroids regularly ranges between 0.25 and 10 ppm.
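To give a feeling for what these relative uncertainties mean in absolute terms, the conversion from ppm to m/z can be sketched as follows. The function name is hypothetical and the m/z value of 500 is just an example.

```python
def ppm_to_mz(mz: float, ppm: float = 5.0) -> float:
    """Absolute m/z uncertainty for a relative uncertainty given in ppm."""
    return mz * ppm * 1e-6


# Fixed 5 ppm default versus the 0.25 to 10 ppm range quoted above, at m/z 500
fixed = ppm_to_mz(500.0)          # about 0.0025
narrow = ppm_to_mz(500.0, 0.25)   # about 0.000125
wide = ppm_to_mz(500.0, 10.0)     # about 0.005
```

At m/z 500 the fixed 5 ppm assumption is thus up to 20 times wider, or 2 times narrower, than an uncertainty estimated from profile data could be.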

(I overlooked this earlier) On your question of whether dqsPeak can be used directly to filter out low-quality peaks, and what a good threshold would be: we don't supply a threshold, since it would always depend heavily on the data and your application. As a general rule, any peak listed in the final table is, with an alpha of 0.01, a real peak within your dataset. Whether this peak corresponds to a chemical compound or not is another question. The best way to use our scores currently is as a measure of priority during your subsequent analysis.
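Using the scores as a priority rather than a hard filter could look like the sketch below, which sorts a peak table by score instead of discarding rows. The CSV layout and the column names (`mz`, `dqsPeak`) are assumptions for illustration; check the header of your own _peaks file.

```python
import csv
import io


def rank_peaks(csv_text, score_column="dqsPeak"):
    """Return all rows of a peak table, highest score first (nothing is dropped)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: float(row[score_column]), reverse=True)


# Made-up example table, not real qAlgorithms output
example = "mz,dqsPeak\n100.05,0.42\n250.10,0.91\n300.20,0.67\n"
ranked = rank_peaks(example)
```

Working down such a ranked list keeps every detected peak available while putting the most reliable ones first.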

YUANMENG-1 commented 3 months ago

So, does this mean that centroid data cannot be used with this tool?

Or can it be set to 5 ppm? Where should I add the "set to 5 ppm" option in the command .\qAlgorithms.exe -i ..\ST002454_BloodyMary21 -o .\output\ -pp -e? When I checked the help menu with -h, I didn't find an option to add the 5 ppm setting.

Also, I’d like to ask, since many recent instruments like the QE only output centroid data, if qAlgorithms only accepts profile data, wouldn't that limit its applicability? I'm also testing another peak detection tool, 3D-MSNet, which also only accepts profile data. Could this be a trend for the future?

dahoehn commented 3 months ago

As stated, it does support centroid data, but the assumed centroid error of 5 ppm is currently hard-coded. The main problem we face with centroids is that we have no information about the centroiding algorithm, and as such have limited options for maximising the correctness of our output. There are currently no user options to set any algorithm parameters, although I could add one for this case. While we do plan to keep qAlgorithms compatible with centroided data, it will not be a focus of development.

Assuming that by QE you mean the Q Exactive Orbitrap mass spectrometers offered by Thermo, the .raw file should contain the profile spectra. Using msconvert, set MS levels 1-1 as the only filter option and convert to .mzML with default settings.

LeonSaal commented 3 months ago

Hello,

@dahoehn, thanks for the tip with the µ, that error message is now gone! However I still get an error when running it :/ I tried it with the call from @YUANMENG-1 and got the following output:

>qAlgorithms.exe -i .\input\ -o .\output\ -ps -pb -pp -e
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: cannot make canonical path: Invalid argument []

Supplying the absolute path doesn't change the error message.

I put an .mzML in .\input that I converted with ProteoWizard: [image]

Any ideas where I went wrong this time?

Kind regards,

Leon

dahoehn commented 3 months ago

Hello Leon, you could try opening PowerShell and setting your encoding to UTF-8 with the command 'chcp 65001'. Then start the program with a '.\' in front of qAlgorithms.exe. Admittedly, I would be surprised if that solves it. Does the full path to your files only contain ASCII characters?

The export is correct, and if the file were damaged you'd still get a notice that it was read.

dahoehn commented 3 months ago

@LeonSaal I uploaded a new executable with a different path representation; you can try that one. I have also made it so that just entering -o output should work, as per your suggestion.

@YUANMENG-1 The new executable allows you to set a global centroid uncertainty using the -ppm flag. If you don't specify anything, it remains set to 5. Usage: qAlgorithms.exe -i <...> -o <...> -ppm 5

If the flag is set twice, the last used value is taken.
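The "default of 5, last value wins" behaviour described above can be mimicked with a few lines of Python. This is only an illustration of the semantics using Python's argparse; it is not how qAlgorithms parses its arguments, and the parser name is made up.

```python
import argparse


def parse(argv):
    """Parse a -ppm option that defaults to 5 and lets the last occurrence win."""
    parser = argparse.ArgumentParser(prog="ppm_demo")
    # a plain "store" option is overwritten by each later occurrence
    parser.add_argument("-ppm", type=float, default=5.0)
    return parser.parse_args(argv)


default_value = parse([]).ppm                           # flag not given
repeated_value = parse(["-ppm", "3", "-ppm", "7"]).ppm  # flag given twice
```

Here `default_value` is 5.0 and `repeated_value` is 7.0, matching the description of an unspecified flag and a flag that is set twice.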

LeonSaal commented 3 months ago

@dahoehn, thanks for the quick help! Using PowerShell instead of the command prompt to launch qAlgorithms solved the problem, even without setting the encoding.

Kind regards,

Leon