rickhelmus / patRoon

Workflow solutions for mass-spectrometry based non-target analysis.
https://rickhelmus.github.io/patRoon/
GNU General Public License v3.0

Componentization #71

Closed CorinaMeyer closed 1 year ago

CorinaMeyer commented 1 year ago

Dear Rick,

I just started to use patRoon. Everything worked very well up to the componentization step. For componentization, however, depending on the algorithm I either get errors, or the console reports that it is finished but the processing does not stop.

For CliqueMS I get the following error:

Annotating all features with CliqueMS for 51 analyses ... Exporting to XCMS features... Done! | | 0%
Error in Matrix::tril(adjmatrix, -1) : 'k2' must be an integer from -Dim[1] to Dim[2]

For RAMClustR it counts through all components and ends with "... 1620 of 1637 1630 of 1637 finished", but although it says finished, R still seems busy.

It is similar for CAMERA, where I end up with "Calculating possible adducts in 8020 Groups... % finished: 10 20 30 40 50 60 70 80 90 100" and nothing more happens, not even after several minutes of waiting.

I have attached my workflow: workflow_patRoon.txt

Best, Corina

rickhelmus commented 1 year ago

Hello Corina,

For CliqueMS: I submitted a few fixes some time ago that could be related to this. Unfortunately, they haven't been integrated by the author so far. The handbook suggests installing the version with my patches; maybe you could (re-)install this version just to be safe?

remotes::install_github("rickhelmus/cliqueMS")

For the rest, it sounds like the componentization algorithms have a lot of data to chew on. Maybe they simply haven't finished yet, even if things appear to be stuck? How many feature groups are you dealing with? Perhaps you could try stricter filters to reduce the number, for instance by increasing the intensity threshold during the feature group filter step. That would at least confirm whether componentization is working at all.
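As a rough sketch, raising the intensity threshold could look like this (the absMinIntensity argument is how I recall the filter() interface; the threshold value is only an illustrative guess, so please check ?filter and tune it to your data):

```r
# Keep only feature groups whose features exceed a (high) absolute intensity
# threshold, to reduce the data passed to componentization.
# 1E5 is just an example value; adjust for your instrument's intensity scale.
fGroups <- filter(fGroups, absMinIntensity = 1E5)
```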

Thanks, Rick

CorinaMeyer commented 1 year ago

Dear Rick,

Thanks for your quick answer. For CliqueMS I still get the same error. I started RAMClustR again, but set a very high intensity filter beforehand. Now I have 54'685 features in 2'970 groups. Is this a lot? From my experience with non-target data analysis of untreated wastewater this is not too much, but I guess it is a lot in comparison to groundwater or other samples with significantly less matrix. It seems to me that it is the caching that keeps R busy: the cache.sqlite file is constantly increasing in size. Do you have any experience with how big this cache file gets with my roughly 3'000 groups? Just to get an estimate of the duration.

Thanks again, Corina


rickhelmus commented 1 year ago

Hi Corina,

About 3000 feature groups after the filter() step is a decent amount, but nothing too extreme, I would say. How many analyses are you processing? I just ran the componentization with the unfiltered demo data (~1500 feature groups, ~8000 features) and it took my system about 2 minutes. So I guess it's not strange that with your feature numbers you would need to wait a bit :-)

Since you mention you applied a very high intensity threshold now, does that mean you previously had way more feature groups? I would say that having much more than a couple of thousand feature groups after filtering may lead to very long processing times in some steps.

For the cache: yes, it is normal that this grows quickly, but of course only within reasonable bounds. What kind of file sizes are we talking about with your workflow? With the demo data test run my cache file was ~300 MB, although I have seen it grow to a few GB with more realistic projects.
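If the cache file does get out of hand, it can be cleared from R; patRoon has a clearCache() helper for this (the call below is how I remember the interface, see ?clearCache to confirm):

```r
# Remove all cached results; the cache.sqlite file will be rebuilt as needed.
# Calling clearCache() without arguments should list the available categories,
# so you can also clear only specific ones.
clearCache("all")
```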

Thanks, Rick

CorinaMeyer commented 1 year ago

Hi Rick,

I was processing 51 files and had around 50'000 features. The componentization with RAMClustR took several hours, and the cache file had grown to more than 200 GB when I stopped it before it finished, because I do not have that much space. Now I am running with 18 files and the high intensity filter; the cache is now about 5 GB. Currently I am running the generateFormulas function, but with more than the default "CHNOP", namely "CHNOPSFClBrI". However, this has now been running for more than 24 h and has only progressed to 66%. What also confused me is that it went to 41% within a few minutes but then took many hours to get from 41% to 66%. So the progress bar is not proportional to time, which makes it very hard to estimate how long it will take.

I am wondering what I am doing wrong that it takes so much time and storage space with my dataset. Any idea would be appreciated.

Best, Corina


rickhelmus commented 1 year ago

Hi Corina,

Thank you for the additional details.

You are right that so many hours of processing and a cache file of hundreds of GB is very unreasonable! The 50'000 features you mentioned: were they grouped features (feature groups) or raw features (e.g. what findFeatures() returns)? And if they were feature groups, was this after the filter() step? If the answer to both is 'yes', I understand what is happening ;-) These amounts would simply be way too much to work with (I guess it would have been nice if patRoon had warned you about this). Are you by any chance working with Orbitrap data? I noticed that with Orbitraps the intensity scale is quite different and some more careful optimization is needed. You can have a look at the suspect screening workflow in the patRoon paper for some inspiration.

The situation with GenForm is a bit tricky. You are right that the progress bar is not linear. The reason is that the calculations generally go from low to high m/z, and GenForm unfortunately tends to produce a lot of candidates for higher feature masses, especially with additional elements specified. There are a few things you can try to remedy this:

  1. Reduce the maximum timeout: by default there is a timeout of 2 minutes for each calculation. You can lower this by setting the timeout argument of generateFormulas() (in seconds).
  2. Set maxCandidates: similar to the timeout option, calculations are stopped when GenForm reaches a maximum number of candidates. The difference with the timeout option is that it still proceeds with the candidates that were generated before the threshold was reached.
  3. Set calculateFeatures to FALSE. This way calculations are done on the feature group level instead of per feature. This might be slightly less accurate, but in my experience it still works very well, and I am considering making this the default as it often makes more sense.
  4. Reduce the number of possible elements. Obviously, your study needs to allow this... but every element you leave out can be a huge gain.
  5. Limit the feature masses you look at. For instance, maybe you are only interested in masses up to 600 m/z? In that case you could first run a filter step on the feature groups, e.g. filter(fGroups, mzRange = c(0, 600)).
  6. Try SIRIUS instead of GenForm. I think it might be more robust for these kinds of scenarios.
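Combining several of the points above might look roughly like this sketch (the mslists object stands for your MS peak lists, the element set and numeric values are only example choices, and the exact argument defaults may differ, so check ?generateFormulas):

```r
# Point 5: restrict feature groups to m/z <= 600 before formula calculation.
fGroups <- filter(fGroups, mzRange = c(0, 600))

# Points 1-4: GenForm with a shorter timeout, a candidate cap, group-level
# calculations and a reduced element set.
formulas <- generateFormulas(fGroups, mslists, "genform",
                             elements = "CHNOPS",       # fewer elements than CHNOPSFClBrI
                             calculateFeatures = FALSE, # calculate per feature group
                             timeout = 30,              # seconds per calculation
                             maxCandidates = 10000)     # stop early, keep candidates so far
```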

Hope this helps!

rickhelmus commented 1 year ago

Closed due to inactivity, feel free to re-open!