shuzhao-li-lab / asari

asari, metabolomics data preprocessing

Missing peaks #45

Closed xulei99 closed 1 year ago

xulei99 commented 1 year ago

I used this QE data set (https://drive.google.com/drive/folders/1PRDIvihGFgkmErp2fWe41UR2Qs2VY_5G?usp=sharing_eip&ts=5b8ab35f). It is a standard mixture data set, and I found that some composite mass tracks with perfect peaks were not detected as peaks (they appear in neither the full nor the preferred peak table). Here are some examples. I used all the default settings and just ran this: asari process --input .\data\

[Example plots: standard_mixture_279, standard_mixture_475, standard_mixture_669, standard_mixture_147, standard_mixture_54, standard_mixture_232, standard_mixture_276]

Is this because the peaks are too wide? This data set was acquired over 2400 s, so most of its peaks are wide. But I couldn't find a threshold value for peak width in the source code.

shuzhao-li commented 1 year ago

Yes, the default parameters should be good for most studies, but not for yours. The defaults also do not use the disk cache when the number of samples is small, hence your other issue. I'm at ASMS. Josh may have a look in the meantime.

jmmitc06 commented 1 year ago

Thanks for posting this as an issue; we had been talking over email.

Glad to see my suspicion about the peak shape seems to be related. I can take a more detailed look tomorrow and provide some more detailed guidance.

jmmitc06 commented 1 year ago

I processed the provided 10 files using asari v1.11.4 after converting them to mzML with ThermoRawFileParser. SB1 was the reference sample. No sample emitted an alignment failure.

Using the Excel sheet you provided, I checked whether I could find the features you marked as "missed". Of the seven missing features, I found 4 in my results (C16H24N2, C11H8O3, C20H19N3O, C12H9ClN2O4). Arguably the C16H24N2 feature could easily be missed, since the observed retention time was off by a few seconds, but I suspect that is because our alignment shifts things slightly.

The other 3 features C20H23N, C20H31NO2 and C13H17NO did not have corresponding features detected but the peaks were seen. Two of them have a side hump, maybe that is impairing detection? C13H17NO has a flat apex.

I can investigate why these three features were not detected, but I'm not certain why I found the 4 "missed" features. Can you check that we are using the same version and that I'm using the whole dataset (I have 10 .raw files)?

shuzhao-li commented 1 year ago

There is a default wlen parameter, and its default value is not suitable for very wide peaks.
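
A minimal sketch of why a short `wlen` can hide a wide peak, assuming asari forwards `wlen` to `scipy.signal.find_peaks` (this thread does not confirm that). In SciPy, `wlen` limits the window in which a peak's prominence is evaluated, so a broad peak can look like a small bump inside a narrow window. The synthetic trace, the 3e5 prominence cutoff, and the `detect` helper are illustrative assumptions, not asari defaults.

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic mass track: one wide Gaussian peak (sigma = 40 scans) centered at scan 200.
scans = np.arange(400)
signal = 1e6 * np.exp(-0.5 * ((scans - 200) / 40.0) ** 2)


def detect(wlen, min_prominence=3e5):
    """Return indices of peaks whose prominence, evaluated within a
    window of `wlen` samples, reaches `min_prominence` (hypothetical cutoff)."""
    peaks, _ = find_peaks(signal, prominence=min_prominence, wlen=wlen)
    return peaks.tolist()


narrow = detect(wlen=25)   # window covers only the flat top of the peak
wide = detect(wlen=201)    # window reaches far down both flanks
print("wlen=25 :", narrow)
print("wlen=201:", wide)
```

With the narrow window the peak's flanks barely drop inside the evaluated region, so its prominence stays under the cutoff and nothing is reported; the wider window recovers it. A larger window is not automatically better on real data: a later comment in this thread reports that a quick test with 200 did not help, while 50 did.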

xulei99 commented 1 year ago

Yes, I also used 10 .raw files and asari v1.11.4. But I used MSConvert to convert them to mzML. SB1 was also selected as the reference sample, and no sample emitted an alignment failure for this data set.

jmmitc06 commented 1 year ago

I've never compared the output of MSConvert with that of ThermoRawFileParser. I would assume the differences in output are minor, but then I can't explain why you got different results from mine.

You should play with the settings for wlen as @shuzhao-li suggests. I did a quick test with a wlen of 200 and it did not fix the problem. I will verify that this value is being propagated properly in asari and get back to you.

jmmitc06 commented 1 year ago

A wlen value of 50 allowed me to find the C20H23N and C20H31NO2 missing features.

The previous /test/parameters.yaml file was a little out of date. It is updated now, you can see the wlen value you need to modify there.

I will keep the issue open, but I think it's now a matter of optimizing that parameter to best match your data.

I'll keep this in mind for the future, maybe there is a more clever way to do this that won't require a parameter.

xulei99 commented 1 year ago

Thank you very much, Joshua, I'll optimize this parameter and try to get these peaks

jmmitc06 commented 1 year ago

Since we have identified that this is a parameter optimization problem, I'm going to close this issue. We will keep this issue in mind moving forward, maybe we can determine a smarter way to define the peak region that will be less parameter-driven.

jmmitc06 commented 1 year ago

After our conversation last night, I'm going to reopen this issue since I understand your situation better now. Basically, you still have some missing peaks, many peaks are not 'preferred' and you want to determine the optimal parameters for your chromatography configuration in general.

jmmitc06 commented 1 year ago

Hi Lei,

Just wanted to follow up on this issue. There were two main problems you reported to me at Metabolomics2023. The first was that many peaks were in the full feature table and not the preferred feature table. The second was that of the ~900 peaks, you were still missing a few (<10).

On the first issue, the full versus preferred feature table: the partition of features between those two tables is arbitrary and based on peak "goodness". While tweaking parameters can shift features from the full table to the preferred table, at that point I would argue that you might as well use the full feature table for your comparisons / analyses. The recommendation to use the preferred table may not be ideal for your analysis. Perhaps we can clarify that in the future. However, I do believe that peaks are over-reported in datasets, so the fact that many of the peaks found by asari are deemed low-quality (i.e., put in the full table and not the preferred table) is not surprising to me.

For the second issue, the missing peaks: are these still the same missing peaks you reported when you opened the issue? At that time there were 7 missing features in the table you provided, and changing wlen recovered most of them. Are we talking about a different set of missing peaks? If so, do they have good peak shapes and good intensity? Examples would help.

jmmitc06 commented 1 year ago

Hi Lei,

Do you have any update on this issue, namely are these the same missing peaks as before?

xulei99 commented 1 year ago

Sorry for the late reply, Joshua, I just saw this. I tried different wlen values and got 831 peaks (out of 836) in the full table with a wlen of 150, but there are 47495 and 74753 features in the preferred and full tables, respectively. These are the five missing peaks and their EICs:

mz          rt
360.9916    1735.2
178.1354    1748.4
175.0392    1749
307.0461    1621.2
442.0813    1558.8

[EIC plots: 11_SB1, 229_SB4, 274_SB1, 556_SA1, 710_SA4]
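
For anyone repeating this kind of check, here is a rough sketch of how a list of expected (mz, rt) pairs can be compared against detected features. It is not the procedure used in this thread: the `match_targets` helper, the 5 ppm and 10 s tolerances, and the detected feature values are all assumptions made up for illustration; only the two target pairs are taken from the list above.

```python
def match_targets(targets, features, ppm=5.0, rt_tol=10.0):
    """Split `targets` into found and missing.

    A target counts as found when at least one detected feature lies within
    `ppm` parts-per-million in m/z and `rt_tol` (same unit as rt) in
    retention time. Both tolerances are hypothetical defaults.
    """
    found, missing = [], []
    for mz, rt in targets:
        hit = any(
            abs(f_mz - mz) / mz * 1e6 <= ppm and abs(f_rt - rt) <= rt_tol
            for f_mz, f_rt in features
        )
        (found if hit else missing).append((mz, rt))
    return found, missing


# Two expected peaks from the list above, and two synthetic detected features.
targets = [(360.9916, 1735.2), (178.1354, 1748.4)]
features = [(360.9917, 1738.0), (178.1400, 1748.0)]

found, missing = match_targets(targets, features)
print("found:  ", found)
print("missing:", missing)
```

In this toy input the first target has a feature within tolerance, while the second is off by roughly 26 ppm in m/z and is reported as missing. Because retention times can shift slightly after alignment, as noted earlier in the thread, the RT tolerance may need to be loosened for real data.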

I also checked the cSelectivity, goodness_fitting and snr values of the peaks detected in common, and they vary with different wlen values. I suggest making them settable by users, as we discussed at Metabolomics2023.

jmmitc06 commented 1 year ago

No problem Lei, we have all been busy. Thanks for the additional information. When time permits, I can take a closer look at this issue.

Regarding manually setting the parameters so that more features are in the preferred table, while I agree that the current cutoffs are arbitrary, is there a good reason you want the features to be in the preferred table instead of simply using the full table for your analysis? I know we recommend that most users use the preferred table, but that only holds true if the preferred table has stringent rules on peak quality. It is intentionally a subset of the features and we do know that in some cases it will miss real but poor quality peaks (such as those in your example plots). Simply allowing worse peaks into the preferred table would defeat the point of having a preferred and full feature table.

If you have a fault with that argument please share and we will discuss this internally as well.

Perhaps a filter function for the full feature table will be better than changing what defines a preferred feature?

jmmitc06 commented 1 year ago

Hi Lei,

Just wanted to follow up. We discussed the preferred vs. full feature table issue, and for the time being our opinion is to keep the default rules in place and leave more advanced processing, such as what you are describing, to the end user. In your case, I suggest working with the full feature table and writing your own filter to extract features at your desired level of data quality. If additional per-feature parameters need to be reported for you to do this, please let me know and I will make the changes you need.
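
A user-side filter of this kind could look roughly like the sketch below. The column names `snr`, `goodness_fitting` and `cSelectivity` are the ones mentioned in this thread; the tab-delimited layout, the `id_number` column, the threshold values, and the `filter_features` helper are assumptions for illustration and are not asari's own cutoffs. The in-memory table is synthetic.

```python
import csv
import io


def filter_features(rows, min_snr=2.0, min_goodness=0.5, min_cselectivity=0.5):
    """Keep rows that meet all three quality thresholds (hypothetical values)."""
    kept = []
    for row in rows:
        if (
            float(row["snr"]) >= min_snr
            and float(row["goodness_fitting"]) >= min_goodness
            and float(row["cSelectivity"]) >= min_cselectivity
        ):
            kept.append(row)
    return kept


# Synthetic stand-in for a full feature table; a real file would be opened
# with open(path, newline="") instead of io.StringIO.
demo_tsv = (
    "id_number\tmz\trtime\tsnr\tgoodness_fitting\tcSelectivity\n"
    "F1\t100.0000\t60.0\t25.0\t0.92\t1.0\n"
    "F2\t200.0000\t120.0\t1.5\t0.95\t1.0\n"
    "F3\t300.0000\t180.0\t12.0\t0.30\t0.8\n"
)
demo_rows = list(csv.DictReader(io.StringIO(demo_tsv), delimiter="\t"))

kept = filter_features(demo_rows)
print([row["id_number"] for row in kept])
```

Keeping the thresholds as function arguments lets each analysis choose its own quality level without changing what asari itself writes to the preferred table.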

xulei99 commented 1 year ago

Hi Joshua, sorry for my late reply. Thank you for the suggestion. I'll try that and get back to you if I need anything.

徐雷

jmmitc06 commented 1 year ago

Okay. I'm going to close this issue then. I think that we have reached a conclusion and the further improvements to the algorithm that may enable more robust peak detection are larger in scope than this issue. Of course, feel free to open (or re-open) any issue you may encounter. You also have my contact info, please don't hesitate to reach out.