hhabra / metabCombiner

metabCombiner R Package: Paired Untargeted Metabolomics Feature Matching & Data Concatenation

difference in labelRows() output between v 1.2.2 and 1.5.2+ #14

Closed: Phylloxera closed this issue 2 years ago

Phylloxera commented 2 years ago

Hi,

I've been doing quite a bit of modeling with metabCombiner and noticed that after my recent BiocManager update, my old code stopped working.

Before, labelRows(method = "mzrt") worked on my combinedTable() data frame.

Now I get: Error in labelRows(ct_tp, method = "mzrt", delta = c(de1, de2, de3, de4), : must call calcScores before using this function

The former behavior seems more intuitive to me since I am not using method="scores".

I'm not sure if I should treat my old modeling results as bad or try to replicate them with certain calcScores() parameter values though I don't know what these would be.

Do you have any insight or direction for me on this question?

hhabra commented 2 years ago

Hi, generally labelRows (and the analogous reduceTable function) is used after calcScores to generate the most likely feature matches, based on score calculations. The "method" argument determines competitive matches between two or more features with respect to another feature in the complementary dataset, so by itself it doesn't reduce the list of features.

Could you tell me the sequence of commands/ script that you're using to perform your analysis?

Thanks, Hani

Phylloxera commented 2 years ago

Here is the sequence for metabCombiner v1.2.2, with arbitrarily selected parameter values for the example:

bg <- 0.00025; de1 <- 0.0007; de2 <- 5.33
de4 <- 0.24; mrterr <- 11.57; faml <- "gaussian"; tm <- 0.003; wx <- 0.01; wy <- 0.01; tq <- 0.4; rtq <- 0.9
de3 <- 0.003; prop <- 0.5; co <- 2; ifi <- 2; outl <- "MAD"
# (mrx and mry, the maxRankX/maxRankY values used below, are assigned elsewhere)

Then I get the set of unambiguous feature matches associated with those parameters, with method = "mzrt" passed to labelRows():

library(metabCombiner)
dc <- metabCombiner(xdata = mymb[[v2]][["mbdata"]], ydata = mymb[[v3]][["mbdata"]], binGap = bg)
sa <- selectAnchors(dc, windx = wx, windy = wy, tolmz = tm, tolQ = tq, tolrtq = rtq)
tp <- fit_gam(sa, fam = faml, outlier = outl, iterFilter = ifi, coef = co, prop = prop)
ct_tp <- combinedTable(tp)
tr_sc1 <- labelRows(ct_tp, method = "mzrt", delta = c(de1, de2, de3, de4), maxRankX = mrx, maxRankY = mry, maxRTerr = mrterr)
ua1 <- subset(tr_sc1, labels == "")

This results in 144 unambiguous matches that I can then evaluate downstream. I know the error message in later versions of the package is telling me that I need to call calcScores() after fit_gam(), and it looks like I can call combinedTable() after labelRows(). But it's not clear to me whether there is something like a set of default values for the parameters passed to calcScores() that would give me these same 144 matches in later versions of the metabCombiner package. I messed around a little with these parameters and looked at the documentation, but I really have no clue. Thanks for taking a look.

Aaron

hhabra commented 2 years ago

Okay, based on this script I've got a few recommendations:

1) Unless you're really looking for data of identical mass, that binGap (bg) may be a bit too small. I'd recommend setting an m/z binGap value between 0.001 and 0.01.

2) Based on some of your parameters, it looks like your retention times are measured in seconds, correct? It is my mistake for not explicitly mentioning this in the vignette/documentation, but the program's default values assume that retention times are measured in minutes. If you're willing to stick with seconds, you will have to make a few adjustments to the parameters (i.e. multiplying or dividing by 60); otherwise I'd recommend converting to minutes.

3) I'd recommend looking at an image of the fit to see (qualitatively) how well aligned the data and model are. This code will do the job (see ?plot_fit for more details):

plot(tp, main = "My Alignment", xlab = "data 1", ylab = "data 2")

4) You do need to call calcScores() before running labelRows(). This requirement may have been added after version 1.2.2. I have set default values for the arguments: A (m/z error weight) at 75, B (RT error weight) at 10, and C (relative abundance error weight) at 0.25. Depending on how well aligned your data is (see step 3), you can increase B to 15-20 (for non-identically acquired data) or even 30+ (for chromatographic methods that are close to identical). If you've got some gaps in the plot or it just doesn't align well, lower it to between 5 and 10. I will note again that these values are tuned for retention times measured in minutes.

5) If you want an automated method for reducing the table to the most accurate matches, I'd recommend the "reduceTable" function. It's a newer counterpart to labelRows() with similar arguments that performs almost exactly the same functionality, but it also "resolves" conflicting alignments. You can still use labelRows() to get the program's initial annotations as well as potential alternative possibilities, but reduceTable gives you only the 1-to-1 matches.

Code like this will do it:

tr_sc1 <- reduceTable(ct_tp, method = "mzrt", delta = c(de1, de2, de3, de4), maxRankX = mrx, maxRankY = mry, maxRTerr = mrterr)
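Putting steps 4) and 5) together, a minimal sketch using the object names from the script above (A, B, and C are the defaults stated in step 4, to be tuned per that step; mrx, mry, and mrterr as in your script):

```r
# Sketch: score the fitted metabCombiner object first, then reduce to
# 1-to-1 matches. Argument values here are the stated defaults, not a
# recommendation for your data.
cs <- calcScores(tp, A = 75, B = 10, C = 0.25)
tr_sc1 <- reduceTable(cs, maxRankX = mrx, maxRankY = mry, maxRTerr = mrterr)
```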

Please let me know if anything doesn't make sense. I'm working on improving the package and adding documentation for new functionality since the initial release.

Hani

Phylloxera commented 2 years ago

Hi Hani,

I've been modeling, so I haven't been trying to optimize parameters for a single one-off alignment, though I realize that is your main use case. Rather, I've been identifying which parameter sets give me alignments with different downstream properties (for instance, optimizing correlation among replicate QC samples). So these really are to be considered arbitrary parameters, and I don't want to focus too much on them. I'd just like to know whether it is possible to obtain the same result with metabCombiner version 1.5.2+ that I was getting with version 1.2.2.

Regarding your questions: yes, I realize small bg values give smaller alignments, but I think that is a good thing when trying to replicate results across package versions; the 144 unambiguous matches should be easier to check for reproducibility across package versions than 944 would be. And yes, my retention times are in minutes. I got paranoid when you mentioned it, so I double-checked all my new and old metabCombiner objects and my QC steps for building them; all looks good. I either read in one of your documents that RT was in minutes, or I deduced it because maxRTerr doesn't really seem to do anything useful with values > 20. plot() results look the same between metabCombiner versions, so we are reproducible there.

I tried just using calcScores(tp), i.e. just the defaults, but this gives either 77 matches (my method) or 110 matches (calcScores() defaults > reduceTable() defaults > combinedTable() defaults). So I'm still not able to replicate my results from version 1.2.2 using that method. Also, reduceTable() does not allow a method argument or a complex 4-part delta argument like in your example code; running that code gives errors like:

Error in if (!is.numeric(delta) | delta > 1 | delta < 0) stop("argument 'delta' must be a numeric value between 0 & 1") : the condition has length > 1

and

Error in reduceTable(cs, method = "mzrt", delta = c(de1, de2, de3, de4), : unused argument (method = "mzrt")

Do you think it's possible for my version 1.2.2 results to be reproduced in 1.5.2+?

hhabra commented 2 years ago

I thought you were using seconds since your mrterr (11.57) and your de2 (5.33) seemed very high for retention times measured in minutes. I guess that's fine then.

As of version 1.3.1+, labelRows/reduceTable does not run if all of the values in the score column are exactly 1. Perhaps the best workaround is to run this command before your original labelRows() call:

cs <- calc_scores(tp, A = 0.01, B = 0, C = 0)

Alternatively, replace one of the values in the score column of the combinedTable with a value between 0 and 1 (not inclusive), preferably in a row that you want to eventually discard. Either method will get the labelRows function to work without making heavy use of score calculations.

It looks like I did not add the method argument to reduceTable; thanks for catching that. I'll have to fix that before the next release. Fortunately, labelRows can perform identically to reduceTable if you set both of the 'resolveConflicts' and 'reduce' arguments to TRUE, i.e.

tr_sc1 <- labelRows(ct_tp, method = "mzrt", delta = c(de1, de2, de3, de4), maxRankX = mrx, maxRankY = mry, maxRTerr = mrterr, resolveConflicts = TRUE, reduce = TRUE)

However, if you are only looking to reproduce the previous results of 1.2.2, just keep both arguments at FALSE (default) and proceed as before. Let me know if that performs to your satisfaction.

Phylloxera commented 2 years ago

Thanks so much for probing this with me. A few observations using version 1.6.0 in R 4.2.1:

  1. Having a C value < 0.0001 results in a score value of exactly 1, which then triggers the originally reported error when attempting labelRows().
  2. Having a small C value >= 0.0001 results in variable score values near, but not exactly, 1, which results in a variable number of hits when using subset() with labels == "" (my version 1.2.2 code). No value of C that I can find reproduces the 144 matches.
  3. Assigning ct_tp$scores to a fixed value other than 1 reproduces the 144 matches when combined with my version 1.2.2 subset code, as you predicted, but not for all values between 0 and 1 non-inclusive. Rather, the result was reproduced for fixed values > 0.49 and less than 1, but also for fixed score values > 1. Any fixed score value < 0.5 results in no hits, as all rows appear to be labeled REMOVE.
  4. Lastly, I think you meant remove = TRUE, as reduce is not a labelRows() option.

My Cliffs-Notes summary so far: "minimal use of scores" still appears to use scores whenever the score column is variable, and the only apparent way to avoid a variable score column is to insert code that assigns scores to a fixed value > 0.49 but not 1. This might be desirable behavior even when method = "mzrt", but it still doesn't seem the most intuitive to me. Also, assigning resolveConflicts and remove both to TRUE does seem to fit my intentions, once I'm comfortable with the reproducibility of unambiguous match sets across versions and with how those versions treat the alternative arguments passed to the method option.

Thank you for your support. If you are willing, I'd like to leave this issue open while we probe any violations of predicted result subsets across versions. Let me know if you want me to switch away from version 1.6.0 for this testing. I may not have as much time to test over the coming few weeks due to deadlines and time off work.

hhabra commented 2 years ago

For #3 I meant to replace only a single value of ct_tp$scores with a value other than 1 (evading the program's check for calcScores usage), but I suppose anything from 0.5 to 0.9999 will reproduce your original code. The reason is that the labelRows function has an argument (minScore), intended to eliminate all feature pairs below a certain threshold, which is set to 0.5 by default. Therefore setting all values below 0.5 results in the entire set being labeled "REMOVE". Come to think of it, I should probably account for the case in which the entire table gets eliminated.

I added the scores-equal-to-1 check to ensure that the user performs the calcScores step before using the labelRows function. The scores are intrinsic and necessary to ensure that labelRows works as intended. For this exercise, we made minimal use of scores to see if your code matched up with the previous program version, but in general labelRows will not give accurate results without score calculations. The "method = mzrt" option simply determines conflicting hypotheses among high-scoring matches, particularly among features with close m/z and RT distances. However, this happens only after applying thresholds for scores and ranks.

And yes, sorry, I did mean to use remove = TRUE. You'd think I know my own program's parameters by now ;)

Do let me know if you catch any additional errors or issues to the program.

Phylloxera commented 2 years ago

So you intend that the user always uses calcScores(), and always uses a value of C >= 0.0001, regardless of the method option that will later be passed to labelRows()? Is that right?

hhabra commented 2 years ago

The first statement is correct. labelRows and the analogous reduceTable are designed to use scores, ranks, and RT prediction errors to make annotation decisions, all of which are handled within calcScores. The "method" argument controls the detection of alignment conflicts, but only after the thresholds for those three criteria are satisfied. In other words, if hypothetical feature A from dataset X is grouped with features B and C from dataset Y, the rows containing the A-B and A-C pairs have to: a) be above minScore (i.e. 0.5), b) have pairwise ranks below maxRankX and maxRankY (e.g. 3 and 3), c) have a retention time error less than maxRTerr (e.g. 0.5 minutes), and finally d) B and C must have close m/z and RT values (method = "mzrt") or a small score differential (method = "score").
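The screening order described here can be sketched as a toy base-R filter (this is an illustration only, not metabCombiner's actual code; the candidate pairs and values are hypothetical):

```r
# Toy illustration of the threshold logic: a candidate pair survives
# only if it clears minScore, both rank limits, and the RT-error limit.
# Only the surviving pairs would then enter method = "mzrt" / "score"
# conflict resolution.
cands <- data.frame(
  pair  = c("A-B", "A-C", "A-D"),   # hypothetical feature pairs
  score = c(0.81, 0.62, 0.40),
  rankX = c(1, 2, 1),
  rankY = c(1, 3, 1),
  rterr = c(0.20, 0.40, 0.10)       # RT prediction error, in minutes
)
keep <- with(cands, score >= 0.5 & rankX <= 3 & rankY <= 3 & rterr <= 0.5)
surviving <- cands$pair[keep]       # "A-D" is dropped by minScore
```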

As for the second statement, you only need one of the three values (A, B, C) to be greater than 0. You can set C = 0 if you wish. We set all three to near 0 previously to replicate the conditions of your earlier script, but these functions are not as useful when such values are used.

Phylloxera commented 2 years ago

OK, that is all really useful information, and thank you for humoring me with all my follow-ups. In a previous post you suggested the code cs <- calc_scores(tp, A = 0.01, B = 0, C = 0), which I translated to cs <- calcScores(tp, A = 0.01, B = 0, C = 0) and ran, since I think calcScores() is what you meant.

But contrary to your statement that C is allowed to be 0 so long as either A or B is not, running this code as translated resulted in a fixed metabCombiner_object@combinedTable$score value of 1, which then resulted in the original error when attempting labelRows(): Error in labelRows(cs, method = "mzrt", delta = c(de1, de2, de3, de4), : must call calcScores before using this function

It appears that labelRows() is fine with a fixed score value of 0.99 or 1.01 for all rows, though we had to introduce this by tricking the program in order to reproduce the version 1.2.2 result. But the program holds that a fixed metabCombiner_object@combinedTable$score value of 1 indicates that calcScores was not run, even though it can also indicate that calcScores was run with A = 0.01, B = 0, and C = 0, as in your suggestion... at least in my example.

However, A = 0.03, B = 0, and C = 0 did run without errors, as it seems to produce a variable score value:

cs_test <- calcScores(tp, A = 0.03, B = 0, C = 0)
lr <- labelRows(cs_test, method = "mzrt", delta = c(de1, de2, de3, de4), 
                maxRankX = mrx, maxRankY = mry, maxRTerr = mrterr)
unique(cs_test@combinedTable$score)
[1] 1.0000 0.9999

Though it did not replicate my version 1.2.2 result.

So, maybe it would be more accurate to say that you intend that all users use calcScores() with some combination of A, B, and C, that will result in variation in metabCombiner_object@combinedTable$score? is that closer to the mark?

hhabra commented 2 years ago

I think I know the reason for that. Since the program rounds score values to 4 digits, A = 0.01 may not be high enough to produce a score value below 0.99995, which would get rounded up to 1. I personally haven't experimented with setting the score weights at the values we're using for this exercise. If I'm correct, then maybe the solution is to take the floor (0.99995 -> 0.9999) as opposed to simple rounding. Let me know if you'd consider that a useful change.
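The round-versus-floor difference is easy to see in base R (0.99996 below is just a hypothetical near-1 score, chosen away from the 0.99995 halfway point):

```r
# Rounding a near-1 score to 4 digits sends it to exactly 1, which
# would trip the calcScores check; flooring at 4 digits keeps it
# strictly below 1.
s <- 0.99996
rounded <- round(s, 4)            # rounds up to 1
floored <- floor(s * 1e4) / 1e4   # truncates down to 0.9999
```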

"So, maybe it would be more accurate to say that you intend that all users use calcScores() with some combination of A, B, and C, that will result in variation in metabCombiner_object@combinedTable$score? is that closer to the mark?"

Yes, that would be correct. As it currently stands, the only "stopping" condition is that all score values are equal to 1, though I could change that to trigger whenever the variance of the score values is 0 (i.e. the same score for the entire column).
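A variance-based check could look something like this sketch (pure base R; the column values are hypothetical, and the function name is made up for illustration):

```r
# A zero-variance score column carries no information whatever the
# constant value is, so it could trigger the same "must call calcScores"
# stop that the current all-ones check does.
needs_scores <- function(s) isTRUE(all.equal(var(s), 0))

needs_scores(c(1, 1, 1))                # constant at 1: stop
needs_scores(c(0.7, 0.7, 0.7))          # constant elsewhere: also stop
needs_scores(c(1.0000, 0.9999, 0.8123)) # variable scores: proceed
```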

Phylloxera commented 2 years ago

I'm not really sure about the first question, since I'm not certain exactly what we're solving for. But I know how to reproduce my results with the new version (by tricking the package). And I know that you expect me to generate variable metabCombiner_object@combinedTable$score values whether or not I am using method = "scores". I also know that this was always the expectation, just one I was able to bypass in the earlier version because I didn't really understand the series of calculations (and probably still don't, but that's OK). So my best next step, as I see it, is to recode to meet this expectation.

I guess that's all I really need to know to move forward!

Again, thanks for humoring me!