statdivlab / enviromtx

A package to analyze environmental metatranscriptomics data
Other
1 stars 1 forks source link

code hangs? #8

Closed adw96 closed 4 months ago

adw96 commented 5 months ago

Hi @svteichman ! Would you please be able to help @zbartolek with the following issue:

This code runs, but never terminates -- not sure why

load_all()
library(tidyverse)
df <- read_csv("/Users/adwillis/Downloads/example5.csv")
fit_mgx_model(df$counts_sum, df$xstar, df$xx,
              formula = ~ temp,
              enviro_df = df, 
              replicates=df$sample_name)

example5.csv

Thanks so much for any insights you have!

svteichman commented 5 months ago

I've looked into this, and I've traced the error into the geepack package, from line 104 of the geese.fit() function. This line is .Call("gee_rap", y, x, offset, soffset, weights, linkwaves, zsca, zcor, corp, clusz, geestr, corr, param, control, PACKAGE = "geepack"), which is passing R code into C to fit the gee, and this is unfortunately the limit of my debugging ability. What I believe is happening is that the optimization is being done in C (after all the set-up is done in the R function), and my best guess is that for this example convergence is never hit and for some reason there is no error result thrown. It seems similar to what is described in this example here.

To move forward, one option is to open an issue in geepack, but I checked their github and there are several open issues from the last year, so I'm not sure whether it would be resolved. @adw96 do you have any ideas about how to resolve this for the specific example that @zbartolek is trying to run? Is there a different function that we could replace the call to geeglm with that would do something equivalent? I could try using the gee package instead of geeglm to run the gee optimization for this data analysis. Although gee doesn't seem to give us an option to use jackknife se's.

adw96 commented 5 months ago

Thanks so much, @svteichman ! Great work tracing this back to its core.

@zbartolek -- this is the only example of this phenomenon that you've seen so far, correct? How many examples have you looked at? If it looks to be very common than we might have to take it further. I really don't want to have to (eg) write our own backup in optim.

@svteichman would you mind opening an issue in geepack? Alternatively, feel free to draft it for me to open. It's very important for it to be fully and easily reproducible (eg by using lots of dput calls), since these maintainers probably got a lot of random questions amongst their "issues", and they will prioritize the ones that are easy to address.

zbartolek commented 5 months ago

Thanks so much for looking into this @svteichman and @adw96 ! So far this is the only example I saw of this, I ran ~2000 tests for now. I will try to run on a larger dataset tomorrow to see if this is prevalent. For now, do you have a suggestion on how to terminate the run or skip if a hang like this happens again? Right now I'm running this code in a loop over many examples, and it would be nice if I could continue to the next example automatically if there is a hang. Thanks so much!!

svteichman commented 5 months ago

@zbartolek That's a great question. I tried to use the function withTimeout() with hopes of using this to terminate the example above that hangs but it didn't work. If that isn't able to terminate it, I'm not sure what would be able to, especially because I think that the issue is something that is getting stuck in the C code called by R. It's possible that there is a better way, but not one that I can come up with.

The best that I can think of is running the loop that you want, writing the results to a file frequently (every iteration, every hundred iterations, whatever makes sense to you), and then if/when it gets stuck, removing that data example from the set that you're running and re-starting where it is since you've last saved.

zbartolek commented 5 months ago

@svteichman Yes, I also tried the withTimeout function without success. Sounds good, Il will try to organize the code for now so that it saves frequently... will let you know if I run into any issues. Thanks a lot for your help with this!

zbartolek commented 5 months ago

I've been trying to figure out why this particular example failed to converge, and I think it could be because of perfect separation or quasi-perfect separation between the count variable and the sample name variable, since now we group samples to account for replicates. As far as I understand, this problem is not uncommon if you have a categorical variable (in our case sample name) and a small sample size (also true for our case).

@svteichman and @adw96 do you think this reasoning is correct? To try to solve this for now without having to re-start a bunch of failed R runs, I thought it might be possible to test for perfect or quasi perfect separation prior to running the model, and for now remove those examples. Do you have any suggestions on how to test for this?

I was also reading about ways to deal with this more systematically, like using a penalized likelihood estimation. I don't have a good idea of how easy it would be to implement something like that or if it is even appropriate for our case.

Thanks a lot!

svteichman commented 5 months ago

@zbartolek I think you're probably right, separation could cause infinite parameter estimates and therefore code that hangs forever. I'll check on this on your dataset when I get a chance later today, but there is an R package detectseparation that you can use to check for this. I think this is a great idea to test for separation in raoBust (this is the statdivlab R package that is doing the testing, called from enviromtx), I'll add that to my list of things to do in the medium term.

The two common ways that I know of to deal with separation are (1) identify the separation, say that what you're testing is perfectly separated and therefore an interesting example that you cannot estimate parameters for, and move on, or (2, as you suggested!) use a penalty so that parameter estimates are always finite. Adding this kind of penalty would be a larger modification for this method, so I'll leave it up to @adw96 to decide if she thinks that is a good idea.

svteichman commented 5 months ago

The more that I think about this, the more I think it's slightly more complicated in the GEE case compared to the GLM case (and the R package that I sent is for categorical responses and binomial response GLMS). I'll spend some more time thinking this over and doing some experimenting and get back to you @zbartolek!

zbartolek commented 5 months ago

Thanks a lot @svteichman I'm having some issues installing detectseparation package, I think there is a dependency for lpSolveAPI that is causing issues so I wasn't able to try it out yet. But yes, I did notice it was made for binomial... thanks a lot for all your work, let me know if I can help in any way!

svteichman commented 5 months ago

@zbartolek do you know if there are any other examples in your dataset in which all of the response values for a given cluster are 0 (as we see in this example for several of the clusters)?

svteichman commented 5 months ago

I suspect that it is in fact an issue with separation by the clusters, because I was able to run a verbose version of the code, and the estimated $\beta$ values from geepack::geeglm (which is called by fit_mgx_model) get very large just before the infinite loop starts, and estimated $\beta$ values that diverge to infinity is a sign of separation. I'll work on coming up with a way to check for this type of cluster separation in data (the same thing with the code hanging happens when I replace 0's in the response with small numbers, so it isn't just perfect separation but also quasi-separation that causes problems) and then implementing this as a check in enviromtx.

zbartolek commented 5 months ago

@svteichman Here are a few other examples:

These examples work with the model and converge (examples1-4): example1.csv example2.csv example3.csv example4.csv

These examples are all the ones I have been able to find so far that don't converge (examples5-7): example5.csv example6.csv example7.csv example7 does not converge, but also doesn't result in a code hang.

I played around with the detectseparation package with their detectSeparation() function, but I think because it's made for binomial glm it didn't work well, even for example5, it predicted no separation.

Thanks!!

svteichman commented 5 months ago

This is really helpful! Interestingly, all of these examples have some cases for which there are whole clusters with response values of 0 for each observation, so that on it's own isn't what is causing the problem, since examples 1-4 run fine. Nothing is really jumping out to me about how the data in examples 1-4 are different than those in examples 5-7, but I'll do some more digging into it and see if I can find anything.

svteichman commented 5 months ago

Hi @zbartolek, I wanted to check back in with you. I haven't been able to figure out any characteristics of these data examples that separate examples 1-4 from 5-7, so I'm not sure what is causing this issue. I'm drafting an issue to open in geepack about it, but was wondering if there's anything else that I could do to help you in the short term while this issue is still happening. Were you able to run the rest of your analyses?

zbartolek commented 5 months ago

Hi @svteichman, thanks for checking in! I'm currently trying to figure out how prevalent the problem is, so running the dataset kind of like you suggested, separating it out and then removing problematic cases. Il let you know once I figure out how prevalent the problem is (we have a huge dataset so it's taking me some time). Thanks for opening the issue on geepack, I'm also not sure what more to do about fixing it right now. Thanks!

svteichman commented 5 months ago

@zbartolek I was able to add an error in to enviromtx so that the code stops before running into an infinite loop. This hasn't been added to the package yet because I want to chat with Amy about it, but if you want a version that will have an error instead of the code hanging you can install enviromtx and raoBust from the forks from my account.

remotes::install_github("svteichman/raoBust')
remotes::install_github("svteichman/enviromtx") 

Hopefully this will make your workflow a little bit easier! You can wrap each call to fit_mgx_model in try({ mod <- fit_mgx_model(...) }) and if it errors then it won't stop whatever loop or apply statement that you might be running over your set of examples.

We'll update you once we have a more permanent solution (or figure out why these specific examples are having this problem).

zbartolek commented 5 months ago

Hi @svteichman Thanks a lot! I will pull your version, an error message is much easier to deal with than a hang :)

zbartolek commented 5 months ago

Hi @svteichman , I'm trying to run your version of the code, and it's working a lot better, I'm seeing a lot of examples that otherwise would have resulted in a code hang that are now getting skipped. I did still run into a code hang with this example (example8.csv), do you know why that might be happening? Thanks a lot!

svteichman commented 5 months ago

I imagine it's because what I'm doing is running another similar package, seeing if it fails with an error, and if so stopping fit_mgx_model before it tries to use the package that has code that hangs. However, I haven't looked too closely under the hood and they are probably fitting the models differently. My guess is that a lot of these examples that would fail with the code-hanging package will also fail with the package that provides an error, so it'll cut down on the examples that hang, but that because they are fitting the models differently there are some examples that won't error in the first package but will have an issue that causes code to hang in the second.

zbartolek commented 5 months ago

Ah, yes, that makes sense. So far on the data that I have tried to run, I find the no convergence issue about 1 in every 100 fits, which is quite a lot I think. Your workaround solved it for most cases, but not all, so Il try to do a more manual selection of those now. Thanks!

svteichman commented 5 months ago

This is good to know. Thanks!

On Fri, Mar 29, 2024 at 3:09 PM zbartolek @.***> wrote:

Ah, yes, that makes sense. So far on the data that I have tried to run, I find the no convergence issue about 1 in every 100 fits, which is quite a lot I think. Your workaround solved it for most cases, but not all, so Il try to do a more manual selection of those now. Thanks!

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/statdivlab/enviromtx/issues/8*issuecomment-2027781962__;Iw!!K-Hz7m0Vt54!iuex6XXbrKy0KEeHztEwKMk__WbgrMCg6nPEp5Tfh2tGXyzPEeh3NYhFVhEqFdEuIdUzLN4ypfcaTDPAlRpjgA$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEDVQ7K4JZNVNHHLDLPKUOTY2XRCPAVCNFSM6AAAAABEY3IE5KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRXG44DCOJWGI__;!!K-Hz7m0Vt54!iuex6XXbrKy0KEeHztEwKMk__WbgrMCg6nPEp5Tfh2tGXyzPEeh3NYhFVhEqFdEuIdUzLN4ypfcaTDOs6Lm-ig$ . You are receiving this because you were mentioned.Message ID: @.***>

svteichman commented 4 months ago

@zbartolek, we've updated raoBust to now only use geeasy and never use geepack, so we're hoping that the code should stop hanging (what we lose are the standard errors that we prefer, but we're going to work on coming up with a solution to this). If you re-install enviromtx with remotes::install_github("statdivlab/enviromtx"), we are hoping that the code will only error now where it would've hung before. If this isn't the case, please let us know! I'm going to close this issue (and open a new one in which we will investigate why the errors are caused), but if it hangs again feel free to reopen.

adw96 commented 4 months ago

Huge thanks, @svteichman !!!!! I really appreciate this.

@zbartolek Let us know if the code still hangs for you for any combinations, and please share those cases with us so we can work on them for you!

zbartolek commented 4 months ago

Wow, that's quite amazing @svteichman and @adw96, thanks a lot for your work on this! I will test this out tonight and let you know tomorrow if I still see any hangs!

zbartolek commented 4 months ago

Seems to be working well for now, thanks!! I get an error message for those that can't converge! Started running on a larger portion of the data, thanks so much @svteichman for your help with this!