ibiem-master / community


Problems Running DADA2 all the way through #15

Open · granek opened this issue 5 years ago

granek commented 5 years ago

Student: "Since Friday I haven't been able to make it through the [DADA2] pipeline without it freezing and having to restart the R session... and then having to re-run everything again because I wasn't able to save the environment.

I didn't have this issue in the past. Is it because the dataset is larger? A key hang-up point seems to be "learnErrors", which took over 12 hours to run and then caused RStudio to crash when I tried to push to GitHub after this finally went through.

Is there anything I can do to make this work better? Can I periodically save the workspace environment so that I don't lose any important values?"

granek commented 5 years ago

The larger dataset will take longer to run, but I don't think it should cause it to hang or crash (e.g. because of memory limitations). 12 hours seems like a long time, but I don't have timings for the IBIEM environment, so maybe that is par for the course. I can time running the full lemur dataset through DADA2 if you would like, but, of course, that will slow things down for anyone else running at the same time. It is possible that you are hitting times when other people are keeping the server busy - when I just checked, the server was busier than usual, but it was still well below 50% capacity, and now it is back to idling.
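If you want a rough number for your own run, something like this reports how long the learnErrors step takes. This is only a sketch: filtFs stands for whatever vector of filtered FASTQ paths you are already passing to learnErrors, and multithread = TRUE is optional.

```r
library(dada2)

# filtFs is assumed to be your existing vector of filtered FASTQ file paths.
# multithread = TRUE uses more cores (faster, but adds to the shared server load).
timing <- system.time(
  errF <- learnErrors(filtFs, multithread = TRUE)
)
timing["elapsed"]  # wall-clock seconds the learnErrors step took
```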

I am a little concerned that your environment was auto-saved during a crash, so a lot of baggage might be reloading into your environment every time you restart. A few things to try before you start a long-running session:

  1. Clear the environment by going to the Environment tab in the top right pane of RStudio and clicking on the broom.
  2. Select "Restart R" under the "Session" menu (console equivalents for steps 1 and 2 are sketched after this list).
  3. Run htop in the Terminal pane to monitor activity. When you start learnErrors the top process should shoot up to near 100% (or possibly above) in the CPU% column and stay there for several hours while it is running.
  4. If you want to be more aggressive, you can try the suggestions in issue #9 for restarting your container.
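For what it's worth, steps 1 and 2 can also be done from the console. A minimal sketch (the restart call assumes the rstudioapi package, which is available when you are running inside RStudio):

```r
# Step 1: clear the global environment (same effect as the broom icon),
# then ask R to release the freed memory back to the system.
rm(list = ls())
gc()

# Step 2: restart the R session programmatically (assumes rstudioapi is available;
# the Session > Restart R menu item does the same thing).
rstudioapi::restartSession()
```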

What are you pushing to GitHub after running learnErrors? It is a good idea to commit and push changes to your code before running something that is going to take a long time, just in case it crashes badly and takes all your code with it. If you are working on something that you aren't ready to share with your group, you might want to try making a new branch and working there until you are ready to share.
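As a sketch, that workflow looks something like the following from the R console; the branch and file names are just placeholders, and the same git commands can be typed directly in the Terminal pane instead of going through system2().

```r
# Placeholder branch/file names; substitute your own.
system2("git", c("checkout", "-b", "dada2-scratch"))    # work on a separate branch
system2("git", c("add", "dada2_pipeline.Rmd"))          # stage only your code file(s)
system2("git", c("commit", "-m", "checkpoint code before long learnErrors run"))
system2("git", c("push", "-u", "origin", "dada2-scratch"))  # push the branch to GitHub
```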

Yes, you can save "checkpoints". The method I recommend is to use write_rds from the readr package (or saveRDS from base R) to save intermediate results that have taken a long time to compute. Another approach, which I do not recommend, is to use the save and load functions of base R to save the environment to an RData file and subsequently reload it. This is less work, but it can get you into a lot of trouble. I don't know of a general R method to save results in the middle of a computation (although some packages can save checkpoints specific to the package).
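A sketch of what I mean by checkpointing with write_rds (the object and file names here are just examples):

```r
library(readr)

dir.create("checkpoints", showWarnings = FALSE)

# Right after an expensive step finishes, save its result...
write_rds(errF, "checkpoints/errF.rds")

# ...then in a later (or restarted) session, reload it instead of recomputing:
errF <- read_rds("checkpoints/errF.rds")

# Base R equivalents: saveRDS(errF, "checkpoints/errF.rds") and readRDS("checkpoints/errF.rds")
```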

KDeaton commented 5 years ago

Thanks for the advice! To clarify, it was taking a LONG time to run learnErrors, I pushed to GitHub when it was done, and then RStudio crashed while trying to push. I re-ran it all up to learnErrors again yesterday and then learnErrors finished running overnight. It seemed to push to GitHub this morning just fine. When RStudio crashed, it did not autosave the environment, so I've had to re-run everything to get the environment variables back a few times. I'll use write_rds as soon as the next chunk is finished.

KDeaton commented 5 years ago

I'm still having issues pushing to GitHub. When I try to stage files for a commit, RStudio freezes. I also can't run anything in just the terminal. Any advice?

granek commented 5 years ago

I’m wondering if you are having a git issue. What exactly are you committing to your git repo?

I recommend only committing .Rmd files, metadata, and other basic information that you need for your analysis. I think it is reasonable to include an RDS of the phyloseq object once it is done and you don’t expect to change it. Git can become slow if a repo has lots of files ("lots" probably means at least several hundred) or large files (hundreds of MB), so you should not commit large data files (e.g. FASTQs), large intermediate files (e.g. filtered FASTQs), or RData or intermediate RDS files.
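One way to keep those files out of git is a .gitignore. The patterns below are only examples of the kinds of files I mean (adjust the paths to your project); the cat() call appends to any .gitignore you already have rather than overwriting it.

```r
# Example ignore patterns for large data and intermediate files.
cat(
  "*.RData",
  "*.fastq",
  "*.fastq.gz",
  "filtered/",            # e.g. a directory of filtered FASTQs
  "checkpoints/*.rds",    # e.g. large intermediate RDS checkpoints
  "",
  file = ".gitignore", sep = "\n", append = TRUE
)
```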

If you have been committing big files to your git repo, we can clean it out. If you haven't, let me know and I can try to figure out what else might be going on.