mani2012 / PathoStat

The purpose of this package is to perform statistical analysis on PathoScope-generated report files.

Enabled caching of taxonomy data #3

Closed mlbendall closed 8 years ago

mlbendall commented 8 years ago

Enabled caching of taxonomy data to an Rdata file. Most of the time spent running runPathoStat goes to querying NCBI for taxonomy data. This change lets you save the downloaded taxonomy to an Rdata file and reload it on subsequent runs. (It should also make the development process faster.)

Enable this by providing a file path to the tax_cache argument of runPathoStat. If the file does not exist, findTaxonomy is called as usual, then the results are saved to the file path. If the file does exist, the taxonomy is loaded from the file.

Example 1 with caching (below) takes 30-40 seconds on the first run, but launches in 1-2 seconds on subsequent runs.

# batch and condition are assumed to be defined already, as in Example 1
example_data_dir <- system.file("example/data", package = "PathoStat")
runPathoStat(input_dir=example_data_dir, batch=batch, condition=condition,
             report_file="pathostat_report.html", report_dir=".",
             report_option_binary="111111111", view_report=FALSE,
             interactive=TRUE, tax_cache='./taxonomy.Rdata')

Reset the cache by deleting the Rdata file.
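
The load-or-build behaviour described above can be sketched roughly as follows. This is only an illustration and not the code in this pull request: load_or_build_taxonomy and slow_lookup are hypothetical names, and slow_lookup stands in for the real findTaxonomy call so that the example runs without querying NCBI.

```r
# Hypothetical helper: load the taxonomy from cache_file if it exists,
# otherwise compute it with lookup_fn and save it for the next run.
load_or_build_taxonomy <- function(ids, cache_file = NULL, lookup_fn) {
  if (!is.null(cache_file) && file.exists(cache_file)) {
    env <- new.env()
    load(cache_file, envir = env)  # restores the object saved as "taxonomy"
    return(env$taxonomy)
  }
  taxonomy <- lookup_fn(ids)
  if (!is.null(cache_file)) {
    save(taxonomy, file = cache_file)
  }
  taxonomy
}

# Stand-in for the slow NCBI query (an assumption, not findTaxonomy itself).
lookup_calls <- 0
slow_lookup <- function(ids) {
  lookup_calls <<- lookup_calls + 1
  data.frame(id = ids, name = paste0("taxon_", ids), stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".Rdata")
first  <- load_or_build_taxonomy(c("9606", "562"), cache_file, slow_lookup)
second <- load_or_build_taxonomy(c("9606", "562"), cache_file, slow_lookup)

# Resetting the cache is just a matter of deleting the file.
unlink(cache_file)
```

In this sketch the second call never reaches slow_lookup, which is where the speed-up on later runs would come from.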

mani2012 commented 8 years ago

I was in the process of creating my own PathoStat object, an extension of the phyloseq object, to serve as the input to the runPathoStat() function. Because it extends phyloseq, this object automatically carries the taxonomy information. I was planning to write a function that takes the PathoScope report files directory as input and generates this PathoStat object, which is then passed to runPathoStat. This way we do not need to worry about caching, since the taxonomy is generated once and stored in the PathoStat object for any number of later analyses. The PathoStat object, stored as an Rdata file, can also be supplied by the user simply by browsing to the file.

mlbendall commented 8 years ago

Sure, I agree with that goal. Should we merge this in until the new function is up and running? Or just close this request?

mani2012 commented 8 years ago

Ok, let's merge this for now.

ecastron commented 8 years ago

Mani,

Perhaps you could recycle the piece of code we wrote on getting taxonomy lineages using taxonomy IDs. It’ll require another dependency (taxize) but it’s fairly fast. I’m attaching the function if you want to give it a second look.

It’ll take a directory with tsv files and produce two files: an OTU table and a taxonomy matrix.

Take care,

Eduardo
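
For readers following along, the OTU-table half of that idea might look something like the sketch below. It is not the attached function: build_otu_table is a hypothetical name, and the input layout (one tsv per sample with tax_id and count columns, each tax_id appearing once per file) is an assumption made only for illustration. The taxonomy matrix is left out because it would need a network lookup.

```r
# Hypothetical sketch: combine per-sample tsv files into one OTU table.
# Assumed layout: each file has the columns "tax_id" and "count".
build_otu_table <- function(input_dir, pattern = "\\.tsv$") {
  files <- list.files(input_dir, pattern = pattern, full.names = TRUE)
  samples <- lapply(files, read.delim, stringsAsFactors = FALSE,
                    colClasses = c("character", "numeric"))
  names(samples) <- sub(pattern, "", basename(files))

  # Rows are the union of all taxonomy IDs; taxa absent from a sample stay 0.
  tax_ids <- sort(unique(unlist(lapply(samples, function(s) s$tax_id))))
  otu <- matrix(0, nrow = length(tax_ids), ncol = length(samples),
                dimnames = list(tax_ids, names(samples)))
  for (sample_name in names(samples)) {
    s <- samples[[sample_name]]
    otu[s$tax_id, sample_name] <- s$count
  }
  otu
}

# Two small made-up samples written to a temporary directory.
demo_dir <- tempfile("reports")
dir.create(demo_dir)
write.table(data.frame(tax_id = c("562", "1280"), count = c(10, 5)),
            file.path(demo_dir, "sampleA.tsv"), sep = "\t",
            quote = FALSE, row.names = FALSE)
write.table(data.frame(tax_id = c("562", "9606"), count = c(3, 7)),
            file.path(demo_dir, "sampleB.tsv"), sep = "\t",
            quote = FALSE, row.names = FALSE)

otu <- build_otu_table(demo_dir)
```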

On May 6, 2016, at 4:13 PM, mlbendall notifications@github.com wrote:

Sure, I agree with that goal. Should we merge this in until the new function is up and running? Or just close this request?


mani2012 commented 8 years ago

Yes, I noticed that you had used the 'taxize' library. Since 'taxize' also makes web requests, I was not sure whether its performance differs from that of 'rentrez', or even whether it uses 'rentrez' under the hood. We can certainly try 'taxize' based on your function, as I have noticed 'rentrez' sometimes failing when the server is busy.

ecastron commented 8 years ago

I can only add that taxize is faster; however, it sometimes gets the same errors as rentrez when the server is busy. We worked our way around this by retrying until the server responds.

Cheers,

Eduardo
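
A retry workaround of the kind mentioned above could be sketched as follows. This is not the code referred to in the comment: retry and flaky_request are hypothetical names, and flaky_request merely simulates a busy server so that the example runs offline. Unlike retrying indefinitely, this sketch gives up after max_tries attempts.

```r
# Hypothetical helper: call fn() until it succeeds or max_tries is reached,
# pausing wait_seconds between attempts.
retry <- function(fn, max_tries = 5, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Stand-in for a flaky web request: fails twice, then succeeds.
calls <- 0
flaky_request <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("server busy")
  "Escherichia coli"
}

lineage <- retry(flaky_request, max_tries = 5, wait_seconds = 0)
```

In real use, a non-zero wait between attempts would be kinder to a server that is already busy.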

On May 6, 2016, at 4:54 PM, Solaiappan Manimaran notifications@github.com wrote:

Yes, I noticed that you had used the library 'taxize'. Since 'taxize' also makes web request, I was not sure whether it is any different in performance than 'rentrez' or even whether that also uses 'rentrez' under the hood. We can certainly try with 'taxize' based on your function, as I had noticed 'rentrez' failing sometime when the server is busy.
