missuse / ragp

Filter plant hydroxyproline rich glycoproteins
MIT License

progress bar for get_hmm and get_big_pi #3

Closed TS404 closed 6 years ago

TS404 commented 6 years ago

get_hmm and get_big_pi can take a while. When submitting >1000 sequences, I've tended to submit them individually in order to salvage the results if the server becomes unresponsive after an hour. I don't know whether the server prefers batch queries over repeated individual queries, but even so, sequences could be submitted in batches of 10 or 50 to give an idea of the estimated time.

annotx <- NULL
# text progress bar (all platforms) plus a Windows-only progress window
pbt <- txtProgressBar(min = 0, max = length(sequences), style = 3)
pbw <- winProgressBar(min = 0, max = length(sequences), title = "HMM progress")

for (i in seq_along(sequences)) {
  seqsubset <- sequences[i]
  annotx <- rbind(annotx, ragp::get_hmm(sequence = seqsubset,
                                        id = names(seqsubset),
                                        verbose = FALSE,
                                        sleep = 0))
  setTxtProgressBar(pbt, i)
  setWinProgressBar(pbw, i, title = paste("HMM progress:",
                                          round(i / length(sequences) * 100, 0),
                                          "%      (",
                                          names(seqsubset),
                                          ")"))
}
close(pbw)
close(pbt)
annotx
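As a sketch of the batching idea above (assuming get_hmm accepts a vector of sequences per call, which would need checking; if it only accepts one, the inner call would loop as in the code above), sequences could be split into chunks of 50:

```r
# Split sequences into batches of 50 using base R only
batch_size <- 50
batches <- split(sequences, ceiling(seq_along(sequences) / batch_size))

annotx <- NULL
pb <- txtProgressBar(min = 0, max = length(batches), style = 3)
for (i in seq_along(batches)) {
  batch <- batches[[i]]
  annotx <- rbind(annotx,
                  ragp::get_hmm(sequence = batch,
                                id = names(batch),
                                verbose = FALSE))
  setTxtProgressBar(pb, i)  # one tick per completed batch
}
close(pb)
```

This way a hang costs at most one batch of results rather than the whole run.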
missuse commented 6 years ago

I have had a similar experience with get_hmm. Some days the hmmscan server is very unresponsive, and batch upload to hmmscan is even worse: uploads are put in a queue, and sometimes hours pass before they even start.

Currently I would like to change get_hmm to resubmit a sequence after some time if the result is not provided. If the second submission also hangs, the function ends, returning results for the sequences processed up to that point along with an error message. I trust this is the best solution in this case.
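A minimal sketch of that resubmit idea, written as a hypothetical standalone wrapper (the real logic would of course live inside get_hmm itself, and the wait time is illustrative):

```r
# Illustrative retry wrapper: try a query, resubmit on failure,
# and return NULL so the caller can stop and keep partial results.
query_with_retry <- function(seq, id, attempts = 2, wait = 5) {
  for (k in seq_len(attempts)) {
    res <- tryCatch(
      ragp::get_hmm(sequence = seq, id = id, verbose = FALSE),
      error = function(e) NULL
    )
    if (!is.null(res)) return(res)
    Sys.sleep(wait)  # pause before resubmitting
  }
  NULL  # all attempts hung
}
```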

For 10k+ sequences I recommend using standalone HMMER. Perhaps a function that takes output from HMMER and imports it in the same format as the output of get_hmm?
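Such an import function could, for example, read HMMER's `--domtblout` output (a sketch only; the column names follow the HMMER 3 per-domain table format, and the exact mapping onto get_hmm's output columns would still need to be worked out):

```r
# Sketch: read a hmmscan/hmmsearch --domtblout file into a data frame.
# The file is whitespace-delimited, with '#' comment lines; the first
# 22 columns are fixed and the remainder is a free-text description.
read_domtblout <- function(path) {
  lines <- readLines(path)
  lines <- lines[!startsWith(lines, "#")]
  fields <- strsplit(lines, "\\s+")
  df <- as.data.frame(do.call(rbind, lapply(fields, `[`, 1:22)),
                      stringsAsFactors = FALSE)
  names(df) <- c("target_name", "target_acc", "tlen",
                 "query_name", "query_acc", "qlen",
                 "full_evalue", "full_score", "full_bias",
                 "dom_num", "dom_of", "c_evalue", "i_evalue",
                 "dom_score", "dom_bias",
                 "hmm_from", "hmm_to", "ali_from", "ali_to",
                 "env_from", "env_to", "acc")
  df
}
```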

I haven't had this problem with get_big_pi; usually only sequences containing an N-terminal signal peptide are sent to big-PI, and in my experience it works solidly for up to 5k sequences. Batch queries would be a good addition (and perhaps faster), but this requires a complete rewrite of the function. I will do some testing and then decide whether we should go in this direction.

A progress bar would be a nice addition; perhaps when verbose = FALSE a progress bar could be displayed, for both functions and even for the *_file functions.

TS404 commented 6 years ago

Good idea for get_hmm! I think that's a very sensible way to do it.

I've not managed to recreate the get_big_pi issue, so perhaps there was something odd about the time it happened to me.

missuse commented 6 years ago

I have managed to speed up get_big_pi significantly, and I have implemented the progress bar as suggested. I just need to perform some checks before deployment; I trust it will be available during the weekend. I think I will update all the scraping functions with progress bars.

missuse commented 6 years ago

get_big_pi has been updated. The update should provide a significant speed-up, and it should now be more in line with the speed of get_phobius.

For instance:

system.time(
  test_big_pi <- ragp::get_big_pi(at_nsp[1:1000, ],
                                  sequence,
                                  Transcript.id,
                                  simplify = FALSE)
)
#output
   user  system elapsed 
   3.53    0.13   57.88 

Bugs are possible, if you stumble upon any please report them.

missuse commented 6 years ago

get_hmm has been updated.

New arguments are:

  1. timeout - the time in seconds to wait for the server response (default: 10 s).
  2. attempts - the number of attempts if the server is unresponsive (default: 2).

If the number of attempts is exhausted, the function issues a warning and returns the queries finished so far.
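For a flaky server, a more patient configuration might look like this (the values are illustrative, not recommendations):

```r
# Wait up to 30 s per response and retry up to 4 times before
# giving up and returning the partial results with a warning
test_hmm <- ragp::get_hmm(sequence = at_nsp$sequence,
                          id = at_nsp$Transcript.id,
                          timeout = 30,
                          attempts = 4)
```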

Additionally, a progress bar has been added.

Bugs are possible, if you stumble upon any please report them.

Thank you for the suggestions.

Closing this issue.