matsengrp / sumrep

Summary statistics for repertoires
16 stars 6 forks source link

Nearest_Neighbour error #21

Closed Pezhvuk closed 5 years ago

Pezhvuk commented 5 years ago

Hi Branden,

Here is the report for an error to do with getNearestNeighbourDistribution function:

Error in sample.int(length(x), size, replace, prob): invalid first argument
Traceback:

1. getNearestNeighborDistribution(igblast_output$annotations, k = 1)
2. getApproximateNearestNeighborDistribution(dat = dat, column = column, 
 .     ...)   # at line 283-287 of file /d/as2/u/mp002/sumrep-new/R/SummaryStats.R
3. dat %>% doNNSubsamplingBatchStep(column = column)   # at line 104-105 of file /d/as2/u/mp002/sumrep-new/R/Approximation.R
4. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
5. eval(quote(`_fseq`(`_lhs`)), env, env)
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. `_fseq`(`_lhs`)
8. freduce(value, `_function_list`)
9. withVisible(function_list[[k]](value))
10. function_list[[k]](value)
11. doNNSubsamplingBatchStep(., column = column)
12. replicate(batch_size, {
  .     unique_id <- sample(dat$unique_id, 1)
  .     seq_dat <- dat[dat$unique_id == unique_id, ]
  .     nn_dist <- stringdist::stringdist(dat[dat$unique_id != unique_id, 
  .         ][[column]], seq_dat[[column]], method = "lv") %>% min
  . })   # at line 129-141 of file /d/as2/u/mp002/sumrep-new/R/Approximation.R
13. sapply(integer(n), eval.parent(substitute(function(...) expr)), 
  .     simplify = simplify)
14. lapply(X = X, FUN = FUN, ...)
15. FUN(X[[i]], ...)
16. sample(dat$unique_id, 1)   # at line 131-133 of file /d/as2/u/mp002/sumrep-new/R/Approximation.R
17. sample.int(length(x), size, replace, prob)

I have also attached couple of the files, which, I found to return this error. They seem to behave well with all metrics, save for the getNearestNeighbourDistribution.

I have searched the whole repository, but I cannot find any reference to the sample variable (assuming it's a variable), nor have I found any reference to the specific cause of the issue, i.e. sample.int(length(x), size, replace, prob). The only reference I found to sample.int is in the DivergenceFunctions.R.

Any idea what's going wrong? I suspect it has something to do with the length(x), but a more specific error report is also needed. Perhaps, a try and except block for exception handling is needed for the metrics.

S-FV_-1h.txt S-FV_+7d.txt

BrandenOlson commented 5 years ago

Hey Peji,

It looks like you do not have the most recent version of sumrep. I believe this is a bug I fixed a few months ago -- it stems from the line unique_id <- sample(dat$unique_id, 1) (also hinted at on line 16 above), which erroneously assumes your dataset has a unique_id column.

Please pull the most recent version and let me know if you are still running into an issue with this function.

Thanks, Branden

Pezhvuk commented 5 years ago

Hey Branden,

Thank you for the prompt help! Silly mistake on my part indeed. Copy-pasting bits of my scripts last night must have replaced the line pointing to the most recent version (5 days old), with the archived one I kept.

I also have another question regarding threading. Do the distance-based metrics, e.g. NearestNeighbour use up as many threads as possible by default?

Cheers, Peji.

Pezhvuk commented 5 years ago

In the interest of this topic arising in the future; I got the answer to my question above, about threading. It seemed that R runs on one core for a while, right after igblast annotation is finished, which, made me question whether something was not right. But, then defineClones.py started running on multiple cores, followed by R running on the specified number of cores (for getNearestNeighbourDistribution).

It seems that the single-thread period between igblast and defineClones.py is to do with Shazam and nothing to do with Sumrep, as in the following block:

cluster_threshold <- annotations %>%
                    shazam::distToNearest(sequenceColumn="junction",
                                          vCallColumn="v_call",
                                          jCallColumn="j_call"
                                         ) %$%
                    DIST_NEAREST %>%
                    shazam::findThreshold() %>%
                    slot(name="threshold")

Thanks, Peji.

javh commented 5 years ago

You should be able to pass an nproc argument to shazam::distToNearest. It just defaults to nproc=1.

BrandenOlson commented 5 years ago

Thanks, @javh! It seems like a straightforward and sensible idea to allow nproc to be passed to shazam::distToNearest for any sumrep functions which use it, defaulting to nproc=1. How does that sound, @Pezhvuk?

Pezhvuk commented 5 years ago

Sounds great @BrandenOlson ! I think it might make a big difference on large repertoires.

javh commented 5 years ago

Setting nproc=alakazam::cpuCount() should work for auto-detection on windows/mac/linux, if you want to go that route.

I personally prefer requiring the user to explicitly specify the process count though, because that gets weird on clusters sometimes (cores allocated for the job are not necessarily equal to cores on the node). That's just personal taste though.

BrandenOlson commented 5 years ago

I also would prefer requiring explicit user specification of nproc. Perhaps I'll mention the nproc=alakazam::cpuCount() option in the documentation for those users who wish for auto-detection, with a 'use at your own risk' warning.

BrandenOlson commented 5 years ago

Closed in c2b7e36bb9e4cdde63bb9951995944a2861b9723.

BrandenOlson commented 5 years ago

@Pezhvuk -- the nproc functionality should be ready to use. I also allowed for environment variables for the igblast_dir and changeo_dir arguments, which might make your life easier.

Pezhvuk commented 5 years ago

@BrandenOlson, fantastic!