Closed Pezhvuk closed 5 years ago
Hey Peji,
It looks like you do not have the most recent version of sumrep. I believe this is a bug I fixed a few months ago -- it stems from the line unique_id <- sample(dat$unique_id, 1)
(also hinted at on line 16 above), which erroneously assumes your dataset has a unique_id
column.
Please pull the most recent version and let me know if you are still running into an issue with this function.
Thanks, Branden
Hey Branden,
Thank you for the prompt help! Silly mistake on my part indeed. Copy-pasting bits of my scripts last night must have replaced the line pointing to the most recent version (5 days old), with the archived one I kept.
I also have another question regarding threading. Do the distance-based metrics, e.g. NearestNeighbour use up as many threads as possible by default?
Cheers, Peji.
In the interest of this topic arising in the future; I got the answer to my question above, about threading. It seemed that R runs on one core for a while, right after igblast annotation is finished, which, made me question whether something was not right. But, then defineClones.py
started running on multiple cores, followed by R running on the specified number of cores (for getNearestNeighbourDistribution
).
It seems that the single-thread period between igblast
and defineClones.py
is to do with Shazam and nothing to do with Sumrep, as in the following block:
cluster_threshold <- annotations %>%
shazam::distToNearest(sequenceColumn="junction",
vCallColumn="v_call",
jCallColumn="j_call"
) %$%
DIST_NEAREST %>%
shazam::findThreshold() %>%
slot(name="threshold")
Thanks, Peji.
You should be able to pass an nproc
argument to shazam::distToNearest
. It just defaults to nproc=1
.
Thanks, @javh! It seems like a straightforward and sensible idea to allow nproc
to be passed to shazam::distToNearest
for any sumrep functions which use it, defaulting to nproc=1
. How does that sound, @Pezhvuk?
Sounds great @BrandenOlson ! I think it might make a big difference on large repertoires.
Setting nproc=alakazam::cpuCount()
should work for auto-detection on windows/mac/linux, if you want to go that route.
I personally prefer requiring the user to explicitly specify the process count though, because that gets weird on clusters sometimes (cores allocated for the job are not necessarily equal to cores on the node). That's just personal taste though.
I also would prefer requiring explicit user specification of nproc
. Perhaps I'll mention the nproc=alakazam::cpuCount()
option in the documentation for those users who wish for auto-detection, with a 'use at your own risk' warning.
Closed in c2b7e36bb9e4cdde63bb9951995944a2861b9723.
@Pezhvuk -- the nproc
functionality should be ready to use. I also allowed for environment variables for the igblast_dir
and changeo_dir
arguments, which might make your life easier.
@BrandenOlson, fantastic!
Hi Branden,
Here is the report for an error to do with getNearestNeighbourDistribution function:
I have also attached couple of the files, which, I found to return this error. They seem to behave well with all metrics, save for the
getNearestNeighbourDistribution
.I have searched the whole repository, but I cannot find any reference to the
sample
variable (assuming it's a variable), nor have I found any reference to the specific cause of the issue, i.e.sample.int(length(x), size, replace, prob)
. The only reference I found tosample.int
is in the DivergenceFunctions.R.Any idea what's going wrong? I suspect it has something to do with the
length(x)
, but a more specific error report is also needed. Perhaps, a try and except block for exception handling is needed for the metrics.S-FV_-1h.txt S-FV_+7d.txt