Closed: Irrationone closed this issue 6 years ago.
The answer depends somewhat on whether you're mostly interested in the largest clusters or not, but assuming you are, I would probably:
If you're not mostly interested in the larger clusters, then the problem is more computationally difficult, and in that case, yeah, I would merge the samples together. So backing up a little: the absolutely simplest, most proper way to find all the clonal groups among all the samples would just be to merge all the samples together and partition the resulting super-sample. Even if you don't do that in its entirety, I would merge random subsets of each sample together and cluster on that "sub-super-sample", just as a validation of the above seeded methods.
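The random-subset merge can be sketched in plain Python, without any partis internals (the sample labels, file names, and subset size below are all hypothetical):

```python
import random

def read_fasta(fname):
    """Minimal FASTA reader returning a list of (name, sequence) pairs."""
    seqs, name, chunks = [], None, []
    with open(fname) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if name is not None:
                    seqs.append((name, ''.join(chunks)))
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        seqs.append((name, ''.join(chunks)))
    return seqs

def write_sub_super_sample(samples, outfname, n_per_sample=10000):
    """samples: list of (label, fasta_fname) pairs; writes a merged, subsampled FASTA."""
    with open(outfname, 'w') as outf:
        for label, fname in samples:
            seqs = read_fasta(fname)
            subset = random.sample(seqs, min(n_per_sample, len(seqs)))
            for name, seq in subset:
                # prefix each read id with its sample label so the sample of
                # origin survives the merge
                outf.write('>%s-%s\n%s\n' % (label, name, seq))
```

The merged file can then be partitioned as a single sample, and the label prefixes tell you afterwards which sample each read came from.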
Thanks! I'll probably go with taking a sub-super-sample -- I am only interested in large clusters, but primarily those that have representation from multiple samples (which doesn't seem to be that common).
Does `partis` remove duplicates now? I got this message in the sw execution:

```
removed 193311 / 260790 = 0.74 duplicate sequences after trimming framework insertions (leaving 67479)
```

Is there any way to keep track of the sequences that were flagged as duplicates (so I can get read counts at the end)?
ok, sounds good. that's definitely the simplest approach.
Yeah, it does. It was initially for somewhat more specialized reasons, but clustering time is so dependent on sample size that I think it makes sense to kill all duplicates in general. I might be able to add some read count information, but there isn't any at the moment. I was initially imagining that samples with significant duplicates would be rare, but maybe that's not the case and I should handle the information.
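In the meantime, one workaround is to collapse duplicates yourself before running partis, keeping a per-sequence read count that you can reattach to the output afterwards (a sketch in plain Python, not anything partis does itself):

```python
from collections import OrderedDict

def collapse_duplicates(seqs):
    """seqs: list of (name, sequence) pairs.
    Returns (unique seqs, read counts keyed by the kept name)."""
    kept = OrderedDict()  # sequence -> name of the first read seen with it
    counts = {}           # kept name -> number of reads collapsed into it
    for name, seq in seqs:
        if seq not in kept:
            kept[seq] = name
            counts[name] = 0
        counts[kept[seq]] += 1
    uniques = [(name, seq) for seq, name in kept.items()]
    return uniques, counts
```

You'd write the unique sequences to a new FASTA for partis, and later join `counts` back onto the partition output by sequence name.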
Yeah, I'm getting ~70% duplicate sequences removed on average over 10+ samples so far.
Could the centroids be printed out for `--naive-vsearch`? Perhaps as the first entry in each partition, if that's easiest. At the moment I'm testing a few random entries in each large partition to seed; I imagine some of the random seeds I select are at the edge of a large vsearch cluster and would be less likely to belong to the true partition.
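As a stopgap while picking seeds by hand, one alternative to a random member is the cluster medoid, i.e. the member with the smallest total Hamming distance to the others (a rough sketch; it assumes the cluster's sequences are equal length, e.g. already aligned):

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def medoid(seqs):
    """seqs: list of equal-length sequences; returns the index of the member
    minimizing the summed Hamming distance to all other members."""
    totals = [sum(hamming(s, t) for t in seqs) for s in seqs]
    return totals.index(min(totals))
```

A medoid sits near the middle of the cluster by construction, so it's less likely to be an edge case than a random draw.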
huh, replying to the email doesn't post here. weird. for posterity:
Well, I still recommend using the naive sequence... but, oh, right, maybe `--print-cluster-annotations` doesn't work for vsearch? I'll test and fix if not. The naive seqs should be the best estimate of the center of the cluster, I think.
Yeah, `--print-cluster-annotations` doesn't do anything for `--naive-vsearch`.
Yeah I never use the vsearch option, so I hadn't noticed. I'm fixing, and turning on cluster annotation printing by default now.
ok, the version I just pushed writes cluster annotations by default, including with vsearch partitioning.
Ah thanks, this advice is relevant to me. I ran the latest partis from docker, but I'm getting this error when I try vsearch partitioning.
```
/partis/bin/partis partition --infname patient1_fewer_singletons_200k_subsample.fasta --outfname patient1_fewer_singletons_200k_subsample_vsearch.csv --parameter-dir patient1_fewer_singletons_params_full --n-partitions-to-write 1 --naive-vsearch --n-procs 12
partitioning
  smith-waterman
    processed  remaining  new-indels  rerun: unproductive  no-match  weird-annot.  nonsense-bounds  invalid-codon  indel-fails  super-high-mutation
       209182      25073           0                11984       266           116                5            113        12584                    5
    increasing mismatch score (1 --> 2) and rerunning them
        25073      12558       12218                    0       118             0               31            189            0                    2
    rerunning for indels
        12558        543           0                    0       118             0               31            300           87                    7
    increasing mismatch score (2 --> 3) and rerunning them
          543        504          23                    0        39             0               54            324           63                    1
    rerunning for indels
    info for 208678 / 209182 = 0.998    kept 14987 (0.072) unproductive
    removed 1313 / 208678 = 0.01 duplicate sequences after trimming framework insertions (leaving 207365)
    water time: 1438.3
  hmm
    --> caching all 207365 naive sequences
      running 24 procs
      no/empty cache file
      calcd:  vtb 207365  fwd 0
      merged: hfrac 0  lratio 0
      min-max time: 662.7 - 675.5 sec
    hmm step time: 739.5
    naive hfrac bounds: 0.026 0.026  (0.082 mean mutation in parameter dir patient1_fewer_singletons_params_full/hmm)
    using hfrac bound for vsearch 0.026
Traceback (most recent call last):
  File "/partis/bin/partis", line 496, in <module>
    args.func(args)
  File "/partis/bin/partis", line 222, in run_partitiondriver
    parter.partition()
  File "/partis/python/partitiondriver.py", line 314, in partition
    cpath = self.cluster_with_naive_vsearch_or_swarm(self.sub_param_dir)
  File "/partis/python/partitiondriver.py", line 506, in cluster_with_naive_vsearch_or_swarm
    proc = Popen(cmd.split(), stdout=PIPE, stderr=PIPE)
  File "/usr/local/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/local/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
```
I don't know for sure because I didn't use the Docker install, but that error's probably popping up because of the directory you call `partis` from. The path to `vsearch` is hardcoded in `partitiondriver.py`, so it doesn't find it unless you run from `./bin/partis`.
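If you're driving partis from a script in the meantime, one workaround is to launch it with the working directory set to the partis root, so the relative vsearch path resolves (a sketch; the `run_partis` helper is my own, and in the Docker image the root would be `/partis`):

```python
import subprocess

def run_partis(partis_dir, args):
    """Launch partis with cwd set to its install root, so relative paths
    (like the hardcoded vsearch location) resolve correctly."""
    return subprocess.Popen(['./bin/partis'] + args, cwd=partis_dir,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
```

E.g. `run_partis('/partis', ['partition', '--infname', 'in.fa'])` inside the container, instead of calling `/partis/bin/partis` from some other directory.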
arg, that's a bug. Thanks for pointing it out, I just fixed in the dev version (it'll pop up here when it's merged).
If you need it in the meantime, the diff for that commit is very small.
Hi,
I'm interested in looking for partitions that are shared between multiple samples (from different anatomic locations) from the same patient. What's the best way to do this (that's still feasible in computational time)? Each sample has upwards of 300k sequences.
At the moment I'm juggling two options. The first would be to merge the FASTQ files of all samples from the same patient together before partitioning (using vsearch followed by `seed-unique-id` to get cluster annotations), though I'm not certain whether merging is a robust approach. The other option I can think of would be to partition and `run-viterbi` each sample separately, then merge the viterbi outputs and re-cluster to get a centroid sequence to use as a seed-seq for each sample.
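For the merge option, here's a sketch of the downstream bookkeeping I have in mind, assuming each read id gets prefixed with a sample label before merging (the id scheme and function name are my own invention, not partis conventions):

```python
def multi_sample_clusters(partition, min_samples=2):
    """partition: list of clusters, each a list of 'sample-readid' strings.
    Returns the clusters drawing reads from at least min_samples samples,
    paired with the sorted list of samples they span."""
    flagged = []
    for cluster in partition:
        samples = {uid.split('-', 1)[0] for uid in cluster}
        if len(samples) >= min_samples:
            flagged.append((cluster, sorted(samples)))
    return flagged
```

That would let me pull out just the partitions shared across anatomic locations after a single merged run.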
Thanks!