Closed: Irrationone closed this issue 6 years ago.
The answer depends somewhat on whether you're mostly interested in the largest clusters or not, but assuming you are, I would probably:
If you're not mostly interested in the larger clusters, then the problem is more computationally difficult, and in that case, yeah, I would merge the samples together. So backing up a little: the absolutely simplest, most proper way to find all the clonal groups among all the samples would just be to merge all the samples together and partition the resulting super-sample. Even if you don't do that in its entirety, I would merge random subsets of each sample together and cluster on that "sub-super-sample", just as a validation of the above seeded methods.
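The random-subset merge can be sketched in plain Python, without any partis internals (the sample labels, file names, and subset size below are all hypothetical):

```python
import random

def read_fasta(fname):
    """Minimal FASTA reader returning a list of (name, sequence) pairs."""
    seqs, name, chunks = [], None, []
    with open(fname) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if name is not None:
                    seqs.append((name, ''.join(chunks)))
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        seqs.append((name, ''.join(chunks)))
    return seqs

def write_sub_super_sample(samples, outfname, n_per_sample=10000):
    """samples: list of (label, fasta_fname) pairs; writes a merged, subsampled FASTA."""
    with open(outfname, 'w') as outf:
        for label, fname in samples:
            seqs = read_fasta(fname)
            subset = random.sample(seqs, min(n_per_sample, len(seqs)))
            for name, seq in subset:
                # prefix each read id with its sample label so the sample of
                # origin survives the merge
                outf.write('>%s-%s\n%s\n' % (label, name, seq))
```

The merged file can then be partitioned as a single sample, and the label prefixes tell you afterwards which sample each read came from.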
Thanks! I'll probably go with taking a sub-super-sample -- I am only interested in large clusters, but primarily those that have representation from multiple samples (which doesn't seem to be that common).
Does `partis` remove duplicates now? I got this message in the sw execution:

```
removed 193311 / 260790 = 0.74 duplicate sequences after trimming framework insertions (leaving 67479)
```

Is there any way to keep track of the sequences that were flagged as duplicates (so I can get read counts at the end)?
ok, sounds good. that's definitely the simplest approach.
Yeah, it does. It was initially for somewhat more specialized reasons, but clustering time is so dependent on sample size that I think it makes sense to kill all duplicates in general. I might be able to add some read count information, but there isn't any at the moment. I was initially imagining that samples with significant duplicates would be rare, but maybe that's not the case and I should handle the information.
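In the meantime, one workaround is to collapse duplicates yourself before running partis, keeping a per-sequence read count that you can reattach to the output afterwards (a sketch in plain Python, not anything partis does itself):

```python
from collections import OrderedDict

def collapse_duplicates(seqs):
    """seqs: list of (name, sequence) pairs.
    Returns (unique seqs, read counts keyed by the kept name)."""
    kept = OrderedDict()  # sequence -> name of the first read seen with it
    counts = {}           # kept name -> number of reads collapsed into it
    for name, seq in seqs:
        if seq not in kept:
            kept[seq] = name
            counts[name] = 0
        counts[kept[seq]] += 1
    uniques = [(name, seq) for seq, name in kept.items()]
    return uniques, counts
```

You'd write the unique sequences to a new FASTA for partis, and later join `counts` back onto the partition output by sequence name.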
Yeah, I'm getting ~70% duplicate sequences removed on average over 10+ samples so far.
Could the centroids be printed out for `--naive-vsearch`? Perhaps as the first entry in each partition, if that's easiest. At the moment I'm testing a few random entries in each large partition to seed; I imagine some of the random seeds I select are at the edge of a large vsearch cluster and would be less likely to belong to the true partition.
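As a stopgap while picking seeds by hand, one alternative to a random member is the cluster medoid, i.e. the member with the smallest total Hamming distance to the others (a rough sketch; it assumes the cluster's sequences are equal length, e.g. already aligned):

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def medoid(seqs):
    """seqs: list of equal-length sequences; returns the index of the member
    minimizing the summed Hamming distance to all other members."""
    totals = [sum(hamming(s, t) for t in seqs) for s in seqs]
    return totals.index(min(totals))
```

A medoid sits near the middle of the cluster by construction, so it's less likely to be an edge case than a random draw.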
huh, replying to the email doesn't post here. weird. for posterity:
Well, I still recommend using the naive sequence... but, oh, right, maybe `--print-cluster-annotations` doesn't work for vsearch? I'll test and fix if not. The naive seqs should be the best estimate of the center of the cluster, I think.
Yeah, `--print-cluster-annotations` doesn't do anything for `--naive-vsearch`.
Yeah I never use the vsearch option, so I hadn't noticed. I'm fixing, and turning on cluster annotation printing by default now.
ok, the version I just pushed writes cluster annotations by default, including with vsearch partitioning.
Ah thanks, this advice is relevant to me. I ran the latest partis from docker, but I'm getting this error when I try vsearch partitioning.
```
/partis/bin/partis partition --infname patient1_fewer_singletons_200k_subsample.fasta --outfname patient1_fewer_singletons_200k_subsample_vsearch.csv --parameter-dir patient1_fewer_singletons_params_full --n-partitions-to-write 1 --naive-vsearch --n-procs 12
partitioning
  smith-waterman
    processed  remaining  new-indels  rerun: unproductive  no-match  weird-annot.  nonsense-bounds  invalid-codon  indel-fails  super-high-mutation
       209182      25073           0                11984       266           116                5            113        12584                    5
    increasing mismatch score (1 --> 2) and rerunning them
        25073      12558       12218                    0       118             0               31            189            0                    2
    rerunning for indels
        12558        543           0                    0       118             0               31            300           87                    7
    increasing mismatch score (2 --> 3) and rerunning them
          543        504          23                    0        39             0               54            324           63                    1
    rerunning for indels
    info for 208678 / 209182 = 0.998    kept 14987 (0.072) unproductive
    removed 1313 / 208678 = 0.01 duplicate sequences after trimming framework insertions (leaving 207365)
    water time: 1438.3
  hmm
    --> caching all 207365 naive sequences
      running 24 procs
      no/empty cache file
      calcd:  vtb 207365  fwd 0
      merged: hfrac 0  lratio 0
      min-max time: 662.7 - 675.5 sec
    hmm step time: 739.5
    naive hfrac bounds: 0.026 0.026  (0.082 mean mutation in parameter dir patient1_fewer_singletons_params_full/hmm)
    using hfrac bound for vsearch 0.026
Traceback (most recent call last):
  File "/partis/bin/partis", line 496, in <module>
    args.func(args)
  File "/partis/bin/partis", line 222, in run_partitiondriver
    parter.partition()
  File "/partis/python/partitiondriver.py", line 314, in partition
    cpath = self.cluster_with_naive_vsearch_or_swarm(self.sub_param_dir)
  File "/partis/python/partitiondriver.py", line 506, in cluster_with_naive_vsearch_or_swarm
    proc = Popen(cmd.split(), stdout=PIPE, stderr=PIPE)
  File "/usr/local/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/local/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
```
I don't know for sure because I didn't use the Docker install, but that error's probably popping up because of the directory you call `partis` from. The path to `vsearch` is hardcoded in `partitiondriver.py`, so it doesn't find it unless you run from `./bin/partis`.
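If you're driving partis from a script in the meantime, one workaround is to launch it with the working directory set to the partis root, so the relative vsearch path resolves (a sketch; the `run_partis` helper is my own, and in the Docker image the root would be `/partis`):

```python
import subprocess

def run_partis(partis_dir, args):
    """Launch partis with cwd set to its install root, so relative paths
    (like the hardcoded vsearch location) resolve correctly."""
    return subprocess.Popen(['./bin/partis'] + args, cwd=partis_dir,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
```

E.g. `run_partis('/partis', ['partition', '--infname', 'in.fa'])` inside the container, instead of calling `/partis/bin/partis` from some other directory.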
arg, that's a bug. Thanks for pointing it out, I just fixed in the dev version (it'll pop up here when it's merged).
If you need it in the meantime, the diff for that commit is very small.
Hi,
I'm interested in looking for partitions that are shared between multiple samples (from different anatomic locations) from the same patient. What's the best way to do this (that's still feasible in computational time)? Each sample has upwards of 300k sequences.
At the moment I'm juggling two options. The first would be to merge the FASTQ files of all samples from the same patient together before partitioning (using vsearch followed by `seed-unique-id` to get cluster annotations), though I'm not certain whether merging is a robust approach. The other option I can think of would be to partition and `run-viterbi` each sample separately, then merge the viterbi outputs and re-cluster to get a centroid sequence to use as a seed-seq for each sample.
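For the merge option, here's a sketch of the downstream bookkeeping I have in mind, assuming each read id gets prefixed with a sample label before merging (the id scheme and function name are my own invention, not partis conventions):

```python
def multi_sample_clusters(partition, min_samples=2):
    """partition: list of clusters, each a list of 'sample-readid' strings.
    Returns the clusters drawing reads from at least min_samples samples,
    paired with the sorted list of samples they span."""
    flagged = []
    for cluster in partition:
        samples = {uid.split('-', 1)[0] for uid in cluster}
        if len(samples) >= min_samples:
            flagged.append((cluster, sorted(samples)))
    return flagged
```

That would let me pull out just the partitions shared across anatomic locations after a single merged run.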
Thanks!