psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

simulating paired chains #313

Closed: andreirajkovic closed this issue 3 years ago

andreirajkovic commented 3 years ago

I've been attempting to simulate paired heavy and light chain data; however, I've only been able to do so by modifying lines 695 and 696 in the recombinator.py script from

            # plines += ['input.tree1 = user(file=%s)' % treefname]
            # plines += ['input.tree1.format = Newick']  # this is the default, but adding this keeps it from printing a warning

to

            plines += ['input.tree.file = %s' % treefname]
            plines += ['input.tree.format = Newick']
            plines += ['output.sequence.file = %s' % leaf_seq_fname]

I haven't checked to see whether the simulated data I'm producing makes sense, but I thought I'd bring this issue to your attention.

psathyrella commented 3 years ago

Hmm, thanks for posting. It's working for me; I wonder if you have a different version of bpp installed that's earlier in your path, or maybe our use of LD_LIBRARY_PATH isn't working?

partis includes two versions of bpp, which it chooses between based on whether --no-per-base-mutation is set. That's unfortunate, but as far as I could tell the 'newlik' branch, which is necessary for per-base mutation, had to be compiled locally, so I couldn't just pull it from a package manager. It looks like at the moment it's appending its lib path to the end of LD_LIBRARY_PATH, which could be the issue.

psathyrella commented 3 years ago

Unfortunately, using the three lines above that are working for you (with input.tree.file rather than input.tree1) breaks the newlik branch/per-base mutation (although they work with per-base mutation off).

If memory serves, their docs suggest they were switching to a new option parsing format, but it seems it's either not completely implemented or not backwards compatible.
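For reference, the two resulting option formats, as rendered from the snippets above (file names here are placeholders):

what the bundled newlik branch expects:

    input.tree1 = user(file=trees.nwk)
    input.tree1.format = Newick

what works with per-base mutation off:

    input.tree.file = trees.nwk
    input.tree.format = Newick
    output.sequence.file = leaf-seqs.fa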

andreirajkovic commented 3 years ago

I think you're right about the libraries being the problem: echo $LD_LIBRARY_PATH came up empty, so I manually linked them with export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/partis/packages/bpp-newlik/_build/lib. But even then, bpp complains that input.tree.file is missing.

As for the version, the command being passed contains bpp-newlik/_build, so I think partis is constructing the command correctly.

To install bpp-newlik, did you do anything beyond executing ./build.sh with-simulation?

Would it be possible to use the most up-to-date version of bppsuite?

Also, I didn't know this, but cloning the repo does not actually clone the nested repos, e.g. the eigen3 directory is empty, and I had to manually clone each of those directories. Is that the intended behavior?

psathyrella commented 3 years ago

I'm not sure I'm following, but I think the LD_LIBRARY_PATH modification in recombinator.py only affects the env it uses to run the subprocs, not the env you're running partis from. If the shell you're running from has an unset LD path, that does suggest it's also empty when recombinator starts adding to it, but something else (outside partis) could be modifying it beforehand as well.
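For what it's worth, a minimal sketch of that mechanism (illustrative only, not the actual recombinator.py code; the binary name and config file are placeholders):

    import os
    import subprocess

    env = os.environ.copy()  # copy of the calling shell's env; the shell itself is untouched
    bpp_lib_dir = '/partis/packages/bpp-newlik/_build/lib'  # bundled lib dir from this thread
    # prepending (rather than appending) would make the bundled libs win over any system bpp:
    env['LD_LIBRARY_PATH'] = bpp_lib_dir + ':' + env.get('LD_LIBRARY_PATH', '')
    subprocess.check_call(['bppseqgen', 'param=bpp-config.txt'], env=env)  # placeholder bpp invocation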

No, the compilation in build.sh works for me.

I believe not; it looks to me like newlik is still many commits ahead of master. I don't know their plans for eventually merging; this thread is my only knowledge of that history.

Ah, this could be the problem. These are all git submodules (i.e. they need a specific commit, for instance this), so if you cloned them manually you'll presumably have pulled the wrong branch. I've only been testing the Docker install recently, which pulls submodules by default. What you want to do is this:

git submodule update --init --recursive

EDIT: to be clear, you'll presumably need to nuke what you have, then init/update the submodules.
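For a fresh checkout, standard git can also pull the submodules in one step (nothing partis-specific about this):

git clone --recurse-submodules https://github.com/psathyrella/partis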

I'll add it to the manual now.

psathyrella commented 3 years ago

ok, added clone/submodule commands to the docs. I'll need to finish a few things before I merge it to master, though.

andreirajkovic commented 3 years ago

The git submodule update fixed the bpp-newlik package! However, I'm still having trouble simulating anything more than a singleton. I produce a simulated dataset with the following command:

./bin/partis simulate --paired-loci --simulate-from-scratch --paired-outdir /data/sim_scratch/ --n-sim-events 100 --n-trees 25

Next I run a partition on that data, but it only detects a single cluster. Am I missing some parameter here to force a more complex partition?

./bin/partis partition --infname /data/sim_scratch/all-seqs.fa --paired-loci --paired-outdir /data/param_output --n-procs 8 --count-parameters --parameter-out-dir /data/param_output

Simulating data from the partition confirms that only a single cluster exists for each pair.

./bin/partis simulate --paired-loci --parameter-dir /data/param_output/parameters --paired-outdir /data/simulated_output

Thank you so far for all your help :D

EDIT: the initial simulated dataset does have clusters with more than one pair of heavy and light chains.

psathyrella commented 3 years ago

Yes, the number of families is controlled by --n-sim-events, which defaults to 1. There's some documentation in the manual's section on simulation, but run partis simulate --help for a comprehensive rundown of the options and their descriptions. Note also that you can view simulation results directly, without running inference on them, with the view-output action.
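For instance, reusing the flags from your command above (paths are illustrative, and I'm assuming view-output takes the same --paired-outdir flag as the other paired actions):

./bin/partis simulate --paired-loci --simulate-from-scratch --n-sim-events 10 --paired-outdir /data/sim_scratch
./bin/partis view-output --paired-loci --paired-outdir /data/sim_scratch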

andreirajkovic commented 3 years ago

I've been using the view-output action, which is very helpful, but it reveals that there is only a single partition with multiple clusters. I've been able to generate multiple partitions when running in unpaired mode; however, when running in paired mode, the multi-hmm doesn't run for some reason.

psathyrella commented 3 years ago

Sorry, I don't understand the question. Is this about simulation behavior or inference behavior? And is the problem the number of partitions, or the number of clusters/families within the partitions?

andreirajkovic commented 3 years ago

I'll take a shot at distilling the question: how can I use the simulate --paired-loci action to generate data that will have multiple partitions, i.e. such that after partitioning the simulated data, cluster_size.csv has more than one line?

psathyrella commented 3 years ago

Ah, yeah, simulation always stores only one partition: the true one. If you want a series of partitions, you run 'partition' inference, which it sounds like is what you did; I'm only now understanding the purpose.

Unfortunately, the algorithm that merges heavy and light chain partitions at the moment only handles one partition at a time. So the single-chain output files will have the normal progression of partitions around the best partition, but the code that uses the info from heavy to improve light, and vice versa, just picks one partition (atm the best one by default, or the "last" one if --n-final-clusters or --min-largest-cluster-size are set). In principle we could do paired clustering on each pair of partitions sequentially from the heavy and light files, but this wouldn't be very meaningful, since there isn't really any way to decide which heavy partition would go with which light one, for instance.

But really the issue is that the paired clustering algorithm operates, basically, by splitting clusters in one chain's partition based on info in the other, and, largely due to the complicated overlap removal involved, I'm not sure it could really allow a neat display of uncertainty in the form of a series of partitions, the way you can with single-chain partitions.

Probably a better way to say it is that the proper thing to do for paired clustering would be exactly what we do for single-chain clustering, evaluating each potential merge in hierarchical agglomeration with likelihood or naive hamming distance, but with both chains evaluated in the same place at the same time. Unfortunately, that would require a massive rewrite of everything in the guts of partis and bcrham (like, tens of thousands of lines of code) to allow them to work on pairs of heavy/light sequences rather than individual sequences. So, at least for the foreseeable future, we've implemented this more ad hoc paired clustering procedure that uses information from the other chain's existing partition, rather than clustering both chains at once.
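A toy sketch of that splitting idea, just to illustrate (not partis's actual code, which also has to handle the overlap removal and likelihood evaluation mentioned above; all names here are made up):

    from collections import defaultdict

    def split_by_other_chain(h_partition, l_partition, h_to_l):
        # h_partition, l_partition: lists of clusters (lists of sequence ids)
        # h_to_l: dict mapping each heavy id to its paired light id
        l_cluster_of = {uid: i for i, cluster in enumerate(l_partition) for uid in cluster}
        split_partition = []
        for h_cluster in h_partition:
            by_l_cluster = defaultdict(list)
            for uid in h_cluster:  # group heavy ids by the light cluster of their mate
                by_l_cluster[l_cluster_of.get(h_to_l.get(uid))].append(uid)
            split_partition.extend(by_l_cluster.values())
        return split_partition

    # one heavy cluster whose members pair into two different light clusters gets split in two:
    print(split_by_other_chain([['h1', 'h2', 'h3']], [['l1', 'l2'], ['l3']],
                               {'h1': 'l1', 'h2': 'l2', 'h3': 'l3'}))
    # --> [['h1', 'h2'], ['h3']]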

andreirajkovic commented 3 years ago

Thank you for that very detailed explanation! And presumably doing something as simple as concatenating the heavy and light chains into a single sequence would just generate something nonsensical for the alignment, and probably mess up the HMM states.

Anyway I'll stick with partitioning on the heavy chain.

This issue is resolved as far as I'm concerned. Thanks again!

psathyrella commented 3 years ago

Yeah, the structure of the HMM is entirely based on the structure of VDJ rearrangement, so concatenating heavy and light sequences wouldn't work; they'd have to be run separately, but with all the info for both kept together.

Heavy-only partitioning is probably the way to go, but I would be a bit careful: while the biggest effect of paired clustering is to massively improve the light chain partition (since light chains are many orders of magnitude less diverse, and thus have very frequent "collisions" between unrelated naive rearrangements), light chain information also improves the heavy chain partitions at the level of 5-10%. In other words, heavy families from different rearrangement events end up with indistinguishable naive sequences (or pair with multiple light chains) at a non-negligible rate, which is something that's only become measurable now that we have paired data. Basically, the dashed red boxes in these random 10x example data sets:

[attached plots]