psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction

Suggested optimization/work-around for out of memory error? #274

Closed: scharch closed this issue 5 years ago

scharch commented 5 years ago

Working with about 2 million sequences on a machine with 32 GB of RAM.

smith-waterman  (writing parameters)
  vsearch: 1891989 / 1969347 v annotations (77358 failed) with 75 v genes in 2121.8 sec
    running 5 procs for 1964148 seqs
    running 8 procs for 28202 seqs

Traceback (most recent call last):
  File "/home/schrammca/bin/partis", line 404, in <module>
    args.func(args)
  File "/home/schrammca/bin/partis", line 193, in run_partitiondriver
    parter.run(actions)
  File "/home/schrammca/partis/partis/python/partitiondriver.py", line 104, in run
    self.action_fcns[tmpaction]()
  File "/home/schrammca/partis/partis/python/partitiondriver.py", line 244, in cache_parameters
    self.run_waterer(count_parameters=True, write_parameters=True, write_cachefile=True, dbg_str='writing parameters')
  File "/home/schrammca/partis/partis/python/partitiondriver.py", line 190, in run_waterer
    waterer.run(cachefname if write_cachefile else None)
  File "/home/schrammca/partis/partis/python/waterer.py", line 104, in run
    self.execute_commands(base_infname, base_outfname, mismatches, gap_opens)
  File "/home/schrammca/partis/partis/python/waterer.py", line 288, in execute_commands
    utils.run_cmds(cmdfos, batch_system=self.args.batch_system, batch_options=self.args.batch_options, batch_config_fname=self.args.batch_config_fname)
  File "/home/schrammca/partis/partis/python/utils.py", line 2283, in run_cmds
    procs.append(run_cmd(cmdfos[iproc], batch_system=batch_system, batch_options=batch_options))
  File "/home/schrammca/partis/partis/python/utils.py", line 2272, in run_cmd
    env=cmdfo['env'])
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1235, in _execute_child
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
psathyrella commented 5 years ago

uf, and it's during parameter caching too, which isn't usually that memory hungry. I'm somewhat guessing, but it crashed while fork()ing during the second sw step -- the one for failed or shm-indel'd seqs, which runs with more procs. That looks weird, but it's because I have to send ig-sw separate procs for different match/mismatch/gap scores, and I can't split them up properly until the first run finishes and I have the indel info. Anyway, that suggests that what used too much memory was building the main sw info dict while processing the sequences that passed the first step. Which means it's probably not the subprocesses using the memory, so changing the number of procs probably won't help. I'm sure there are memory optimizations possible for the main sw info dict, but I think they'd involve serious tradeoffs with computation time.

Anyway, I think you'll have to subsample if that's the only machine available (the server I usually run big samples on has 200GB, so I only run out of memory if I'm running a bunch of samples at once). For parameter caching, there should be zero benefit whatsoever to using 2 million sequences instead of, I dunno, 1 million or 500k. Use --n-random-queries (don't use --n-max-queries unless you're sure your file is randomly sorted, since it just takes the first N). If you're partitioning afterward, you'll probably also want --n-random-queries. The careful thing to do for both of those steps would be to run several different subsamples and compare results (I sometimes do that even when I can run on the whole sample, to get at least some idea of the uncertainties). Another option that can be useful is --istartstop, which makes it easy to slice a file into non-overlapping subsets.
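
Roughly like this, for instance -- the paths, proc count, and output name are just placeholders, and everything other than --n-random-queries follows the usual cache-parameters/partition invocation, so double-check the option spellings against the manual for your version:

# cache parameters on a 500k random subsample (hypothetical paths and values)
partis cache-parameters --infname /path/to/seqs.fa --parameter-dir _output/my-sample \
    --n-random-queries 500000 --n-procs 8

# then partition, also subsampled, reusing those cached parameters
partis partition --infname /path/to/seqs.fa --parameter-dir _output/my-sample \
    --n-random-queries 500000 --n-procs 8 --outfname _output/my-sample/partition.yaml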

scharch commented 5 years ago

OK, I will try subsetting for now; I think that will be sufficient for my purposes. One of these days I will get partis working on our cluster....

psathyrella commented 5 years ago

huh, darn, so this didn't get it working on your batch system? Well, lmk if there's anything I can fix on my end.

scharch commented 5 years ago

I never got that far. I ran into some problem (I don't even remember what) with the prereqs and decided it wasn't worth my time. Odds are it was a pretty trivial obstacle.

Do I need to worry about loss of accuracy if I annotate using the --istartstop option? I assume if I am doing 500k at a time it should be fine, but thought I'd check.

psathyrella commented 5 years ago

It depends, but short answer is that 500k is vastly more than you need for most purposes.

This plot from the annotation paper gives a good idea; it suggests that accuracy doesn't improve much even beyond 5k:

[plot from the annotation paper: annotation accuracy vs. sample size]

That number (5k) is of course entirely dependent on the clonal structure of the sample -- your accuracy will still kind of suck with a million sequences if they're all from the same family -- but it should be roughly true. I can't believe I didn't put the info in this plot, but I think this was with an average of 5 seqs per family.

Germline inference (which we didn't have when we made this plot) is necessary to squeeze out the last few bases of naive accuracy within the V, and for that, much higher sample sizes than 5k are needed. That depends on the clonal structure and also on the SHM rate, the latter because it's only the first few bins (0-5 mutations in V) in the mutation accumulation plot that give you sensitivity to most alleles. 500k is still an extremely large sample in most cases (as long as it's not super clonal), so it should be fine. Running on several different subsets with --istartstop would tell you how much difference it's making, of course.
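
Something like this, roughly, if it helps -- the start:stop argument format and the names here are just illustrative, so check them against the docs:

# annotate three non-overlapping 500k slices and compare the resulting annotations (hypothetical paths)
for istart in 0 500000 1000000; do
    istop=$((istart + 500000))
    partis annotate --infname /path/to/seqs.fa --parameter-dir _output/my-sample \
        --istartstop $istart:$istop --outfname _output/my-sample/annotations-$istart.yaml
done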