psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Error in bcrham forward: sequences different length in Glomerator::NaiveHfrac #307

Closed nannabarnkob closed 3 years ago

nannabarnkob commented 3 years ago

Hi there

I get an error that I can't wrap my head around. Hope you guys can help!

I am running partition:

partis partition --n-procs 24 --locus igk --initial-germline-dir $longpath/germlines/omni_rat_2 --infname $longpath/combined_VL_onlyID_fixed.fa --outfname $longpath/partis_output/combined_VL_onlyID_fixed.yaml --parameter-dir $longpath/partis_params/combined_VL_onlyID --workdir /tmp/partis/99888 --leave-default-germline

And I get the following error:

Traceback (most recent call last):
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/bin/partis", line 488, in <module>
    args.func(args)
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/bin/partis", line 222, in run_partitiondriver
    parter.run(actions)
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/python/partitiondriver.py", line 120, in run
    self.action_fcns[tmpaction]()
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/python/partitiondriver.py", line 528, in partition
    cpath = self.cluster_with_bcrham()
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/python/partitiondriver.py", line 756, in cluster_with_bcrham
    cpath, _, _ = self.run_hmm('forward', self.sub_param_dir, n_procs=n_procs, partition=cpath.partitions[cpath.i_best_minus_x], shuffle_input=True)  # note that this annihilates the old <cpath>, which is a memory optimization (but we write all of them to the cpath progress dir)
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/python/partitiondriver.py", line 1135, in run_hmm
    self.execute(cmd_str, n_procs)
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/python/partitiondriver.py", line 1106, in execute
    utils.run_cmds(cmdfos, batch_system=self.args.batch_system, batch_options=self.args.batch_options, batch_config_fname=self.args.batch_config_fname, debug='print' if self.args.debug else None)
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/python/utils.py", line 3270, in run_cmds
    status = finish_process(iproc, procs, n_tries_list[iproc], cmdfos[iproc], n_max_tries, dbgfo=cmdfos[iproc].get('dbgfo'), batch_system=batch_system, debug=debug, ignore_stderr=ignore_stderr, clean_on_success=clean_on_success, allow_failure=allow_failure)
  File "/home/projects/cu_10049/apps/symphogen/partis2020b/partis/python/utils.py", line 3388, in finish_process
    raise Exception(failstr)
Exception: exceeded max number of tries (1 >= 1) for subprocess with command:
        /home/projects/cu_10049/apps/symphogen/partis2020b/partis/packages/ham/bcrham --algorithm forward --hmmdir /home/projects/cu_10049/data/NGS_folder/TargetM_2/10X/analysis_with_LN/DM_1040-1043/partis_params/combined_VL_onlyID/hmm/hmms --datadir /tmp/partis/99888/germline-sets --infile /tmp/partis/99888/istep-1/hmm-11/hmm_input.csv --outfile /tmp/partis/99888/istep-1/hmm-11/hmm_output.csv --locus igk --random-seed 1600686760 --only-cache-new-vals --input-cachefname /tmp/partis/99888/istep-1/hmm-11/hmm_cached_info.csv --output-cachefname /tmp/partis/99888/istep-1/hmm-11/hmm_cached_info.csv --partition --max-logprob-drop 5.0 --hamming-fraction-bound-lo 0.015 --hamming-fraction-bound-hi 0.0678319109462 --logprob-ratio-threshold 18.0 --biggest-naive-seq-cluster-to-calculate 15 --biggest-logprob-cluster-to-calculate 15 --n-partitions-to-write 10 --ambig-base N
        stderr:           /tmp/partis/99888/istep-1/hmm-11/err
            terminate called after throwing an instance of 'std::runtime_error'
              what():  sequences different length in Glomerator::NaiveHfrac
                328: NNNNNNNNNNNNNNNNNNGACATCCAGATG... 

Additionally, a binary file named core.34552 is not deleted (I assume it usually is). I don't know if that info is relevant :)

I tried running it with an older version of partis (0.14.0) we have lying around and it worked properly, so I am a bit confused.

Best wishes, Nanna

psathyrella commented 3 years ago

I think the most likely explanation is that there are some old files lying around in the workdir you're specifying that are incompatible with the current run. Since the default is to choose a new random workdir for each run, partis doesn't try to clean out any old files, so if you're specifying the workdir by hand it assumes you're doing that yourself. What you're specifying there is basically the default, except with a hard-coded random integer, so probably just don't set --workdir?

nannabarnkob commented 3 years ago

Makes sense, but before posting here I also tried deleting the specific workdir and the partis params to do a "clean" run, and it didn't help much. I tried again today without setting --workdir and I still get the same error...

psathyrella commented 3 years ago

hmm, oops, sorry. So what's happening is that the C++ hmm code (which is what's crashing) needs any sequences that it compares to be the same length. It only compares sequences with the same cdr3 length, so the python code that calls it groups all sequences into cdr3-length classes and pads them with Ns so they have the same number of bases before the cyst and after the tryp/phen. So the sequence for a given uid that gets passed to the C++ is, in general, (trivially) different depending on the other sequences in the sample: if, say, we add a new sequence with the same cdr3 length and a super long V, then we'll need to pad a few more Ns onto all the sequences in that cdr3-length class.
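
Very roughly, the padding idea looks something like this (just an illustrative sketch, not the actual partis code; the field names and numbers are made up):

from collections import defaultdict

# toy records: each has the sequence, the number of bases before the conserved cyst,
# the number after the tryp/phen, and the cdr3 length (all values invented)
all_seqs = [
    {'name': 'a', 'seq': 'GACATCCAGATG',    'n_before_cyst': 8,  'n_after_tryp': 4, 'cdr3_length': 33},
    {'name': 'b', 'seq': 'CAGATGACCCAGTCT', 'n_before_cyst': 10, 'n_after_tryp': 5, 'cdr3_length': 33},
    {'name': 'c', 'seq': 'GATATTGTGATG',    'n_before_cyst': 7,  'n_after_tryp': 5, 'cdr3_length': 36},
]

def pad_to_common_length(seqs):
    # within one cdr3-length class, pad with Ns so every sequence ends up with the same
    # number of bases before cyst and after tryp/phen (and hence the same total length)
    max_left = max(s['n_before_cyst'] for s in seqs)
    max_right = max(s['n_after_tryp'] for s in seqs)
    return ['N' * (max_left - s['n_before_cyst']) + s['seq'] + 'N' * (max_right - s['n_after_tryp'])
            for s in seqs]

# group into cdr3-length classes, then pad within each class; adding a sequence with a
# longer V to a class forces extra Ns onto everything already in that class
classes = defaultdict(list)
for s in all_seqs:
    classes[s['cdr3_length']].append(s)
padded = {length: pad_to_common_length(members) for length, members in classes.items()}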

So my next guess as to why this is happening is that the parameters were inferred on a different set of sequences. The most likely culprit would be the file in which we cache smith-waterman results (called sw-cache-<hash>.yaml in the parameter dir). The <hash> is a hash of the uids that went into it, so it'll re-run sw if the uids change, but that isn't an infinitely clever way of figuring out whether it needs to rerun. If the parameters were inferred on only a slightly different set of sequences (so you think the parameters are valid, and it's just the sw cache file that's a bit out of whack), you can force it to just rerun sw by setting --sw-cachefname, when you partition, to something random that doesn't exist, although it'd be safer to re-infer parameters.
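
For example, something along these lines (the cache file name here is just an arbitrary path that doesn't exist, which is what forces the sw rerun; the other paths are placeholders):

partis partition --locus igk --infname your-seqs.fa --parameter-dir your-param-dir --sw-cachefname /tmp/nonexistent-sw-cache.yaml --outfname partition-output.yaml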

If you'd be able to paste the whole stdout, that would help figure out whether this is what's happening.

nannabarnkob commented 3 years ago

No worries, thanks for the elaborate answer! It's kind of cool to know how/why things are named the way they are and how partis knows stuff. I have a hard time seeing why the parameters would be inferred on a different set of sequences when cache-parameters and partition are run together instead of as two separate calls. Does partis do subsampling, or have a max number of sequences it can work with while caching parameters?

Now that you mention it, we do maybe have something a bit odd in the stdout:
spent much longer waiting for bcrham (127.5s) than bcrham reported taking (max per-proc time 5.2s) and proc 11 try 1 failed with exit code -6, and output is missing: /tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_output.csv


  parameter dir does not exist, so caching a new set of parameters before running action 'partition': _output/_home_projects_cu_10049_data_NGS_folder_TargetM_2_XXX_analysis_with_XX_DM_1040-1043_combined_VL_onlyID_fixed
caching parameters
smith-waterman  (writing parameters)
  vsearch: 37730 / 37730 v annotations (0 failed) with 18 v genes in 18.7 sec
    running 24 procs for 37730 seqs
    running 25 procs for 21 seqs
      info for 37730 / 37730 = 1.000   (0 failed)
      kept 587 (0.016) unproductive
        writing sw results to _output/_home_projects_cu_10049_data_NGS_folder_TargetM_2_XXX_analysis_with_XX_DM_1040-1043_combined_VL_onlyID_fixed/sw-cache-660904706411338117.yaml
    writing parameters to _output/_home_projects_cu_10049_data_NGS_folder_TargetM_2_XXX_analysis_with_XX_DM_1040-1043_combined_VL_onlyID_fixed/sw (4.8 sec)
    water time: 159.6  (ig-sw 29.1  processing 0.4)
  writing hmms (2.3 sec)
hmm
    skipping matches from 5 genes without enough counts: kv2-29 kj402 kj203 kj202 kv6-21
    prepare_for_hmm: (3.0 sec)
    running 24 procs
                    calcd:         vtb 37730        fwd     0
             min-max time:  4.9 - 5.2 sec
    spent much longer waiting for bcrham (127.5s) than bcrham reported taking (max per-proc time 5.2s)
    read output
    writing parameters to _output/_home_projects_cu_10049_data_NGS_folder_TargetM_2_XXX_analysis_with_XX_DM_1040-1043_combined_VL_onlyID_fixed/hmm (2.6 sec)
        processed 37730 hmm output lines with 37730 sequences in 37730 events  (0 failures)
         infra time: 52.1
      hmm step time: 179.6
  writing hmms (2.2 sec)
partitioning
hmm
caching all 37730 naive sequences
    prepare_for_hmm: (3.0 sec)
    running 38 procs
                no/empty cache file
                    calcd:         vtb 37730        fwd     0
                   merged:       hfrac     0     lratio     0
             min-max time:  3.5 - 3.8 sec
    spent much longer waiting for bcrham (144.7s) than bcrham reported taking (max per-proc time 3.8s)
         infra time: 3.7
      hmm step time: 148.3
   collapsed 37730 queries into 978 clusters with identical naive seqs (0.1 sec)
978 clusters with 24 procs
    prepare_for_hmm: (2.3 sec)
       naive hfrac bounds: 0.015 0.068
    running 24 procs
    proc 11 try 1 failed with exit code -6, and output is missing: /tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_output.csv
        progress file (/tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_output.csv.progress):

Traceback (most recent call last):
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/bin/partis", line 739, in <module>
    args.func(args)
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/bin/partis", line 250, in run_partitiondriver
    parter.run(actions)
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/python/partitiondriver.py", line 120, in run
    self.action_fcns[tmpaction]()
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/python/partitiondriver.py", line 529, in partition
    cpath = self.cluster_with_bcrham()
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/python/partitiondriver.py", line 757, in cluster_with_bcrham
    cpath, _, _ = self.run_hmm('forward', self.sub_param_dir, n_procs=n_procs, partition=cpath.partitions[cpath.i_best_minus_x], shuffle_input=True)  # note that this annihilates the old <cpath>, which is a memory optimization (but we write all of them to the cpath progress dir)
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/python/partitiondriver.py", line 1117, in run_hmm
    self.execute(cmd_str, n_procs)
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/python/partitiondriver.py", line 1088, in execute
    utils.run_cmds(cmdfos, batch_system=self.args.batch_system, batch_options=self.args.batch_options, batch_config_fname=self.args.batch_config_fname, debug='print' if self.args.debug else None)
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/python/utils.py", line 3399, in run_cmds
    status = finish_process(iproc, procs, n_tries_list[iproc], cmdfos[iproc], n_max_tries, dbgfo=cmdfos[iproc].get('dbgfo'), batch_system=batch_system, debug=debug, ignore_stderr=ignore_stderr, clean_on_success=clean_on_success, allow_failure=allow_failure)
  File "/home/projects/cu_10049/apps/symphogen/partis2020c/partis/python/utils.py", line 3517, in finish_process
    raise Exception(failstr)
Exception: exceeded max number of tries (1 >= 1) for subprocess with command:
        /home/projects/cu_10049/apps/symphogen/partis2020c/partis/packages/ham/bcrham --algorithm forward --hmmdir /home/projects/cu_10049/data/NGS_folder/TargetM_2/XXX/analysis_with_XX/DM_1040-1043/_output/_home_projects_cu_10049_data_NGS_folder_TargetM_2_XXX_analysis_with_XX_DM_1040-1043_combined_VL_onlyID_fixed/hmm/hmms --datadir /tmp/nanbar/hmms/611360/germline-sets --infile /tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_input.csv --outfile /tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_output.csv --locus igk --random-seed 1601032113 --only-cache-new-vals --input-cachefname /tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_cached_info.csv --output-cachefname /tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_cached_info.csv --partition --max-logprob-drop 5.0 --hamming-fraction-bound-lo 0.015 --hamming-fraction-bound-hi 0.0678319109462 --logprob-ratio-threshold 18.0 --biggest-naive-seq-cluster-to-calculate 15 --biggest-logprob-cluster-to-calculate 15 --n-partitions-to-write 10 --ambig-base N
        stderr:           /tmp/nanbar/hmms/611360/istep-0/hmm-11/err
            terminate called after throwing an instance of 'std::runtime_error'
              what():  sequences different length in Glomerator::NaiveHfrac
                328: NNNNNNNNNNNNNNNNNNGACATCCAGATGACCCAGTCTCCTTCCACCCTGTCTGCATCTGTAGGAGACAGAGTCACCATCACTTGCCGGGCCAGTCAGAGTATTAGTAG.....GACCAAGGTGGAAATCAAAC
                0:
psathyrella commented 3 years ago

""" I have a hard time seeing why parameters should be figured out on a different set of sequences when cache parameters and partition is run together instead of in two separate calls. Does partis do subsampling or have a max number of sequences it can work with while caching parameters? """ I don't think I'm quite understanding, but parameters should generally be cached on the same sequences as you want to partition. The stuff we're running into (like using a hash of uids in the sw cache file name) are various ways that it tries to guess what it was supposed to do in cases where it wasn't run on exactly the same sequences. I almost always run cache-parameters separately because usually I either have a bunch of cache-parameters-specific options or I'll be running several different partition steps on the same parameters (e.g. several different seed partition runs with different seeds). In any case none of what i was guessing above seems to apply, since it's inferring parameters here (I was thinking they were already there).

By default it doesn't do any subsampling. There isn't much benefit in terms of accuracy to caching parameters with more than, I dunno, say 100k sequences, so you can use --n-random-queries or --n-max-queries to tell it to subsample if you want it to run faster, although I generally don't do that. Parameter caching is only linear in the number of sequences, and it never takes that long anyway.
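
e.g. something like this (50000 is just an arbitrary number, and the paths are placeholders):

partis cache-parameters --n-random-queries 50000 --locus igk --infname your-seqs.fa --parameter-dir your-param-dir --n-procs 24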

""" spent much longer waiting for bcrham (127.5s) than bcrham reported taking (max per-proc time 5.2s) """ This isn't necessarily worrisome, it all depends what sort of batch/disk/hardware system you're using. This is saying that the most any individual bcrham/c++ process itself took was 5s, but that the python process that's driving things had to wait for a couple minutes for all the subprocs to finish. The most common reasons for this are 1) you ran way more procs than you have cores available, e.g. with --n-procs 100 when you only have a laptop, 2) lots of lag in starting jobs on a batch system, or super laggy i/o from a crappy networked file system.

""" proc 11 try 1 failed with exit code -6, and output is missing: /tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_output.csv """ this is worse tho, this is probably why it's crashing. If you're on a batch system it automatically restarts jobs that fail, but i think by default if you're just running locally it only tries once, since you don't really expect jobs to fail for no reason/cause the batch system hamsters were tired. I think -6 might be that it ran out of memory? The .core file is also the kind of thing the OS I think typically drops when it killed a process that was using too much memory. If your machine has 24 cores I'd expect it could handle partitioning only 40k sequences, but the first thing to try is caching parameters separately first (i.e. just don't delete the ones that were successfully cached above), since it probably isn't clearing all the memory from that before starting to partition.

psathyrella commented 3 years ago

Other things you can do to make partitioning faster or use less memory are listed here; I'm guessing vsearch partition or subsampling would be best for your case.

psathyrella commented 3 years ago

oh, and the progress file it mentions in your output (/tmp/nanbar/hmms/611360/istep-0/hmm-11/hmm_output.csv.progress) will give some details about what was going on. There's a bunch of info on it here.

nannabarnkob commented 3 years ago

Hi there

Thanks for the very elaborate answer. I didn't know the problem could be related to memory. Sorry, I hadn't read the section on subsampling before. I tried requesting much more memory for my job (we're running partis on a cluster), but unfortunately that didn't solve the problem on its own. I tried looking into the *progress file but it was empty; the remaining files weren't, though, and I did find the error in the err file. Couldn't get much info from the others.

So I looked into the options you listed under partition and tried --small-clusters-to-ignore 1-10 and --max-cluster-size 10000 - I was able to run partis successfully with a combination of these two parameters. I will look into it more later to see what's going on.
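
For reference, the call that ended up working was roughly along these lines (approximate, with the same paths as before):

partis partition --n-procs 24 --locus igk --initial-germline-dir $longpath/germlines/omni_rat_2 --infname $longpath/combined_VL_onlyID_fixed.fa --outfname $longpath/partis_output/combined_VL_onlyID_fixed.yaml --parameter-dir $longpath/partis_params/combined_VL_onlyID --leave-default-germline --small-clusters-to-ignore 1-10 --max-cluster-size 10000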

nannabarnkob commented 3 years ago

Hi again

Just wanted to let you know that I think you can close this issue and get it out of your sight. I tried some different things, used a smaller set of sequences, and merged clusters afterwards if they had the same naive sequence. Our compute cluster has been acting weird lately with everything related to C++, so it may be on our end anyway.
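
The post-hoc merging was basically just grouping clusters by their inferred naive sequence, something like this (a simplified sketch with made-up data, not our actual script):

from collections import defaultdict

# toy clusters: (list of sequence ids, inferred naive sequence for that cluster)
clusters = [
    (['seq1', 'seq2'], 'NNGACATCCAGATGACCCAG'),
    (['seq3'],         'NNGACATCCAGATGACCCAG'),  # same naive seq as the first cluster
    (['seq4', 'seq5'], 'NNGATATTGTGATGACTCAG'),
]

# merge clusters whose naive sequences are identical
by_naive = defaultdict(list)
for uids, naive_seq in clusters:
    by_naive[naive_seq].extend(uids)
merged = list(by_naive.values())
print(merged)  # [['seq1', 'seq2', 'seq3'], ['seq4', 'seq5']]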

Congrats on your publication! I think I might have to return with some questions on how to input your own phylogenetic trees.

psathyrella commented 3 years ago

wooooo!

and thanks!