psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0
57 stars 34 forks source link

Partis Killed - no other error message #310

Closed lindsly closed 3 years ago

lindsly commented 3 years ago

Dear Partis team,

When running partition on one of my fasta files, Partis appears to spend a significant amount of time on one of the steps. I came back to the run a few hours after starting Partis and it got to the step shown below.

I tried to copy something from the command prompt, which caused Partis to be killed. I'm not sure if it was killed by something that I did, or if there was an error that wasn't causing the program to be killed until I interacted with the command prompt. Any insight on how to address this would be great!

(base) root@af7131fba3e1:/partis# ./bin/partis partition --infname /host/home/Desktop/partis_fa_rc/ova14_rc.fasta --outfname /host/home/Desktop/ova14_partis_out/ova14-partition.yaml --n-procs 12 --species mouse --plotdir /host/home/Desktop/ova14_partis_out/plots/
  non-human species 'mouse', turning on allele clustering
  note: --parameter-dir not set, so using default: _output/_host_home_Desktop_partis_fa_rc_ova14_rc
partitioning
smith-waterman
  vsearch: 23962 / 24227 v annotations (265 failed) with 75 v genes in 25.9 sec
        reading sw results from _output/_host_home_Desktop_partis_fa_rc_ova14_rc/sw-cache-5985935001742782714.yaml
      info for 24159 / 24227 = 0.997   (removed: 68 failed, 0 duplicates)
      kept 23683 (0.978) unproductive
        water time: 67.7
hmm
caching all 24159 naive sequences
        hmm prep time: 4.4
    running 2 procs
                no/empty cache file
                    calcd:         vtb 24159        fwd     0
                   merged:       hfrac     0     lratio     0
             min-max time:  218.9 - 219.3 sec
         infra time: 4.9
      hmm step time: 225.6
    warning: missing 45 queries from hmm cache file (using sw naive sequence instead): 66436a86-d76e-43f9-b178-c5d039b97ca4 ea6e6f92-11fc-4073-a61c-c72ce6c8d888 c2352dd9-1ce6-44e8-8686-8fdc106f04c6 391bccf9-efc0-4eb2-ab92-1eb08c1e8a80 89165c2b-69b6-4871-b5a1-9443824d0757 38a2da54-3ae5-4ae6-af10-0effefbf48b6 48b0e9dc-d5b7-4642-af23-bf3366b49e99 83babfa5-b5d8-443e-aa0f-33a1d8aeaecf 072cd8cc-084f-48b5-bbdb-7098b477a2af 7ab4c3a1-e131-45c9-8697-2e2228d38db9 65e74287-b854-4fa6-b705-727e7ec2e142 cfc578a1-7ca4-4c59-ba7d-1ddd3010f9e7 bb97e738-0672-440b-af36-d90020880483 a84038ed-2fa7-4c15-8ac8-8b91dff3e673 ea79f75e-a9f7-4fe7-8f7b-3badadce5a3d 8f7f6733-2283-4da6-a5ba-3184037b9289 e4925a3f-3bed-4db7-be58-fb8443d66a2c 4f929124-352b-4404-ad94-3a2baad02098 30564d1e-3b28-43c1-ae98-39700e011f3b 3797c5f4-9a15-4274-857d-e654340fc760 ce6f56d2-5043-4095-b1b2-e8acd0675222 b2423939-0916-47a4-8ba5-0661eb267e35 2c10e208-e468-43a9-9063-93325b388f09 1605ffb4-6244-439a-967b-43c6ad576a63 741af33d-2b06-4afe-ba1c-3bb4d4e0fac1 b4eaf4a0-f2ee-415e-80d2-4b91cf0ecb04 6e30af0c-bcd5-4e69-8fb2-a799b485adba 0b10e933-8965-4e22-acee-601e48f0eaa2 15bb2798-503d-42ff-9a06-184984072b34 ddf2a29a-36c5-48bb-9e0c-028d970b3131 2e004ea8-67f5-457e-9d4f-4abf9582ad12 3548f617-68ca-44f4-bcf7-793c135b7276 7e4ac90c-b8d6-4f84-b5fb-44d0cc2f22f8 a822a7d9-cffa-4539-96af-6e2bb2f7e8f2 ebd74aa7-5e60-4077-bb89-9ed792411a70 b641dafd-e1cf-4871-a1e6-2b6caa368d56 8f3ee381-5b85-4043-9462-ba985b119840 3d7e6232-fd03-4858-9650-6979bb87cdc7 a981dc69-80a5-4e81-8f6e-06762f138873 1cfba1d3-2cdf-450b-be19-00f509967db7 64f091a5-666d-4335-860e-3dc5421e5026 d7807064-fbb3-4ffe-bfd8-8571a1bb1b2b 98d55358-8903-4b59-bc40-bbaa9d601447 77d60e0e-f3d2-45fc-bd16-1e13ada7dc7b 224af098-df1f-49b3-b430-2943723ba8c4
   collapsed 24159 queries into 13251 clusters with identical naive seqs (0.1 sec)
13251 clusters with 12 procs
        hmm prep time: 4.2
       naive hfrac bounds: 0.015 0.093
    running 12 procs
          read from cache:  naive-seqs 24114   logprobs     0
                    calcd:         vtb  3186        fwd 15852
                   merged:       hfrac   880     lratio   548
             min-max time:  160.3 - 218.0 sec
    spent much longer waiting for bcrham (1150.2s) than bcrham reported taking (max per-proc time 218.0s)
         infra time: 34.3
      hmm step time: 1184.6
11823 clusters with 12 procs
        hmm prep time: 4.6
    running 12 procs
Killed
(base) root@af7131fba3e1:/partis#

Some additional information/questions

I am running Partis using Docker on a Windows 10 machine with 6 cores (12 logical processors). I have used the same input command for two other files (which were smaller) and it completed running in a matter of minutes.

Sorry about the title - I had another question regarding MDS (why I was copying from the command prompt in the first place) but I realized I didn't have the R package installed.

psathyrella commented 3 years ago

I would suspect that the "Killed" indicates it got a signal from stdin equivalent to a ctrl-C or something, although I have no experience running things on windows so I'm not sure what could have caused that. I mean in theory i'd imagine you're entirely in docker so the fact you're on windows wouldn't matter, but in practice maybe mouse copy/pasting sent it a term signal or something. The "spent much longer" warning is quite informative about why it was slow, although I don't know if that's related to it being killed. This is saying that the main python proc spit out 12 sub procs to run bcrham, and each of those reported taking only a couple hundred seconds, whereas the whole step of writing their input and processing their output took 1150 seconds, which suggests maybe your i/o is super slow, or maybe it's completely out of memory and swapping like crazy.

If that isn't it -- 24k sequences isn't very many, and while partition time is quite hard to predict since it depends so much on the repertoire structure, on 15 cores on a typical server I'd expect that to take less than an hour. It writes a "progress" file while it's doing the clustering steps that should give you a good idea how it's doing. Another thing you can do is run on a subsample just to test that it's finishing quickly + ok.

Yeah the plotting install is unfortunate, hopefully you found this. I'm actually just decided to finally make a separate multi step docker build and switch to quay so I can have a separate docker image for plotting, although that won't be finished for a bit.

lindsly commented 3 years ago

Thank you for your reply! I tried running the same command again without ever touching the command prompt and I got the Killed message again. Also, I tested a ctrl+c interrupt and it outputs the following (just for your reference):

^CTraceback (most recent call last):
  File "./bin/partis", line 805, in <module>
    args.func(args)
  File "./bin/partis", line 261, in run_partitiondriver
    parter.run(actions)
  File "/partis/python/partitiondriver.py", line 125, in run
    self.action_fcns[tmpaction]()
  File "/partis/python/partitiondriver.py", line 522, in partition
    self.run_waterer(look_for_cachefile=not self.args.write_sw_cachefile, write_cachefile=self.args.write_sw_cachefile, count_parameters=self.args.count_parameters)  # run smith-waterman
  File "/partis/python/partitiondriver.py", line 198, in run_waterer
    self.set_vsearch_info(get_annotations=True)
  File "/partis/python/partitiondriver.py", line 247, in set_vsearch_info
    self.vs_info = utils.run_vsearch('search', seqs, self.args.workdir + '/vsearch', threshold=0.3, glfo=self.glfo, print_time=True, vsearch_binary=self.args.vsearch_binary, get_annotations=get_annotations, no_indels=self.args.no_indels)
  File "/partis/python/utils.py", line 4929, in run_vsearch
    run_cmds(cmdfos)
  File "/partis/python/utils.py", line 3509, in run_cmds
    time.sleep(per_proc_sleep_time)
KeyboardInterrupt

My output was set to a folder on my host system (outside of Docker) so that may be what's causing the slowdown. I am running it again now with the output set within docker and appears (so far) to be working. I will also monitor the memory usage as it goes.

Attempting to install plotting packages led to a lot dependency issues so I may just hold off on that for now.

psathyrella commented 3 years ago

ok, great that's the same ctrl-c message I'm used to. Then I'd guess it's a memory issue, when I've gotten similar things it was the OS's out-of-memory killer killing it. I'm surprised if that's the problem, since it doesn't usually use much memory on 24k sequences, at least compared to what's on a box with 12 cores, but I don't know how much memory is really there. There's a lot of different ways to optimize/approximate for speed and memory though.

yeah, sorry about the dependency issues, I'll get the new docker image up as soon as I can. Meanwhile someone else kept track of how they got plotting working in docker last week, so this will likely fix things (the difference to what's in the manual i think is just the numpy update and the explicit list of bios2mds deps):

    apt-get install -y xorg libx11-dev libglu1-mesa-dev r-cran-rgl
    conda install -y -cr r-rgl r-essentials
    conda update -y numpy
    R --vanilla --slave -e 'install.packages(c("bios2mds","picante","pspline","deSolve","igraph","TESS","fpc","pvclust","corpcor","phytools","mvMORPH","geiger","mvtnorm","glassoFast","Rmpfr"), repos="http://cran.rstudio.com/")'
    mkdir -p packages/RPANDA/lib
    R CMD INSTALL -l packages/RPANDA/lib packages/RPANDA/
lindsly commented 3 years ago

Some updates and another test:

I was unable to run the full fasta even with the modified output folder location (killed again). I was able to successfully partition the fasta using a subsample of 5000 sequences though so I'm guessing that memory is the issue. My machine has 32GB RAM available so I wouldn't expect for this to be a problem. Also, I have kept up task manager while running the full file and didn't notice any huge spikes in memory usage.

I also tried another file which is larger (47k sequences) to see if it was the fasta file itself or really a memory problem but before it had a chance to be killed, I got the following exception:

(base) root@af7131fba3e1:/partis# ./bin/partis partition --infname /host/home/Desktop/partis_fa_rc/lys14_rc.fasta --outfname lys14_partis_out/lys14-partition.yaml --n-procs 12 --species mouse --small-clusters-to-ignore 1-10 --paramete
r-dir lys14-full-parameter-dir
  non-human species 'mouse', turning on allele clustering
  parameter dir does not exist, so caching a new set of parameters before running action 'partition': lys14-full-parameter-dir
caching parameters
  vsearch: 46479 / 47054 v annotations (575 failed) with 183 v genes in 31.2 sec
    keeping 62 / 261 v genes
smith-waterman  (new-allele clustering)
  vsearch: 46444 / 47054 v annotations (610 failed) with 62 v genes in 62.9 sec
    running 12 procs for 47054 seqs
Traceback (most recent call last):
  File "./bin/partis", line 805, in <module>
    args.func(args)
  File "./bin/partis", line 261, in run_partitiondriver
    parter.run(actions)
  File "/partis/python/partitiondriver.py", line 125, in run
    self.action_fcns[tmpaction]()
  File "/partis/python/partitiondriver.py", line 264, in cache_parameters
    self.run_waterer(dbg_str='new-allele clustering')
  File "/partis/python/partitiondriver.py", line 221, in run_waterer
    waterer.run(cachefname if write_cachefile else None)
  File "/partis/python/waterer.py", line 108, in run
    self.read_output(base_outfname, len(mismatches))
  File "/partis/python/waterer.py", line 490, in read_output
    self.summarize_query(qinfo)  # returns before adding to <self.info> if it thinks we should rerun the query
  File "/partis/python/waterer.py", line 979, in summarize_query
    indelfo = self.combine_indels(qinfo, best)  # the next time through, when we're writing ig-sw input, we look to see if each query is in <self.info['indels']>, and if it is we pass ig-sw the indel-reversed sequence, rather than the <input_info> sequence
  File "/partis/python/waterer.py", line 1559, in combine_indels
    return indelutils.combine_indels(regional_indelfos, full_qrseq, qrbounds, uid=qinfo['name'], debug=debug)
  File "/partis/python/indelutils.py", line 645, in combine_indels
    raise Exception('%sqr_gap_seq non-gap length %d not the same as qrbound length %d in %s region indelfo' % ('%s: ' % uid if uid is not None else '', utils.non_gap_len(rfo['qr_gap_seq']), qrbounds[region][1] - qrbounds[region][0], region))
Exception: a43659fe-3301-40c5-93b2-cda064707bde: qr_gap_seq non-gap length 249 not the same as qrbound length 248 in v region indelfo

I saw that there was another issue in 2018 (link) with this same exception, but it looks like it was successfully addressed. Any ideas what may be going wrong? Here is the read that causes the exception:

>a43659fe-3301-40c5-93b2-cda064707bde
GTGACTGGAGTTCAGACGTGCTCTTCCGATCTGGGGACTTCAGTGAAGATGTCCTGTAAGGCTTCTGGATACACCTTCACTAACTACTGGATAGGTTAGCAAAGCAGAGGCCTGGACATGGCCTTGAGTGGATTGGAGATATTTACCCTGGAGGTGCTTATATTAACTACAATGAAGTTCAAGGGCAAGGCCACACTGACTGCAGACAAATCCTCCAGCACAGCCTCCATGCAGTTCAGCAGCCTGACATCTGAGGACTCTGCCATCTATTACTGTGCAAGAAAGAATTACTACGGTAATACCTACTTTGACTACCGGGGCCAAGGCACCACTCAGTCTCCTCAGCC
lindsly commented 3 years ago

I wasn't sure exactly which log file to look through but here is the info from my latest run on the first full fasta (~24k sequences).

Command (used --n-procs 6 instead of 12 to see if that would help): ./bin/partis partition --infname /host/home/Desktop/partis_fa_rc/ova14_rc.fasta --outfname _output/ova14_output/ova14-partition.yaml --n-procs 6 --species mouse --small-clusters-to-ignore 1-10 --parameter-dir _output/ova14_output/parameter-dir

(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561# ls
cluster-path-progress  germline-sets  hmm_cached_info.csv  hmm_input.csv  istep-0
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561# cd istep-0
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0# ls
hmm-0  hmm-1  hmm-2  hmm-3  hmm-4  hmm-5
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0# cd hmm-5
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0/hmm-5# ls
err  hmm_cached_info.csv  hmm_input.csv  hmm_output.csv.progress  out
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0/hmm-5# less hmm_output.csv.progress

hmm_output.csv.progress from the latest (?) hmm folder: Google Drive text file

Is this where I should be checking for any info on why the program was killed? Nothing appears to be wrong in this file from what I can tell. The "err" file was empty.

A side question: Is there a good way to extract out the CDR3 region from the fastq file generated using "./bin/extract-fasta.py"? I see that the .yaml file includes the keys "codon_positions": {"j": 327,"v": 291} and "cdr3_length": 39 for a particular cluster but I'm not quite sure how to translate this to the final fastq file. Thanks!

psathyrella commented 3 years ago

argggggg that exception just won't die. I'd convinced myself that that it couldn't get triggered any more, but it looks like I'll just have to figure out a way to skip sequences that trigger it instead. I'll try to get to that tomorrow.

Yeah unfortunately running on that sequence alone doesn't reproduce the error for me, but the sequence is unproductive, so if you don't need unproductive sequences, setting --skip-unproductive may avoid the error for you. p

So the log file says that particular bcrham process alone is using 7% of your memory, so multiplied by 6 is close to half, plus adding in the memory used by the python process that's spawned the bcrham procs it's likely the OOM killing it: p Unfortunately the nature of clustering is that both the time and memory required are highly dependent on the structure of the repertoire (not just its size). For instance a repertoire where everybody's either super similar or very different will be quite quick and easy, but if there's tons of sequences that are similar to each other but not super close, it has to do a lot more work since the approximate methods can't do as much. Ignoring small clustersis likely to make the biggest difference in reducing the memory footprint. But oh, wait, that looks like it says it only has access to 2GB, not 32. Maybe your docker image is only getting a small allocation, that could be increased?

At the moment you'll have to add a line or two to either bin/extract-fasta.py or bin/example-parse-output.py, but another thing I may get to tomorrow is adding a command line arg to them to make it simpler to extract a single column like cdr3 length. Adding print cluster_annotation['cdr3_length'] at this point will print the cdr3 len for the largest cluster; remove the 'break' to get the rest of them. I'm not a big fan of adding meta info to fasta files, since there's so many different formats for doing so that are all mutually incompatible, but you could do a similar thing (with more work) in extract-fasta.py

lindsly commented 3 years ago

Ah, I didn't realize that Docker imposes a memory cap. I actually installed Docker for the first time for this program so I'm still getting used to this whole process.

Hopefully that change will resolve the issues I'm having. I will run the larger files tonight to make sure. Thank you very much for all your help!

lindsly commented 3 years ago

The memory increase appears to have solved the original "killed" issue!

I tried running the second larger file that had the length exception again, this time using the --skip-unproductive setting, but another sequence (apparently productive) also had the same exception.

Another basic question I have is for the output file from bin/example-parse-output.py. I am able to see the output of the command

./bin/example-parse-output.py --fname _output/lys6_partis_out/lys6-partition.yaml --glfo-dir /partis/data/germlines/mouse/ > _output/lys6_partis_out/parsed_output.txt

nicely using less -R parsed_output.txt but when I transfer that folder to my working folder on my PC and try to view it, the colored text and other formatting causes the file to be very difficult to read. Example section below:

NNNNNNGAGGTGAAGCTTCTCCAGTCTGGAGGTGGCCTG**GCAGCCTTGAGGATCCCTGGAAACTCTCCTGTGCAGCCTCAGGAATCGATTTTAGTAGATACTGGATGAGTTAACT**GGCGGGCTCCAGGGAAAGGACTAGAATGGATTGGAGAAATTAATCCAGATAGCAGTACAATAAACTATGCACCATCTCTAAAGGATAAATTCATCATCCTTGCAGTGACAACGCCAAAATACGCTGTGTACCTTCCTGCAAATGAGTGAAAGTGTGAGATCTGGAGGACACAGCCCTTTATTACTGTGCAAAAGGGGGCGGTTACTATGCTATGGACTACTGGGGTCAAGGAAACCTCAGTCACTGTCTCCTCA 20ba4fc1-2906-4dc8-857e-16da8daf4e26 14.6 out of frame cdr3, stop codon

Is there a better way to write this file (I have tried .txt, .csv so far) and view it in Windows? Thanks!

psathyrella commented 3 years ago

Great, glad the memory fixed it.

I'll try to get to the exception in a bit.

Yeah so those are ansi color codes. It looks like windows terminals do support those, so maybe just view in a windows terminal? It appears there's also a windows version of less, or you can also just strip them from the log files.

psathyrella commented 3 years ago

ok this should fix the length exception.