I would suspect that the "Killed" indicates it got a signal from stdin equivalent to a ctrl-C or something, although I have no experience running things on windows so I'm not sure what could have caused that. In theory you're entirely in docker, so the fact you're on windows shouldn't matter, but in practice maybe mouse copy/pasting sent it a term signal or something. The "spent much longer" warning is quite informative about why it was slow, although I don't know if that's related to it being killed. It's saying that the main python proc spun off 12 subprocs to run bcrham, and each of those reported taking only a couple hundred seconds, whereas the whole step of writing their input and processing their output took 1150 seconds, which suggests either your i/o is super slow, or it's completely out of memory and swapping like crazy.
If that isn't it -- 24k sequences isn't very many, and while partition time is quite hard to predict since it depends so much on the repertoire structure, on 15 cores on a typical server I'd expect it to take less than an hour. It writes a "progress" file during the clustering steps that should give you a good idea of how it's doing. Another thing you can do is run on a subsample just to test that it finishes quickly and ok.
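If it helps, here's a minimal plain-python sketch for making a smaller test file (nothing partis-specific; the file names and the 5000-sequence subsample size are just examples, adjust to your setup):

```python
# Write a random subsample of a fasta to a new file, for a quick test run.
import random

def read_fasta(fname):
    seqs, name, lines = [], None, []
    with open(fname) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                if name is not None:
                    seqs.append((name, ''.join(lines)))
                name, lines = line[1:], []
            elif line:
                lines.append(line)
        if name is not None:
            seqs.append((name, ''.join(lines)))
    return seqs

seqs = read_fasta('ova14_rc.fasta')        # example input file name, path may differ
subsample = random.sample(seqs, 5000)      # needs at least 5000 sequences in the input
with open('ova14_rc_sub5000.fasta', 'w') as f:
    for name, seq in subsample:
        f.write('>%s\n%s\n' % (name, seq))
```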
Yeah the plotting install is unfortunate, hopefully you found this. I've actually just decided to finally make a separate multi-step docker build and switch to quay so I can have a separate docker image for plotting, although that won't be finished for a bit.
Thank you for your reply! I tried running the same command again without ever touching the command prompt and I got the Killed message again. Also, I tested a ctrl+c interrupt and it outputs the following (just for your reference):
^CTraceback (most recent call last):
File "./bin/partis", line 805, in <module>
args.func(args)
File "./bin/partis", line 261, in run_partitiondriver
parter.run(actions)
File "/partis/python/partitiondriver.py", line 125, in run
self.action_fcns[tmpaction]()
File "/partis/python/partitiondriver.py", line 522, in partition
self.run_waterer(look_for_cachefile=not self.args.write_sw_cachefile, write_cachefile=self.args.write_sw_cachefile, count_parameters=self.args.count_parameters) # run smith-waterman
File "/partis/python/partitiondriver.py", line 198, in run_waterer
self.set_vsearch_info(get_annotations=True)
File "/partis/python/partitiondriver.py", line 247, in set_vsearch_info
self.vs_info = utils.run_vsearch('search', seqs, self.args.workdir + '/vsearch', threshold=0.3, glfo=self.glfo, print_time=True, vsearch_binary=self.args.vsearch_binary, get_annotations=get_annotations, no_indels=self.args.no_indels)
File "/partis/python/utils.py", line 4929, in run_vsearch
run_cmds(cmdfos)
File "/partis/python/utils.py", line 3509, in run_cmds
time.sleep(per_proc_sleep_time)
KeyboardInterrupt
My output was set to a folder on my host system (outside of Docker), so that may be what's causing the slowdown. I am running it again now with the output set within Docker, and it appears (so far) to be working. I will also monitor the memory usage as it goes.
Attempting to install the plotting packages led to a lot of dependency issues, so I may just hold off on that for now.
ok, great, that's the same ctrl-c message I'm used to. Then I'd guess it's a memory issue; when I've gotten similar things it was the OS's out-of-memory killer killing it. I'd be surprised if that's the problem, since it doesn't usually use much memory on 24k sequences, at least compared to what's on a box with 12 cores, but I don't know how much memory is really there. There are a lot of different ways to optimize/approximate for speed and memory, though.
yeah, sorry about the dependency issues, I'll get the new docker image up as soon as I can. Meanwhile someone else kept track of how they got plotting working in docker last week, so this will likely fix things (the difference from what's in the manual, I think, is just the numpy update and the explicit list of bios2mds deps):
apt-get install -y xorg libx11-dev libglu1-mesa-dev r-cran-rgl
conda install -y -cr r-rgl r-essentials
conda update -y numpy
R --vanilla --slave -e 'install.packages(c("bios2mds","picante","pspline","deSolve","igraph","TESS","fpc","pvclust","corpcor","phytools","mvMORPH","geiger","mvtnorm","glassoFast","Rmpfr"), repos="http://cran.rstudio.com/")'
mkdir -p packages/RPANDA/lib
R CMD INSTALL -l packages/RPANDA/lib packages/RPANDA/
Some updates and another test:
I was unable to run the full fasta even with the modified output folder location (killed again). I was able to successfully partition the fasta using a subsample of 5000 sequences, though, so I'm guessing that memory is the issue. My machine has 32GB of RAM available, so I wouldn't expect this to be a problem. Also, I kept Task Manager up while running the full file and didn't notice any huge spikes in memory usage.
I also tried another, larger file (47k sequences) to see whether it was the fasta file itself or really a memory problem, but before it had a chance to be killed I got the following exception:
(base) root@af7131fba3e1:/partis# ./bin/partis partition --infname /host/home/Desktop/partis_fa_rc/lys14_rc.fasta --outfname lys14_partis_out/lys14-partition.yaml --n-procs 12 --species mouse --small-clusters-to-ignore 1-10 --parameter-dir lys14-full-parameter-dir
non-human species 'mouse', turning on allele clustering
parameter dir does not exist, so caching a new set of parameters before running action 'partition': lys14-full-parameter-dir
caching parameters
vsearch: 46479 / 47054 v annotations (575 failed) with 183 v genes in 31.2 sec
keeping 62 / 261 v genes
smith-waterman (new-allele clustering)
vsearch: 46444 / 47054 v annotations (610 failed) with 62 v genes in 62.9 sec
running 12 procs for 47054 seqs
Traceback (most recent call last):
File "./bin/partis", line 805, in <module>
args.func(args)
File "./bin/partis", line 261, in run_partitiondriver
parter.run(actions)
File "/partis/python/partitiondriver.py", line 125, in run
self.action_fcns[tmpaction]()
File "/partis/python/partitiondriver.py", line 264, in cache_parameters
self.run_waterer(dbg_str='new-allele clustering')
File "/partis/python/partitiondriver.py", line 221, in run_waterer
waterer.run(cachefname if write_cachefile else None)
File "/partis/python/waterer.py", line 108, in run
self.read_output(base_outfname, len(mismatches))
File "/partis/python/waterer.py", line 490, in read_output
self.summarize_query(qinfo) # returns before adding to <self.info> if it thinks we should rerun the query
File "/partis/python/waterer.py", line 979, in summarize_query
indelfo = self.combine_indels(qinfo, best) # the next time through, when we're writing ig-sw input, we look to see if each query is in <self.info['indels']>, and if it is we pass ig-sw the indel-reversed sequence, rather than the <input_info> sequence
File "/partis/python/waterer.py", line 1559, in combine_indels
return indelutils.combine_indels(regional_indelfos, full_qrseq, qrbounds, uid=qinfo['name'], debug=debug)
File "/partis/python/indelutils.py", line 645, in combine_indels
raise Exception('%sqr_gap_seq non-gap length %d not the same as qrbound length %d in %s region indelfo' % ('%s: ' % uid if uid is not None else '', utils.non_gap_len(rfo['qr_gap_seq']), qrbounds[region][1] - qrbounds[region][0], region))
Exception: a43659fe-3301-40c5-93b2-cda064707bde: qr_gap_seq non-gap length 249 not the same as qrbound length 248 in v region indelfo
I saw that there was another issue in 2018 (link) with this same exception, but it looks like it was successfully addressed. Any ideas what may be going wrong? Here is the read that causes the exception:
>a43659fe-3301-40c5-93b2-cda064707bde
GTGACTGGAGTTCAGACGTGCTCTTCCGATCTGGGGACTTCAGTGAAGATGTCCTGTAAGGCTTCTGGATACACCTTCACTAACTACTGGATAGGTTAGCAAAGCAGAGGCCTGGACATGGCCTTGAGTGGATTGGAGATATTTACCCTGGAGGTGCTTATATTAACTACAATGAAGTTCAAGGGCAAGGCCACACTGACTGCAGACAAATCCTCCAGCACAGCCTCCATGCAGTTCAGCAGCCTGACATCTGAGGACTCTGCCATCTATTACTGTGCAAGAAAGAATTACTACGGTAATACCTACTTTGACTACCGGGGCCAAGGCACCACTCAGTCTCCTCAGCC
I wasn't sure exactly which log file to look through but here is the info from my latest run on the first full fasta (~24k sequences).
Command (used --n-procs 6 instead of 12 to see if that would help):
./bin/partis partition --infname /host/home/Desktop/partis_fa_rc/ova14_rc.fasta --outfname _output/ova14_output/ova14-partition.yaml --n-procs 6 --species mouse --small-clusters-to-ignore 1-10 --parameter-dir _output/ova14_output/parameter-dir
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561# ls
cluster-path-progress germline-sets hmm_cached_info.csv hmm_input.csv istep-0
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561# cd istep-0
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0# ls
hmm-0 hmm-1 hmm-2 hmm-3 hmm-4 hmm-5
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0# cd hmm-5
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0/hmm-5# ls
err hmm_cached_info.csv hmm_input.csv hmm_output.csv.progress out
(base) root@af7131fba3e1:/tmp/partis-work/hmms/274561/istep-0/hmm-5# less hmm_output.csv.progress
hmm_output.csv.progress from the latest (?) hmm folder: Google Drive text file
Is this where I should be checking for any info on why the program was killed? Nothing appears to be wrong in this file from what I can tell. The "err" file was empty.
A side question: is there a good way to extract the CDR3 region from the fasta file generated using "./bin/extract-fasta.py"? I see that the .yaml file includes the keys "codon_positions": {"j": 327, "v": 291} and "cdr3_length": 39 for a particular cluster, but I'm not quite sure how to translate this to the final fasta file. Thanks!
argggggg that exception just won't die. I'd convinced myself that it couldn't get triggered any more, but it looks like I'll just have to figure out a way to skip sequences that trigger it instead. I'll try to get to that tomorrow.
Yeah unfortunately running on that sequence alone doesn't reproduce the error for me, but the sequence is unproductive, so if you don't need unproductive sequences, setting --skip-unproductive may avoid the error for you.
So the log file says that particular bcrham process alone is using 7% of your memory, so multiplied by 6 that's close to half; adding in the memory used by the python process that spawned the bcrham procs, it's likely the OOM killer that's killing it. Unfortunately the nature of clustering is that both the time and memory required are highly dependent on the structure of the repertoire (not just its size). For instance a repertoire where everybody's either super similar or very different will be quite quick and easy, but if there are tons of sequences that are similar to each other yet not super close, it has to do a lot more work, since the approximate methods can't do as much. Ignoring small clusters is likely to make the biggest difference in reducing the memory footprint. But oh, wait, that looks like it says it only has access to 2GB, not 32. Maybe your docker image is only getting a small allocation, which could be increased?
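For what it's worth, here's a quick sketch to check how much memory the container actually sees, by reading /proc/meminfo from inside it (on Docker Desktop this should reflect the memory allocated to Docker's linux VM, not your full 32GB):

```python
# Print how much memory is visible from inside the container (values in /proc/meminfo are in kB).
with open('/proc/meminfo') as f:
    meminfo = dict(line.split(':', 1) for line in f if ':' in line)

for key in ('MemTotal', 'MemAvailable'):
    if key in meminfo:  # MemAvailable needs a reasonably recent kernel
        kb = float(meminfo[key].strip().split()[0])
        print('%s: %.1f GB' % (key, kb / 1024 / 1024))
```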
At the moment you'll have to add a line or two to either bin/extract-fasta.py or bin/example-parse-output.py, but another thing I may get to tomorrow is adding a command line arg to them to make it simpler to extract a single column like cdr3 length. Adding
print cluster_annotation['cdr3_length']
at this point will print the cdr3 length for the largest cluster; remove the 'break' to get the rest of them. I'm not a big fan of adding meta info to fasta files, since there are so many different formats for doing so that are all mutually incompatible, but you could do a similar thing (with more work) in extract-fasta.py.
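And for pulling out the cdr3 sequence itself: going by the numbers in your yaml (327 + 3 - 291 = 39), cdr3_length runs from codon_positions['v'] through the conserved codon after codon_positions['j'], so something like the sketch below, dropped into that same loop in example-parse-output.py, should get you there. It assumes the per-sequence 'unique_ids' and 'seqs' lists in the annotation; treat it as a sketch rather than a tested recipe.

```python
# Sketch: print the cdr3 for each sequence in this cluster's annotation.
# cluster_annotation is the dict from the example-parse-output.py loop mentioned above.
cpos = cluster_annotation['codon_positions']['v']   # start of the conserved cysteine codon, e.g. 291
cdr3_len = cluster_annotation['cdr3_length']        # includes both conserved codons, e.g. 39
for uid, seq in zip(cluster_annotation['unique_ids'], cluster_annotation['seqs']):
    print('%s  %s' % (uid, seq[cpos : cpos + cdr3_len]))
```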
Ah, I didn't realize that Docker imposes a memory cap. I actually installed Docker for the first time for this program so I'm still getting used to this whole process.
Hopefully that change will resolve the issues I'm having. I will run the larger files tonight to make sure. Thank you very much for all your help!
The memory increase appears to have solved the original "killed" issue!
I tried running the second, larger file that hit the length exception again, this time with the --skip-unproductive setting, but another sequence (apparently productive) hit the same exception.
Another basic question I have is about the output file from bin/example-parse-output.py. I am able to view the output of the command
./bin/example-parse-output.py --fname _output/lys6_partis_out/lys6-partition.yaml --glfo-dir /partis/data/germlines/mouse/ > _output/lys6_partis_out/parsed_output.txt
nicely using less -R parsed_output.txt
but when I transfer that folder to my working folder on my PC and try to view it, the colored text and other formatting make the file very difficult to read. Example section below:
[1;34mN[0m[1;34mN[0m[1;34mN[0m[1;34mN[0m[1;34mN[0m[1;34mN[0mGAGGTGAAGCTTCTCCAGTCTGGAGGTGGCCTG[1;34m*[0m[1;34m*[0mGCAGCCT[91mT[0mGAGGATCCCTGGAAACTCTCCTGTGCAGCCTCAGGAATCGATTTTAGTAGATACTGGATGAGTT[91mA[0m[91mA[0m[91mC[0mT[1;34m*[0m[1;34m*[0mGGCGGGCTCCAGGGAAAGGACTAGAATGGATTGGAGAAATTAATCCAGATAGCAGTACAATAAACTATGCACCATCTCTAAAGGATAAATTCATCATCCTT[91mG[0mCAG[91mT[0mGACAACGCCAAAA[91mT[0m[91mA[0m[91mC[0m[91mG[0m[91mC[0m[91mT[0m[91mG[0m[91mT[0m[91mG[0m[91mT[0m[91mA[0mC[91mC[0m[91mT[0m[91mT[0m[91mC[0m[91mC[0m[91mT[0m[91mG[0m[91mC[0mA[91mA[0m[91mA[0m[91mT[0m[91mG[0mAGTGA[91mA[0mAGTGTGAGAT[91mC[0m[91mT[0m[91mG[0mGAGGACACAGCCCTTTATTAC[7mT[0m[7mG[0m[7mT[0mGCAAAAG[91mG[0mGGGCGGTTACTATGCTATGGACTAC[7mT[0m[7mG[0m[7mG[0mGGTCAAGGAA[91mA[0mC[91mC[0m[91mT[0m[91mC[0m[91mA[0m[91mG[0m[91mT[0m[91mC[0m[91mA[0mC[91mT[0m[91mG[0m[91mT[0m[91mC[0m[91mT[0mC[91mC[0m[91mT[0m[91mC[0m[91mA[0m 20ba4fc1-2906-4dc8-857e-16da8daf4e26 14.6 [91mout of frame cdr3, stop codon[0m
Is there a better way to write this file (I have tried .txt, .csv so far) and view it in Windows? Thanks!
Great, glad the memory fixed it.
I'll try to get to the exception in a bit.
Yeah, so those are ANSI color codes. It looks like Windows terminals do support them, so maybe just view it in a Windows terminal? It appears there's also a Windows version of less, or you can just strip them from the log files.
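If you go the stripping route, a few lines of python will do it (file names are just examples):

```python
# Strip ANSI escape sequences (the color/reverse-video codes) from a saved log file.
import re

ansi_re = re.compile(r'\x1b\[[0-9;]*m')  # matches codes like ESC[1;34m and ESC[0m
with open('parsed_output.txt') as f:
    text = f.read()
with open('parsed_output_plain.txt', 'w') as f:
    f.write(ansi_re.sub('', text))
```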
ok this should fix the length exception.
Dear Partis team,
When running partition on one of my fasta files, Partis appears to spend a significant amount of time on one of the steps. I came back to the run a few hours after starting Partis, and it had gotten to the step shown below.
I tried to copy something from the command prompt, which caused Partis to be killed. I'm not sure if it was killed by something I did, or if there was an error that didn't actually kill the program until I interacted with the command prompt. Any insight on how to address this would be great!
Some additional information/questions
I am running Partis using Docker on a Windows 10 machine with 6 cores (12 logical processors). I have used the same input command for two other (smaller) files, and those runs completed in a matter of minutes.
Sorry about the title - I had another question regarding MDS (why I was copying from the command prompt in the first place) but I realized I didn't have the R package installed.