gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

No pangenome alignment in Panaroo output folder #72

Closed martinastoycheva closed 4 years ago

martinastoycheva commented 4 years ago

Hello,

I installed panaroo through conda and ran it twice, once for the pan and once for the core alignment. The pan run does not produce any different output, apart from lacking the core gene alignment file, and there is no error in the log file. The command and log file are copied below.

Command used: $ panaroo -i ./UK_first/*.gff -o results_panaroo -t 18 --clean-mode strict -a pan --aligner mafft

Log file:

pre-processing gff3 files...

================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Oct 26 2019, 14:51:47
Command: cd-hit -T 18 -i results_panaroo/combined_protein_CDS.fasta -o results_panaroo/combined_protein_cdhit_out.txt -c 0.98 -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2

Started: Thu Aug 20 11:46:24 2020

================================================================

Output


Your word length is 2, using 5 may be faster!
total seq: 763230
longest and shortest : 9624 and 29
Total letters: 253246411
Sequences have been sorted

Approximated minimal memory consumption:
Sequence : 347M
Buffer : 18 X 21M = 390M
Table : 2 X 12M = 24M
Miscellaneous : 9M
Total : 771M

Table limit with the given memory limit:
Max number of representatives: 582515
Max number of word counting entries: 37326436

comparing sequences from 0 to 1474 .---------- new table with 18 representatives comparing sequences from 1474 to 39561 .......... 10000 finished 79 clusters ---------- 23599 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 15962 to 53325 .......... 20000 finished 147 clusters .......... 30000 finished 215 clusters ---------- 23148 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 30177 to 66829 .......... 40000 finished 286 clusters ---------- 22265 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 44564 to 80497 .......... 50000 finished 352 clusters .......... 60000 finished 416 clusters ---------- 20383 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 60114 to 95269 .......... 70000 finished 479 clusters ---------- 19428 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 75841 to 110210 .......... 80000 finished 544 clusters .......... 90000 finished 607 clusters ---------- 19099 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 91111 to 124716 .......... 100000 finished 672 clusters ---------- 17431 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 107285 to 140082 .......... 110000 finished 739 clusters .......... 120000 finished 799 clusters ---------- 17494 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 122588 to 154620 .......... 130000 finished 861 clusters ---------- 16062 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 138558 to 169791 .......... 140000 finished 926 clusters .......... 
150000 finished 992 clusters ---------- 15375 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 154416 to 184856 .......... 160000 finished 1053 clusters ---------- 15499 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 169357 to 199050 .......... 170000 finished 1118 clusters .......... 180000 finished 1183 clusters ---------- 13172 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 185878 to 214745 .......... 190000 finished 1248 clusters .......... 200000 finished 1310 clusters ---------- 14360 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 200385 to 228527 .......... 210000 finished 1376 clusters ---------- 10677 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 217850 to 245119 .......... 220000 finished 1429 clusters .......... 230000 finished 1497 clusters ---------- 11501 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 233618 to 260098 .......... 240000 finished 1561 clusters ---------- 11433 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 248665 to 274393 .......... 250000 finished 1628 clusters .......... 260000 finished 1688 clusters ---------- 9970 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 264423 to 289363 .......... 270000 finished 1757 clusters .......... 280000 finished 1815 clusters ---------- 9251 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 280112 to 304267 .......... 
290000 finished 1884 clusters ---------- 8041 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 296226 to 319576 .......... 300000 finished 1945 clusters .......... 310000 finished 2009 clusters ---------- 7069 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 312507 to 335043 .......... 320000 finished 2067 clusters ---------- 7078 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 327965 to 349728 .......... 330000 finished 2135 clusters .......... 340000 finished 2204 clusters ---------- 7018 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 342710 to 363736 .......... 350000 finished 2254 clusters ---------- 4894 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 358842 to 379061 .......... 360000 finished 2319 clusters .......... 370000 finished 2374 clusters ---------- 2220 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 376841 to 396160 .......... 380000 finished 2436 clusters .......... 390000 finished 2500 clusters ---------- 3807 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 392353 to 410896 .......... 400000 finished 2567 clusters ---------- 2626 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 408270 to 426018 .......... 410000 finished 2631 clusters .......... 420000 finished 2693 clusters ---------- 1137 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 424881 to 441798 .......... 
430000 finished 2746 clusters ---------- 1915 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 439883 to 456050 .......... 440000 finished 2824 clusters .......... 450000 finished 2885 clusters ---------- 1200 remaining sequences to the next cycle ---------- new table with 100 representatives comparing sequences from 454850 to 470269 .......... 460000 finished 2947 clusters .......... 470000 finished 3012 clusters ---------- new table with 94 representatives comparing sequences from 470269 to 484917 .......... 480000 finished 3074 clusters ....---------- new table with 98 representatives comparing sequences from 484917 to 498832 .......... 490000 finished 3141 clusters ........---------- new table with 86 representatives comparing sequences from 498832 to 512051 .......... 500000 finished 3207 clusters .......... 510000 finished 3264 clusters ..---------- new table with 99 representatives comparing sequences from 512051 to 524609 .......... 520000 finished 3330 clusters ....---------- new table with 62 representatives comparing sequences from 524609 to 536540 .......... 530000 finished 3393 clusters ......---------- new table with 75 representatives comparing sequences from 536540 to 547874 .......... 540000 finished 3458 clusters .......---------- new table with 77 representatives comparing sequences from 547874 to 558641 .......... 550000 finished 3516 clusters ........---------- new table with 70 representatives comparing sequences from 558641 to 568870 .......... 560000 finished 3588 clusters ........---------- new table with 62 representatives comparing sequences from 568870 to 578588 .......... 570000 finished 3652 clusters ........---------- new table with 62 representatives comparing sequences from 578588 to 587820 .......... 580000 finished 3715 clusters .......---------- new table with 59 representatives comparing sequences from 587820 to 596590 .......... 
590000 finished 3780 clusters ......---------- new table with 65 representatives comparing sequences from 596590 to 604922 .......... 600000 finished 3842 clusters ....---------- new table with 50 representatives comparing sequences from 604922 to 612837 .......... 610000 finished 3911 clusters ..---------- new table with 60 representatives comparing sequences from 612837 to 620356 .......... 620000 finished 3977 clusters ---------- new table with 40 representatives comparing sequences from 620356 to 627499 ...........................---------- new table with 60 representatives comparing sequences from 627499 to 634285 .......... 630000 finished 4046 clusters ....---------- new table with 42 representatives comparing sequences from 634285 to 640732 .......... 640000 finished 4110 clusters ---------- new table with 31 representatives comparing sequences from 640732 to 646856 ..........................---------- new table with 40 representatives comparing sequences from 646856 to 652674 .......... 650000 finished 4170 clusters ..---------- new table with 42 representatives comparing sequences from 652674 to 658201 ..........................---------- new table with 30 representatives comparing sequences from 658201 to 663452 .......... 660000 finished 4239 clusters ...---------- new table with 36 representatives comparing sequences from 663452 to 668440 .........................---------- new table with 36 representatives comparing sequences from 668440 to 673179 .......... 670000 finished 4300 clusters ...---------- new table with 30 representatives comparing sequences from 673179 to 677681 ........................---------- new table with 32 representatives comparing sequences from 677681 to 681958 .......... 
680000 finished 4371 clusters .---------- new table with 26 representatives comparing sequences from 681958 to 686021 .........................---------- new table with 25 representatives comparing sequences from 686021 to 689881 .......................---------- new table with 22 representatives comparing sequences from 689881 to 693548 .......... 690000 finished 4429 clusters ...---------- new table with 26 representatives comparing sequences from 693548 to 697032 ........................---------- new table with 24 representatives comparing sequences from 697032 to 700341 .......... 700000 finished 4496 clusters ---------- new table with 18 representatives comparing sequences from 700341 to 703485 .......................---------- new table with 28 representatives comparing sequences from 703485 to 706472 .......................---------- new table with 17 representatives comparing sequences from 706472 to 709309 .......................---------- new table with 22 representatives comparing sequences from 709309 to 712005 .......... 710000 finished 4564 clusters ..---------- new table with 24 representatives comparing sequences from 712005 to 714566 ......................---------- new table with 12 representatives

763230 finished 4929 clusters

Approximated maximum memory consumption: 771M
writing new database
writing clustering information
program completed !

Total CPU time 145.12
running cmd: cd-hit -T 18 -i results_panaroo/combined_protein_CDS.fasta -o results_panaroo/combined_protein_cdhit_out.txt -c 0.98 -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
generating initial network...
Processing paralogs...
collapse mistranslations...
Processing depth: 1
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Processing depth: 2
Iteration: 1
Iteration: 2
Processing depth: 3
Iteration: 1
collapse gene families...
Processing depth: 1
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Processing depth: 2
Iteration: 1
Processing depth: 3
Iteration: 1
trimming contig ends...
refinding genes...
Number of searches to perform: 50633
Searching...
translating hits...
removing by consensus...
Updating output...
Number of refound genes: 388
collapse gene families with refound genes...
Processing depth: 1
Iteration: 1
Processing depth: 2
Iteration: 1
Processing depth: 3
Iteration: 1
writing output...
generating pan genome MSAs...

============================

Job utilisation efficiency

============================

Job ID: 8978244
Cluster: viking
User/Group: ----/clusterusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 20
CPU Utilized: 3-02:22:29
CPU Efficiency: 88.45% of 3-12:05:00 core-walltime
Job Wall-clock time: 04:12:15
Memory Utilized: 2.77 GB
Memory Efficiency: 1.39% of 200.00 GB
Requested wall clock time: 1-00:00:00
Actual wall clock time: 04:12:15
Wall clock time efficiency: 17.5%
Job queued time: 00:00:25

nzmacalasdair commented 4 years ago

Hello,

Thanks for bringing this to our attention, and sorry that you're having trouble getting a core genome alignment from panaroo. From the logfile it looks like the pan-genome inference step of panaroo has completed successfully, but something might have gone wrong in the gene alignment step.

Fortunately, you don't need to run the entire algorithm again to debug this step, as you should be able to use the command panaroo-msa to just rerun the pan or core gene alignment step.

It's hard to tell exactly what went wrong just from the log file. Could you please check, for either of your runs, whether the following has occurred:

  1. Does the temporary directory, "tmp" followed by a string of random characters, still exist for either run? If so, are there .fasta files for each gene in that directory?
  2. Has an aligned_gene_sequences directory been created for either run, and if so, does it contain an aligned FASTA file (.aln.fas) for each gene?
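
The two checks above can be scripted. This is only a rough sketch, not part of panaroo: the function name inspect_panaroo_output is made up for illustration, and it assumes the naming described in this thread (a "tmp..." directory holding .fasta files, and an aligned_gene_sequences directory holding .aln.fas files).

```python
from pathlib import Path


def inspect_panaroo_output(out_dir):
    """Summarise the two checks for one output folder (hypothetical helper)."""
    out = Path(out_dir)
    # Check 1: leftover temporary directories, assumed to be named "tmp" + random characters.
    tmp_dirs = sorted(p for p in out.glob("tmp*") if p.is_dir())
    unaligned = {p.name: len(list(p.glob("*.fasta"))) for p in tmp_dirs}
    # Check 2: per-gene alignments, assumed to end in ".aln.fas".
    aln_dir = out / "aligned_gene_sequences"
    aligned = sorted(aln_dir.glob("*.aln.fas")) if aln_dir.is_dir() else []
    # Zero-byte files are listed separately; they may point to an interrupted aligner.
    empty = [p.name for p in aligned if p.stat().st_size == 0]
    return {
        "tmp_dirs": unaligned,
        "aligned_dir_exists": aln_dir.is_dir(),
        "n_aligned": len(aligned),
        "empty_alignments": empty,
    }


if __name__ == "__main__":
    import sys

    # Defaults to the output folder name used in the command earlier in this thread.
    target = sys.argv[1] if len(sys.argv) > 1 else "results_panaroo"
    print(inspect_panaroo_output(target))
```

Running it against each output folder gives a quick answer to both questions without listing thousands of files by hand.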

Also, I see that you're running this on a cluster with some kind of job scheduler. MAFFT writes its alignment to STDOUT, which is saved to a file using the > redirection operator, so I wonder if that might be causing some issues? It may be worth trying to perform the alignment using clustal. You could do this quickly for either run you've already done, using panaroo-msa. The options should be more or less identical to the main panaroo executable, but it also has its own help message.
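
For context, here is one way a wrapper can save a tool's STDOUT without relying on the shell's > operator. It is an illustrative sketch only and not how panaroo calls MAFFT internally; run_to_file is a hypothetical helper, and the demo uses the Python interpreter as a stand-in for the aligner so that it runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its standard output to out_path."""
    with open(out_path, "w") as handle:
        # The child process writes straight into the open file; no shell is involved.
        result = subprocess.run(
            cmd, stdout=handle, stderr=subprocess.PIPE, universal_newlines=True
        )
    if result.returncode != 0:
        raise RuntimeError(
            "command failed (%d): %s" % (result.returncode, result.stderr.strip())
        )
    return out_path


if __name__ == "__main__":
    # Stand-in command; with MAFFT installed the list might instead be
    # ["mafft", "--auto", "gene.fasta"] (the input file name is hypothetical).
    with tempfile.TemporaryDirectory() as tmp:
        out = run_to_file(
            [sys.executable, "-c", "print('>seq1'); print('ACGT')"],
            os.path.join(tmp, "demo.aln.fas"),
        )
        print(open(out).read())
```

A non-zero exit code raises an error instead of silently leaving an empty or truncated output file, which is the failure mode worth ruling out on a scheduler.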

Thanks!

martinastoycheva commented 4 years ago

Hello,

Thank you for your prompt response! Sorry about the awful formatting of my first comment!

I. To answer your questions first:

  1. The "tmp" directory is no longer in the panaroo output directory, but I saw another tmp directory being created when I started the panaroo-msa run.
  2. The aligned_gene_sequences folder does exist and contains all the gene alignments as separate files.

II. Some additional comments:

  1. The core alignment is produced but the pan alignment is missing. Perhaps I have misunderstood the output: I do get a core alignment with the options --aligner mafft and -a core, but with the option -a pan there is no alignment output. Is this supposed to happen?
  2. The prank alignment option leads to a failed alignment. I may have misunderstood something about the installation. I used conda to add prank to the conda environment I created for panaroo. See the environment version history below, followed by the command and error log:

CONDA ENV VERSION HISTORY:

'2020-08-17 15:51:04 (rev 0) 2020-08-17 16:05:57 (rev 1) +_libgcc_mutex-0.1 (conda-forge) +_openmp_mutex-4.5 (conda-forge) +alsa-lib-1.2.3 (conda-forge) +aragorn-1.2.38 (bioconda) +argcomplete-1.12.0 (conda-forge) +argh-0.26.2 (conda-forge) +barrnap-0.9 (bioconda) +bedtools-2.29.2 (bioconda) +biopython-1.77 (conda-forge) +blast-2.10.1 (bioconda) +bzip2-1.0.8 (conda-forge) +c-ares-1.11.0 (bioconda) +ca-certificates-2020.6.20 (conda-forge) +cairo-1.16.0 (conda-forge) +capnproto-0.6.1 (conda-forge) +cd-hit-4.8.1 (bioconda) +certifi-2020.6.20 (conda-forge) +curl-7.71.1 (conda-forge) +cycler-0.10.0 (conda-forge) +decorator-4.4.2 (conda-forge) +dendropy-4.4.0 (bioconda) +entrez-direct-13.3 (bioconda) +expat-2.2.9 (conda-forge) +fontconfig-2.13.1 (conda-forge) +freetype-2.10.2 (conda-forge) +gettext-0.19.8.1 (conda-forge) +gffutils-0.10.1 (bioconda) +giflib-5.2.1 (conda-forge) +glib-2.65.0 (conda-forge) +graphite2-1.3.13 (conda-forge) +gsl-2.6 (conda-forge) +harfbuzz-2.7.1 (conda-forge) +hmmer-3.3 (bioconda) +icu-67.1 (conda-forge) +importlib-metadata-1.7.0 (conda-forge) +importlib_metadata-1.7.0 (conda-forge) +infernal-1.1.2 (bioconda) +intbitset-2.4.0 (conda-forge) +intel-openmp-2020.1 +joblib-0.16.0 (conda-forge) +jpeg-9d (conda-forge) +kiwisolver-1.2.0 (conda-forge) +krb5-1.17.1 (conda-forge) +lcms2-2.11 (conda-forge) +ld_impl_linux-64-2.34 (conda-forge) +libblas-3.8.0 (conda-forge) +libcblas-3.8.0 (conda-forge) +libcurl-7.71.1 (conda-forge) +libedit-3.1.20191231 (conda-forge) +libev-4.33 (conda-forge) +libffi-3.2.1 (conda-forge) +libgcc-7.2.0 (conda-forge) +libgcc-ng-9.3.0 (conda-forge) +libgfortran-ng-7.5.0 (conda-forge) +libgomp-9.3.0 (conda-forge) +libiconv-1.15 (conda-forge) +libidn11-1.34 (conda-forge) +liblapack-3.8.0 (conda-forge) +libllvm9-9.0.1 (conda-forge) +libnghttp2-1.41.0 (conda-forge) +libopenblas-0.3.10 (conda-forge) +libpng-1.6.37 (conda-forge) +libssh2-1.9.0 (conda-forge) +libstdcxx-ng-9.3.0 (conda-forge) +libtiff-4.1.0 (conda-forge) +libuuid-2.32.1 
(conda-forge) +libwebp-base-1.1.0 (conda-forge) +libxcb-1.13 (conda-forge) +libxml2-2.9.10 (conda-forge) +llvm-meta-7.0.0 (conda-forge) +llvmlite-0.33.0 (conda-forge) +lz4-c-1.9.2 (conda-forge) +mafft-7.471 (bioconda) +mash-2.2.2 (bioconda) +matplotlib-base-3.3.1 (conda-forge) +minced-0.4.2 (bioconda) +mkl-2020.1 +ncurses-6.2 (conda-forge) +networkx-2.4 (conda-forge) +numba-0.50.1 (conda-forge) +numpy-1.19.1 (conda-forge) +olefile-0.46 (conda-forge) +openjdk-11.0.8 (conda-forge) +openmp-7.0.0 (conda-forge) +openssl-1.1.1g (conda-forge) +panaroo-1.2.3 (bioconda) +parallel-20160622 (bioconda) +pcre-8.44 (conda-forge) +perl-5.26.2 (conda-forge) +perl-app-cpanminus-1.7044 (bioconda) +perl-archive-tar-2.32 (bioconda) +perl-base-2.23 (bioconda) +perl-bioperl-1.6.924 (bioconda) +perl-business-isbn-3.004 (bioconda) +perl-business-isbn-data-20140910.003 (bioconda) +perl-carp-1.38 (bioconda) +perl-common-sense-3.74 (bioconda) +perl-compress-raw-bzip2-2.087 (bioconda) +perl-compress-raw-zlib-2.087 (bioconda) +perl-constant-1.33 (bioconda) +perl-data-dumper-2.173 (bioconda) +perl-digest-hmac-1.03 (bioconda) +perl-digest-md5-2.55 (bioconda) +perl-encode-2.88 (bioconda) +perl-encode-locale-1.05 (bioconda) +perl-exporter-5.72 (bioconda) +perl-exporter-tiny-1.002001 (bioconda) +perl-extutils-makemaker-7.36 (bioconda) +perl-file-listing-6.04 (bioconda) +perl-file-path-2.16 (bioconda) +perl-file-temp-0.2304 (bioconda) +perl-html-parser-3.72 (bioconda) +perl-html-tagset-3.20 (bioconda) +perl-html-tree-5.07 (bioconda) +perl-http-cookies-6.04 (bioconda) +perl-http-daemon-6.01 (bioconda) +perl-http-date-6.02 (bioconda) +perl-http-message-6.18 (bioconda) +perl-http-negotiate-6.01 (bioconda) +perl-io-compress-2.087 (bioconda) +perl-io-html-1.001 (bioconda) +perl-io-socket-ssl-2.066 (bioconda) +perl-io-zlib-1.10 (bioconda) +perl-json-4.02 (bioconda) +perl-json-xs-2.34 (bioconda) +perl-libwww-perl-6.39 (bioconda) +perl-list-moreutils-0.428 (bioconda) +perl-list-moreutils-xs-0.428 (bioconda) 
+perl-lwp-mediatypes-6.04 (bioconda) +perl-lwp-protocol-https-6.07 (bioconda) +perl-mime-base64-3.15 (bioconda) +perl-mozilla-ca-20180117 (bioconda) +perl-net-http-6.19 (bioconda) +perl-net-ssleay-1.88 (bioconda) +perl-ntlm-1.09 (bioconda) +perl-parent-0.236 (bioconda) +perl-pathtools-3.75 (bioconda) +perl-scalar-list-utils-1.52 (bioconda) +perl-socket-2.027 (bioconda) +perl-storable-3.15 (bioconda) +perl-test-requiresinternet-0.05 (bioconda) +perl-threaded-5.26.0 (bioconda) +perl-time-local-1.28 (bioconda) +perl-try-tiny-0.30 (bioconda) +perl-types-serialiser-1.0 (bioconda) +perl-uri-1.76 (bioconda) +perl-www-robotrules-6.02 (bioconda) +perl-xml-namespacesupport-1.12 (bioconda) +perl-xml-parser-2.44 (bioconda) +perl-xml-sax-1.02 (bioconda) +perl-xml-sax-base-1.09 (bioconda) +perl-xml-sax-expat-0.51 (bioconda) +perl-xml-simple-2.25 (bioconda) +perl-xsloader-0.24 (bioconda) +perl-yaml-1.29 (bioconda) +pillow-7.2.0 (conda-forge) +pip-20.2.2 (conda-forge) +pixman-0.38.0 (conda-forge) +plotly-4.9.0 (conda-forge) +prank-v.170427 (bioconda) +prodigal-2.6.3 (bioconda) +prokka-1.13 (bioconda) +pthread-stubs-0.4 (conda-forge) +pyfaidx-0.5.9.1 (bioconda) +pyparsing-2.4.7 (conda-forge) +python-3.6.11 (conda-forge) +python-dateutil-2.8.1 (conda-forge) +python-edlib-1.3.8.post1 (bioconda) +python_abi-3.6 (conda-forge) +readline-8.0 (conda-forge) +retrying-1.3.3 (conda-forge) +scikit-learn-0.23.2 (conda-forge) +scipy-1.5.2 (conda-forge) +setuptools-49.6.0 (conda-forge) +simplejson-3.8.1 (bioconda) +six-1.15.0 (conda-forge) +sqlite-3.32.3 (conda-forge) +tbl2asn-25.7 (bioconda) +threadpoolctl-2.1.0 (conda-forge) +tk-8.6.10 (conda-forge) +tornado-6.0.4 (conda-forge) +tqdm-4.7.2 (bioconda) +wheel-0.35.1 (conda-forge) +xorg-fixesproto-5.0 (conda-forge) +xorg-inputproto-2.3.2 (conda-forge) +xorg-kbproto-1.0.7 (conda-forge) +xorg-libice-1.0.10 (conda-forge) +xorg-libsm-1.2.3 (conda-forge) +xorg-libx11-1.6.11 (conda-forge) +xorg-libxau-1.0.9 (conda-forge) +xorg-libxdmcp-1.1.3 
(conda-forge) +xorg-libxext-1.3.4 (conda-forge) +xorg-libxfixes-5.0.3 (conda-forge) +xorg-libxi-1.7.10 (conda-forge) +xorg-libxrender-0.9.10 (conda-forge) +xorg-libxtst-1.2.3 (conda-forge) +xorg-recordproto-1.14.2 (conda-forge) +xorg-renderproto-0.11.1 (conda-forge) +xorg-xextproto-7.3.0 (conda-forge) +xorg-xproto-7.0.31 (conda-forge) +xz-5.2.5 (conda-forge) +zipp-3.1.0 (conda-forge) +zlib-1.2.11 (conda-forge) +zstd-1.4.5 (conda-forge)

2020-08-20 12:01:33 (rev 2) +argtable2-2.13 (conda-forge) +clustalo-1.2.4 (bioconda)

2020-08-21 15:40:22 (rev 3) +iqtree-2.0.3 (bioconda)`

COMMAND and ERROR LOG:

running cmd: cd-hit -T 18 -i results_panaroo_prank_strict/combined_protein_CDS.fasta -o results_panaroo_prank_strict/combined_protein_cdhit_out.txt -c 0.98 -s 0.98 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
generating initial network...
Processing paralogs...
collapse mistranslations...
Processing depth: 1
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Processing depth: 2
Iteration: 1
Iteration: 2
Processing depth: 3
Iteration: 1
collapse gene families...
Processing depth: 1
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Processing depth: 2
Iteration: 1
Processing depth: 3
Iteration: 1
trimming contig ends...
refinding genes...
Number of searches to perform: 50633
Searching...
translating hits...
removing by consensus...
Updating output...
Number of refound genes: 388
collapse gene families with refound genes...
Processing depth: 1
Iteration: 1
Processing depth: 2
Iteration: 1
Processing depth: 3
Iteration: 1
writing output...
generating pan genome MSAs...

100%|██████████| 10/10 [01:02<00:00, 5.69s/it]
100%|██████████| 12/12 [00:09<00:00, 1.31it/s]
100%|██████████| 5043/5043 [00:00<00:00, 16513.81it/s]
100%|██████████| 204/204 [00:00<00:00, 9286.58it/s]
100%|██████████| 4/4 [00:00<00:00, 2365.65it/s]
100%|██████████| 1/1 [00:00<00:00, 69905.07it/s]
100%|██████████| 4754/4754 [00:00<00:00, 43946.23it/s]
100%|██████████| 3/3 [00:00<00:00, 1935.24it/s]
100%|██████████| 4751/4751 [00:00<00:00, 41038.65it/s]
100%|██████████| 4751/4751 [00:00<00:00, 43185.52it/s]
100%|██████████| 17/17 [00:00<00:00, 4011.66it/s]
100%|██████████| 1/1 [00:00<00:00, 900.07it/s]
100%|██████████| 1/1 [00:00<00:00, 65536.00it/s]
100%|██████████| 4731/4731 [00:00<00:00, 43452.82it/s]
100%|██████████| 4731/4731 [00:00<00:00, 40978.30it/s]
170it [00:29, 5.72it/s]
100%|██████████| 4651/4651 [00:00<00:00, 59496.85it/s]
100%|██████████| 4651/4651 [00:00<00:00, 54962.59it/s]
100%|██████████| 4651/4651 [00:00<00:00, 52128.70it/s]
100%|██████████| 4651/4651 [49:02<00:00, 1.58it/s]
2%|▏ | 72/4621 [5:22:17<376:54:57, 298.28s/it]
Traceback (most recent call last):
  File "/users/mms565/scratch/conda/envs/panaroo/bin/panaroo", line 10, in <module>
    sys.exit(main())
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/panaroo/__main__.py", line 463, in main
    args.alr, isolate_names)
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/panaroo/generate_output.py", line 215, in generate_pan_genome_alignment
    threads, aligner)
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/panaroo/generate_alignments.py", line 158, in multi_align_sequences
    delayed(align_sequences)(x, outdir, aligner) for x in tqdm(commands))
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/joblib/parallel.py", line 1042, in __call__
    self.retrieve()
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/joblib/parallel.py", line 253, in __call__
    for func, args, kwargs in self.items]
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/joblib/parallel.py", line 253, in <listcomp>
    for func, args, kwargs in self.items]
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/panaroo/generate_alignments.py", line 147, in align_sequences
    stdout, stderr = command[0]()
  File "/users/mms565/scratch/conda/envs/panaroo/lib/python3.6/site-packages/Bio/Application/__init__.py", line 569, in __call__
    raise ApplicationError(return_code, str(self), stdout_str, stderr_str)
Bio.Application.ApplicationError: Non-zero return code -6 from 'prank -d=results_panaroo_prank_strict/tmp5z7p4a3k/addB.fasta -o=addB -f=8 -codon', message "prank: intmatrix.h:66: int IntMatrix::g(int, int, int, int): Assertion `xa<x' failed."

nzmacalasdair commented 4 years ago

Hello!

Thanks for getting back with the responses. If alignments for all core/pan genes were created in the "aligned_gene_sequences" folder, then the problem seems to be in the concatenation step.
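
To make "concatenation step" concrete, here is a minimal sketch of joining per-gene alignments into one alignment with a row per isolate. It is not panaroo's implementation: the function names are invented for illustration, and it assumes each FASTA header starts with the isolate name followed by ";".

```python
def read_fasta(path):
    """Return {header: sequence} for a FASTA file."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                records[name] = []
            elif name is not None and line:
                records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def by_isolate(records):
    """Key sequences by isolate; ASSUMES the header looks like 'isolate;anything'."""
    return {header.split(";")[0]: seq for header, seq in records.items()}


def concatenate_alignments(gene_alignments, isolates):
    """Join per-gene alignments end to end, one row per isolate.

    gene_alignments: list of {isolate: aligned_sequence} dicts, one per gene.
    Isolates missing from a gene are padded with gaps of that gene's width.
    """
    rows = {iso: [] for iso in isolates}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("alignment is empty or its rows differ in length")
        width = lengths.pop()
        for iso in isolates:
            rows[iso].append(aln.get(iso, "-" * width))
    return {iso: "".join(parts) for iso, parts in rows.items()}
```

If every per-gene file is present and well formed but this last join never runs, the result is exactly what was reported: a full aligned_gene_sequences folder and no combined alignment file.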

Can I ask if you got the core_genome_alignment file on your first run using MAFFT as the aligner, or was it only when re-running with clustal?

Panaroo is meant to create a core_genome_alignment file regardless of whether the option provided to -a is core or pan. Am I correct in understanding that this was only produced when you used the flag -a core? This is not expected behaviour and I will have a look at fixing it.

From your message, I'm also not sure if you're expecting a pan_genome_alignment file? Panaroo currently does not produce this file, as it would be very large and we didn't think it would be of substantial interest to users. Would this be something you are interested in? I am currently reworking some bits of the alignment code and could possibly add this feature if it would be useful.

I haven't seen that PRANK error before. It looks like it is occurring on the addB gene. Is this the first gene it tries to align, or does it manage to align some others before encountering this error? Would you mind sharing the unaligned addB.fasta file so I can try to recreate this error?

Many thanks!

martinastoycheva commented 4 years ago

Hello,

  1. I only get a core alignment with the --aligner mafft and -a core flags; none of the other combinations seem to produce any alignments when run with the panaroo command. However, panaroo-msa produces an alignment with both mafft and clustal. I will try it with prank now and let you know whether it works.

  2. I believe that a pan_genome_alignment file would be a useful feature.

  3. PRANK does manage to align a few genes and then stops at the addB gene for some reason. I have attached the unaligned addB.fasta: addB.fasta.zip

  4. I have asked the cluster admins to install panaroo and dependencies so that I can see whether the strange behaviour persists when I am not using a conda environment, as Slurm and conda seem to behave oddly together sometimes.

gtonkinhill commented 4 years ago

Sorry for the slow response.

  1. I'm not sure what is going wrong here. The panaroo-msa script should just run the final stages of the main panaroo pipeline. So I'm not sure why one would work and the other not. Could something have caused the main pipeline to end early? Perhaps a time limit on the server?

  2. I think this alignment could get very large and in many datasets would mostly be made up of missing sites, due to the large number of accessory genes. I'll have more of a think about it, but I'm not sure what the use case would be at the moment.
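
As a back-of-the-envelope illustration of the "mostly missing sites" point, the padded share of a concatenated pan-genome alignment can be estimated from gene presence and gene lengths alone. This is a sketch with made-up inputs; missing_fraction is a hypothetical helper, not a panaroo function.

```python
def missing_fraction(presence, gene_lengths, n_isolates):
    """Share of cells in a concatenated alignment that would be pure padding.

    presence: {gene: set of isolates that carry the gene}
    gene_lengths: {gene: alignment width in columns}
    Only whole-gene absence is counted; gaps inside an alignment are ignored.
    """
    total = n_isolates * sum(gene_lengths.values())
    if total == 0:
        return 0.0
    missing = sum(
        gene_lengths[gene] * (n_isolates - len(presence[gene]))
        for gene in gene_lengths
    )
    return missing / total


if __name__ == "__main__":
    # Invented toy numbers: one core gene in all 3 isolates, one accessory gene in 1.
    presence = {"core1": {"a", "b", "c"}, "acc1": {"a"}}
    lengths = {"core1": 100, "acc1": 300}
    print(missing_fraction(presence, lengths, 3))
```

In the toy case, 600 of the 1200 cells are padding, and the share grows as rarer accessory genes are added.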

  3. If an old version of prank is installed it can sometimes interfere with the conda installation (see #62). The old version of prank has a bug which can cause the alignment to crash on some sequences.
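
One quick way to see whether more than one prank is visible is to list every match on PATH in lookup order. The helper below is a generic sketch (executables_on_path is a made-up name) and does not depend on panaroo or conda.

```python
import os


def executables_on_path(name, path=None):
    """List every executable called `name` on PATH, in lookup order."""
    path = os.environ.get("PATH", "") if path is None else path
    hits = []
    for directory in path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(directory or ".", name)
        if (
            os.path.isfile(candidate)
            and os.access(candidate, os.X_OK)
            and candidate not in hits
        ):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for hit in executables_on_path("prank"):
        print(hit)
```

The first path printed is the one a plain prank call would normally run. If it does not sit under the active conda environment (for example the directory in the CONDA_PREFIX variable), an older system-wide copy may be shadowing the conda one.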

martinastoycheva commented 4 years ago

Hello,

I saw that there is an update on the panaroo conda package and that fixed both of the issues: prank and core genome alignment.

Thank you very much!