gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License
269 stars 34 forks

Question regarding alignment output and -a pan #220

Closed jvfe closed 1 year ago

jvfe commented 1 year ago

I've noticed that, when running panaroo on the same dataset, -a core writes the core gene alignment to the core_gene_alignment.aln file, but with -a pan this file is empty. Is this the intended behavior? From reading the parameter description, I expected -a pan to output the alignment of all genes. Is there any way to override this behavior and get a complete alignment file with all genes, or am I supposed to manually concatenate the core and accessory gene alignments after running panaroo?

Thanks.

nzmacalasdair commented 1 year ago

Hi, thanks for getting in touch. This is definitely not expected behaviour, and I think it may be due to some changes made when implementing the entropy filter for core genes. Could I ask if you could check if you got an error message when running panaroo with the -a pan flag? Thanks!

jvfe commented 1 year ago

Hi, here's the command I ran:

panaroo \
    -a pan --clean-mode strict --len_dif_percent 0.70 -c 0.7 -f 0.5 \
    -t 12 \
    -o results \
    -i SRR14022737.gff SRR14022764.gff SRR14022735.gff SRR14022754.gff

If I change '-a pan' above to '-a core' I get the output as normal.

Here's the STDOut of the panaroo run:

Panaroo Standard Output

```bash
pre-processing gff3 files...
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -T 12 -i results/combined_protein_CDS.fasta -o results/combined_protein_cdhit_out.txt -c 0.7 -s 0.7 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
Started: Fri Feb 24 11:32:13 2023
================================================================
Output
----------------------------------------------------------------
Your word length is 2, using 5 may be faster!
total seq: 10003
longest and shortest : 2850 and 29
Total letters: 2956713
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 4M
Buffer : 12 X 11M = 134M
Table : 2 X 0M = 0M
Miscellaneous : 0M
Total : 139M
Table limit with the given memory limit:
Max number of representatives: 744016
Max number of word counting entries: 44348059
# comparing sequences from 0 to 714
---------- new table with 220 representatives
# comparing sequences from 714 to 1377
..................... ---------- 334 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 1043 to 1683
................... ---------- 334 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 1349 to 1967
.................... ---------- 315 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 1652 to 2248
.................... ---------- 229 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2019 to 2589
.................... ---------- 260 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2329 to 2877
................... ---------- 248 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2629 to 3155
................... ---------- 222 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 2933 to 3438
..................... ---------- 184 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 3254 to 3736
................... ---------- 165 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 3571 to 4030
.................... ---------- 125 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 3905 to 4340
..................... ---------- 118 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 4222 to 4634
................... ---------- 101 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 4533 to 4923
.................... ---------- 77 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 4846 to 5214
.................... ---------- 87 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 5127 to 5475
.................... ---------- 31 remaining sequences to the next cycle
---------- new table with 100 representatives
# comparing sequences from 5444 to 5769
...................---------- new table with 97 representatives
# comparing sequences from 5769 to 6071
....................---------- new table with 85 representatives
# comparing sequences from 6071 to 6351
...................---------- new table with 87 representatives
# comparing sequences from 6351 to 6611
...................---------- new table with 92 representatives
# comparing sequences from 6611 to 6853
...................---------- new table with 78 representatives
# comparing sequences from 6853 to 7078
...................---------- new table with 71 representatives
# comparing sequences from 7078 to 7286
...................---------- new table with 69 representatives
# comparing sequences from 7286 to 7480
...................---------- new table with 63 representatives
# comparing sequences from 7480 to 7660
..................---------- new table with 63 representatives
# comparing sequences from 7660 to 7827
..................---------- new table with 58 representatives
# comparing sequences from 7827 to 7982
...................---------- new table with 65 representatives
# comparing sequences from 7982 to 8126
...................---------- new table with 58 representatives
# comparing sequences from 8126 to 8260
...................---------- new table with 51 representatives
# comparing sequences from 8260 to 8384
..................---------- new table with 43 representatives
# comparing sequences from 8384 to 8499
...................---------- new table with 37 representatives
# comparing sequences from 8499 to 8606
..................---------- new table with 45 representatives
# comparing sequences from 8606 to 8705
...................---------- new table with 32 representatives
# comparing sequences from 8705 to 8797
...................---------- new table with 39 representatives
# comparing sequences from 8797 to 8883
.................---------- new table with 29 representatives
# comparing sequences from 8883 to 8963
................---------- new table with 29 representatives
# comparing sequences from 8963 to 9037
....................---------- new table with 27 representatives
# comparing sequences from 9037 to 10003
..................... .......... 10000 finished 3325 clusters
---------- new table with 389 representatives
10003 finished 3327 clusters
Approximated maximum memory consumption: 139M
writing new database
writing clustering information
program completed !
Total CPU time 19.97
running cmd: cd-hit -T 12 -i results/combined_protein_CDS.fasta -o results/combined_protein_cdhit_out.txt -c 0.7 -s 0.7 -aL 0.0 -AL 99999999 -aS 0.0 -AS 99999999 -M 0 -d 999 -g 1 -n 2
generating initial network...
Processing paralogs...
collapse mistranslations...
Processing depth: 1
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Processing depth: 2
Iteration: 1
Processing depth: 3
Iteration: 1
collapse gene families...
Processing depth: 1
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Processing depth: 2
Iteration: 1
Iteration: 2
Iteration: 3
Processing depth: 3
Iteration: 1
trimming contig ends...
refinding genes...
Number of searches to perform: 1107
Searching...
translating hits...
removing by consensus...
Updating output...
Number of refound genes: 81
collapse gene families with refound genes...
Processing depth: 1
Iteration: 1
Iteration: 2
Processing depth: 2
Iteration: 1
Processing depth: 3
Iteration: 1
writing output...
generating pan genome MSAs...
```

There don't appear to be any errors.

jvfe commented 1 year ago

Sorry, I forgot to say which version of Panaroo I was running - this is version 1.2.9.

nzmacalasdair commented 1 year ago

Hi, thanks for sending all of this along. I wasn't able to recreate the issue with the latest release (1.3.2) on one of my own datasets, and the fact that there wasn't an error message means this isn't related to what I thought was the issue (the entropy filter).

Would it be possible for you to update to 1.3.2 and see if this bug persists? I think we fixed a similar issue sometime recently in commit 01abc9c (1.3.0) or later.

jvfe commented 1 year ago

Oh, I just bumped the version of panaroo I was using to 1.3.2 and can confirm the bug no longer happens. Thank you!!
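For anyone who hits the same symptom, here is a minimal sketch of a check for the empty-alignment case described in this thread. It is not part of Panaroo. It assumes the alignment file is FASTA-formatted (one `>` header per sequence) and that the output directory is `results`, as in the command above. The helper names `count_fasta_records` and `alignment_is_empty` are made up for illustration.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count '>' header lines in a FASTA-style file (assumed format)."""
    count = 0
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


def alignment_is_empty(path):
    """Return True if the file is missing, zero bytes, or has no records."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return True
    return count_fasta_records(p) == 0


if __name__ == "__main__":
    # "results" mirrors the -o value used in this thread; adjust as needed.
    aln = Path("results") / "core_gene_alignment.aln"
    if alignment_is_empty(aln):
        print(f"{aln} is missing or empty")
    else:
        print(f"{aln} contains {count_fasta_records(aln)} sequences")
```

Running this after both the `-a core` and `-a pan` runs gives a quick way to confirm whether the installed version shows the behaviour reported here.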