IQ-TREE failing for test data set

hoelzer commented 1 year ago

Hey, I run

nextflow run hoelzer-lab/ribap -r dev --fasta '../2022-06-05-cpsittaci-icarus/2022-07-cleaned-data-from-kevin/reassembled-n38/cps_02*.fasta' --cores 4 --max_cores 8 -profile local,docker -w work --output ribap-results --tree --bootstrap 1010 -resume

which worked fine until the IQ-TREE step:

Error executing process > 'iqtree'

Caused by:
  Process `iqtree` terminated with an error exit status (134)

Command executed:

  iqtree -spp core_genome.nex -bb 1010 --threads-max 4 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Command exit status:
  134

Command error:
  OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
  ERROR: phylotesting.cpp:593: std::string computeFastMLTree(Params&, Alignment*, ModelCheckpoint&, ModelsBlock*, int&, int, std::string): Assertion `subst_names.size() == rate_names.size()' failed.
  ERROR: STACK TRACE FOR DEBUGGING:
  ERROR: 1   funcAbort()
  ERROR: 2   ()
  ERROR: 3   gsignal()
  ERROR: 4   abort()
  ERROR: 5   ()
  ERROR: 6   computeFastMLTree(Params&, Alignment*, ModelCheckpoint&, ModelsBlock*, int&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  ERROR: 7   runModelFinder(Params&, IQTree&, ModelCheckpoint&)
  ERROR: 8   runPhyloAnalysis(Params&, Checkpoint*, IQTree*&, Alignment*&)
  ERROR: 9   runPhyloAnalysis(Params&, Checkpoint*)
  ERROR: 10   main()
  ERROR: 11   __libc_start_main()
  ERROR: 12   ()
  ERROR: 
  ERROR: *** IQ-TREE CRASHES WITH SIGNAL ABORTED
  ERROR: *** For bug report please send to developers:
  ERROR: ***    Log file: core_genome-modeltest.log
  ERROR: ***    Alignment files (if possible)
  .command.sh: line 2:     9 Aborted                 (core dumped) iqtree -spp core_genome.nex -bb 1010 --threads-max 4 -nt AUTO -m TEST -pre "$(basename core_genome.nex .nex)"-modeltest

Could it be that the input genomes are too similar :D the core gene set comprises 980 MSAs.

I also attach the input FASTAs: cps-test-fastas.tar.gz

klamkiew commented 1 year ago

Getting the same issue, started some small sanity checks:

    grep '>' *.aln | cut -d":" -f2 | sort | uniq -c
    980 >cps_02DC14_polished
    980 >cps_02DC15_polished
    980 >cps_02DC16_polished
    980 >cps_02DC18_polished
    980 >cps_02DC21_polished
    980 >cps_02DC22_polished
    980 >cps_02DC23_polished
    980 >cps_02DC24_polished

So, the individual alignments have all records needed; that's not the issue.

 cat core_genome.nex | grep charset | cut -d' ' -f4 | uniq | wc -l
980

You are right, there are 928 individual gene alignments that seem to be identical; which might be a problem:

grep "singleton" core_genome-modeltest.log | sort -nk1,1 | cut -d' ' -f1 | uniq -c
# grep: leads to such a line: 0 parsimony-informative, 0 singleton sites, 334 constant sites
# sort, cut and uniq just outputs the first number
    928 0
     39 1
      1 2
      1 10
      1 19
      1 89
      1 97
      1 100
      1 153
      1 200
      1 268
      1 360
      1 378
      1 392
      1 523

The nexus file also has all alignments. Seems to be sth internal by IQTree.

If downloaded the latest (pre-)release if IQTree (which is version 2.2.2.2) directly from github, yielding the same issue. Here

I filtered core genes with cdhit -c 1.0 and just used those for the IQtree call, that have more than two clusters after the cdhit call -- runs without problems on version 2.2.2.2 and also with the version in our conda env (2.2.0.3);

I think we can omit all genes part of the core gene set that are 100% identical between all input strains - they won't contribute to anything in the first place, would they? Thus, we can most likely speed up the tree calculation in the end in many cases.

klamkiew commented 1 year ago

@hoelzer schau mal, obs tut ;)

I was now able to do the iqtree step individual with both versions and I also resumed the NF pipeline and got results. the basic.yml now has cd-hit as dependency, we might change this later down the road.

hoelzer commented 1 year ago

@klamkiew nice! Yes agree, we can omit MSAs that anyway only have completely identical sequences. Did you implement that now via CD-HIT check? Fine for me, we should then just also add it to the container.

I will test that later today when I'm back at the machine where I run into that. Thx!

klamkiew commented 1 year ago

Yep, there is now a cdhit -c 1.0 call on all MSAs; if the result yields 1 cluster, I discard it from the nexus file generation (and thus also from iqtree)

hoelzer commented 1 year ago

Sounds reasonable, but then I suggest we make a cd-hit env file and also a cd-hit container. Otherwise, it's a bit "hidden" that cdhit is used. The container I have already, and I also made sure that python=3.8 is installed as we have it in the basics.env

docker pull nanozoo/cdhit:4.8.1--c697693

I will test it then tonight : ) <3

klamkiew commented 1 year ago

alright, guess I can also do an extra env for the filtering step. At the moment I was exploiting the basic env, but you are right - makes more sense to encapsulate it the right way.

klamkiew commented 1 year ago

alright; docker container is referred to in the config and there's an extra conda env now for cdhit as well :)

hoelzer commented 1 year ago

Fixed a typo in the container config and the command itself! I have to use cd-hit instead of cdhit and then it run through! But before we close this: does this also work for you? Also when I install cd-hit via conda the command is cd-hit and not cdhit @klamkiew

klamkiew commented 1 year ago

... our system has an alias... /usr/bin/cdhit points to /usr/bin/cd-hit... didn't notice it until now. Thanks for pointing this out;

hoelzer commented 1 year ago

alright then cd-hit is correct! And it works for me

hoelzer-lab / ribap

IQ-TREE failing for test data set #42