Closed hoelzer closed 1 year ago
Getting the same issue, started some small sanity checks:
grep '>' *.aln | cut -d":" -f2 | sort | uniq -c
980 >cps_02DC14_polished
980 >cps_02DC15_polished
980 >cps_02DC16_polished
980 >cps_02DC18_polished
980 >cps_02DC21_polished
980 >cps_02DC22_polished
980 >cps_02DC23_polished
980 >cps_02DC24_polished
So, the individual alignments have all records needed; that's not the issue.
cat core_genome.nex | grep charset | cut -d' ' -f4 | uniq | wc -l
980
You are right, there are 928 individual gene alignments that seem to be identical, which might be a problem:
grep "singleton" core_genome-modeltest.log | sort -nk1,1 | cut -d' ' -f1 | uniq -c
# grep: leads to such a line: 0 parsimony-informative, 0 singleton sites, 334 constant sites
# sort, cut and uniq just outputs the first number
928 0
39 1
1 2
1 10
1 19
1 89
1 97
1 100
1 153
1 200
1 268
1 360
1 378
1 392
1 523
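For anyone without the shell tools at hand, the same tally can be sketched in Python. This is only an illustration: it assumes the log lines look exactly like the example in the comment above ("0 parsimony-informative, 0 singleton sites, 334 constant sites"), and the function names are made up for this sketch.

```python
import re
from collections import Counter

# Assumed line shape, copied from the example in the comment above:
#   "0 parsimony-informative, 0 singleton sites, 334 constant sites"
SITE_LINE = re.compile(
    r"^\s*(\d+) parsimony-informative, (\d+) singleton sites, (\d+) constant sites"
)


def tally_informative_sites(log_lines):
    """Count how many alignments report each number of parsimony-informative sites."""
    counts = Counter()
    for line in log_lines:
        match = SITE_LINE.match(line)
        if match:
            # The first number on the line is the parsimony-informative count,
            # which is what `cut -d' ' -f1` extracts in the shell version.
            counts[int(match.group(1))] += 1
    return counts


def print_tally(counts):
    """Print 'count value' pairs sorted by value, similar to `uniq -c` output."""
    for n_sites in sorted(counts):
        print(f"{counts[n_sites]:>7} {n_sites}")
```

Passing an open file handle of core_genome-modeltest.log to tally_informative_sites and the result to print_tally should give a table comparable to the one above, as long as the log format matches the assumption.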
The nexus file also has all alignments. Seems to be something internal to IQ-TREE.
I downloaded the latest (pre-)release of IQ-TREE (version 2.2.2.2) directly from GitHub, and it yields the same issue.
Here I filtered the core genes with cdhit -c 1.0 and used only those that have more than two clusters after the cdhit call for the IQ-TREE run. That runs without problems on version 2.2.2.2 and also with the version in our conda env (2.2.0.3).
I think we can omit all genes in the core gene set that are 100% identical across all input strains; they won't contribute anything in the first place, will they? That way we can most likely speed up the tree calculation in many cases.
@hoelzer have a look and see whether it works ;)
I was now able to run the iqtree step individually with both versions, and I also resumed the NF pipeline and got results. The basic.yml now has cd-hit as a dependency; we might change this later down the road.
@klamkiew nice! Yes, agreed, we can omit MSAs that only contain completely identical sequences anyway. Did you implement that now via a CD-HIT check? Fine for me, we should then just also add it to the container.
I will test that later today when I'm back at the machine where I ran into that. Thx!
Yep, there is now a cdhit -c 1.0 call on all MSAs; if the result yields 1 cluster, I discard it from the nexus file generation (and thus also from iqtree)
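A rough Python sketch of the same idea, for illustration only. Note that it swaps the cdhit -c 1.0 call for a plain string comparison of the aligned sequences, which is not what the pipeline does; all function names here are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into a {header: sequence} dict."""
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def is_uninformative(records):
    """True when every sequence in the alignment is identical (case-insensitive)."""
    return len({seq.upper() for seq in records.values()}) <= 1


def keep_informative(alignments):
    """alignments maps gene name -> {header: sequence}; return the genes worth keeping."""
    return [gene for gene, recs in alignments.items() if not is_uninformative(recs)]
```

An MSA for which is_uninformative returns True corresponds roughly to the "1 cluster" case described above and would be left out of the nexus file. Unlike CD-HIT, this comparison requires exact equality of the aligned strings, so the two checks are not guaranteed to agree in every case.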
Sounds reasonable, but then I suggest we make a cd-hit env file and also a cd-hit container. Otherwise, it's a bit "hidden" that cdhit is used. The container I have already, and I also made sure that python=3.8 is installed as we have it in the basics.env
docker pull nanozoo/cdhit:4.8.1--c697693
I will test it then tonight : ) <3
alright, guess I can also do an extra env for the filtering step. At the moment I was exploiting the basic env, but you are right - makes more sense to encapsulate it the right way.
alright; docker container is referred to in the config and there's an extra conda env now for cdhit as well :)
Fixed a typo in the container config and the command itself! I have to use cd-hit instead of cdhit, and then it ran through! But before we close this: does this also work for you? Also, when I install cd-hit via conda, the command is cd-hit and not cdhit. @klamkiew
... our system has an alias... /usr/bin/cdhit points to /usr/bin/cd-hit... didn't notice it until now. Thanks for pointing this out;
alright, then cd-hit is correct! And it works for me
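Side note on the cd-hit vs. cdhit naming: a wrapper script could guard against this kind of mismatch by looking up both names before calling the tool. The following is only a sketch of that idea, not something the pipeline does; find_cdhit and its parameters are hypothetical, and only shutil.which is a real standard-library call.

```python
import shutil


def find_cdhit(candidates=("cd-hit", "cdhit"), which=shutil.which):
    """Return the path of the first candidate executable found on PATH, or None.

    `which` is injectable so the lookup can be exercised without cd-hit installed.
    """
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None
```

The conda name cd-hit is tried first, and cdhit only serves as a fallback for systems with a local alias like the one mentioned above.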
Hey, I run
which worked fine until the IQ-TREE step:
Could it be that the input genomes are too similar :D the core gene set comprises 980 MSAs.
I also attach the input FASTAs: cps-test-fastas.tar.gz