simonhmartin / twisst

Topology weighting by iterative sampling of sub-trees
GNU General Public License v3.0
70 stars 18 forks source link

problem in getting trees by running raxml_sliding_windows.py #31

Open Niloofar-Alaei opened 2 years ago

Niloofar-Alaei commented 2 years ago

Hi Simon I want to run TWISST for my dataset. I followed your tutorial to get the trees as input for TWISST.

I have a diploid genotype and my data is not suitable for phasing, then I used your script to produce the .geno file. And then, I used raxml_sliding_windows.py to get the window trees. It ran without any error and also gave me the .data.tsv file but the .trees.gz file was included only NA. I also ran this script for the example file (msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz), but the same happens.

I used python2 and this command:

python raxml_sliding_windows.py -g msms_4of10_l50k_r500_sweep.seqgen.SNP.geno.gz --prefix out.50 --windType sites -w 50 --model HKY85

Do you have any idea why it happened?

I also have another question: Is it possible to use the ML trees I got by using sequences (including both variant and invariant sites) in IQTREE2 as input for TWISST?

With the best and thanks Niloo

simonhmartin commented 2 years ago

Hi Niloo, You can use any trees s input for Twisst. I personally recommend using the phyml windows script, as we got better results with that approach when testing. To diagnose your problem, I recommend trying the --test option, which will cause the script to make only 10 windows, and retain the temporary files so you can check them for issues. The --log option can also help in case raxml or phyml are giving errors. I hope this helps you find the problem. Simon

Niloofar-Alaei commented 2 years ago

Thanks a lot for your response.

Then I will use the ML trees that I got from the IQtree. Because in that approach, I used the unphased data.

With the best

simonhmartin commented 2 years ago

Just to add an opinion on this. I assume you mean that you have taken each diploid individual as a single tip in the tree? Using this approach is not ideal, because actually each haploid allele inside a diploid cell has an independent evolutionary history, and could belong in an entirely different part of the genealogy. When forcing ML software to consider a diploid individual as a tip, you are violating an assumption of the tree inference, which is that each tip has a single evolutionary history. Instead, the ML inference will be trying to force two different histories into one for each individual. In practice, running Twisst on a tree in which the two alleles from each individual are forced into a single tip will probably not impact your ability to find extreme outliers: in these regions of the genome that approach a single monophyletic tree, the alleles in a single diploid individual will tend to be closely related, so it is not too unreasonable to combine them in the tree. However, I would be very cautious about interpreting intermediate topology weightings, because these could reflect trees where the ML inference did the best it could with highly inconsistent signals, potentially resulting in a tree that is quite far from the reality.

Niloofar-Alaei commented 2 years ago

Thanks, Simon, for letting me know your opinion. Yes, I have an unphased vcfFile (including both variant and invariant sites) and then use my script to define some non-overlapping windows that include IUPAC Ambiguity Codes. Then as you said, in the trees, each diploid individual is a single tip.

Then you think it's better just to use the vcf files including SNPs and then phase it. am I right?

simonhmartin commented 2 years ago

Hi Niloo, sorry for never responding to this. Yes, my recommendation is to first phase before making trees, if possible. Simon