NA output on phyml_sliding_windows.py

JoaoPimenta88 commented 2 years ago

Hey there,

I am trying to run Twisst on a VCF that resulted from targeted enrichment capture. I was able to parse the VCF to geno format using parseVCF.py, but when I use phyml_sliding_windows.py, results for every window are NA. This is the command I ran on a server: python2 /home/jpimenta/software/genomics_general/phylo/phyml_sliding_windows.py -T 4 -g call_FMT_DP_8_FMT_GQ_30.geno.gz --prefix output_call_FMT_DP_8_FMT_GQ_30.phyml_bionj.w50 -w 50 --windType sites --model GTR --optimise n --tmp ./tmp --log ./log --phyml /usr/local/bin/phyml/src/phyml

I tried to install phyml locally and I also tried to run on a different server but the result was the same. Any suggestion on how to solve this problem? Could this be cause my excess of missing data?

Thank you.

simonhmartin commented 2 years ago

Hi, can you post the output from the log file?

JoaoPimenta88 commented 2 years ago

Hi Simon, after the run the log folder remained empty but this is what I get when I ran the command:

Writing final results... 233 windows were tested. 233 results were written.

I should also mention that I am using Python 2.7.5.

simonhmartin commented 2 years ago

Ok next thing to try is to add the option --test, which will just make 10 windows, and will leave the temporary files in the temprary directory. If you go and look at what's in there, you will be able to see if phyml actually ran at all or if only the alignment files were made.

JoaoPimenta88 commented 2 years ago

Inside the ./tmp folder there is another folder called 'phyml_tmp3TJLaJ' but inside this folder there is nothing.

simonhmartin commented 2 years ago

Ah ok, so the script is deciding that all windows are failing, probably due to missing data. If you have missing data in some individuals (i.e. genotype is N/N), you need to set the --minPerInd option so some value smaller than the window size. Try a very low value first. Does this solve it?

JoaoPimenta88 commented 2 years ago

Hi Simon. Setting the --minPerInd solves the problem. Thank you.

dcmain commented 2 years ago

Hi Simon, I have been having the same problem for both phyml_sliding_windows.py and raxml_sliding_windows.py. The analysis runs but the trees file contains only NAs. I have tried various solutions, I removed an individual that had a lot of missing data and I set the --minPerInd very low but I keep getting only NAs in my trees file. I am running python2.7 but I also tried running python3.6 and manipulating the scripts to make them compatible, in both cases the analysis ran without any error messages but after it had finished running there were only NAs in my trees file. I converted my VCF file to your geno format using parseVCF.py but I'm not sure if there is something wrong with my input file (attached below)?

Do you perhaps have any suggestions that might solve my problem?

Kind regards,

Devon Bradypodion_phylogeny_vcftools_cleaned_snps.geno.gz

simonhmartin commented 2 years ago

Hi Devon, have you tried adding the --log option (which prints a log file for phyml) and the --test option (which runs just 10 windows and does not delete the temporary files, so you can check how they look)?

Also - I couldn't help but notice your study genus name. A group that I love and have dreamed of working on! I would love to chat separately over email about your research, if you are interested. Simon

simonhmartin commented 2 years ago

Hi Devon, I was able to get the phyml windows script to run with your data. The first issue is because it (I assume) is reduced representation data, your SNPs are few and far between. So even with 100kb windows, most windows have 40-60 SNPs, so you would need to set bot -M and MI very low. I set window size to 1 Mb, with a minimum of 100.

The second issue I encountered was that there are some illegal characters in the geno file, specifically * This is causing phyml to fail, saying that the data is not DNA. You'll need to replace these with N first.

Finally, I see the genotypes are partially, but not completely, phased. Perhaps you have used something like whatshapp but not a probabilistic phaser (which makes sense, because phasing is not reliable for small sample sizes). This means that the phasing will be somewhat arbitrary for many sites. Since each haplotype in a diploid individual represents a separate tip in the phylogeny, using jumbled phase, especially over such a large range as 1 Mb windows, would result in a tree that doesn't represent reality, but rather represents a sort of average position of a combination of the two tips haplotypes in each individual. Personally I'm not a big fan of making trees from unphased data. I think site-based divergence statistics are more meaningful than trying to force bits of teh genome with distinct histories into a single genealogy. Happy to discuss this more over email. Simon

dcmain commented 2 years ago

Hi Simon thanks so much for your assistance with this! These are just preliminary data at the moment, I have only got 8 species in this dataset but ultimately the phylogeny will contain upwards of 20 species with multiple clades (ecomorphs) for some species. I am hoping that my SNP density will improve with increased taxon sampling and hopefully some better quality samples, at this point I just wanted to see if I could get the script to run on my dataset to see if the phylogeny looked reasonable. I ran a quick IQ-TREE analysis on a nexus file that I created from my VCF and the topology didn’t look right at all, I wanted to try the sliding windows approach to see if there was topological congruency across sites, obviously we would expect there to be differences for many sites but I wanted to try and reconcile the best species tree from my sliding windows. I’m quite new to python scripting so when I wasn’t getting any trees I wasn’t sure if the problem had to do with my input file or my scripts, I’m glad you got it to run, because now I know that it was probably my input file. Thanks so much for your assistance and suggestions, I’ll try and set a very large window size and set the minimum to 100 sites per window. I’ll also replace the illegal characters with Ns. Because of my small sample size, I didn’t actually phase the genotypes - I intended to when I had a larger dataset but I just wanted to see if I could get the script to run on the unphased dataset at this point. I realize this isn’t really best practice with phylogenetics because haploid and diploid individuals will confound the topology. I will let you know if I manage to get it running with your suggestions! Thanks so much again!

Kind regards

Devon

On Wed, 27 Jul 2022 at 16:45, Simon martin @.***> wrote:

Hi Devon, I was able to get the phyml windows script to run with your data. The first issue is because it (I assume) is reduced representation data, your SNPs are few and far between. So even with 100kb windows, most windows have 40-60 SNPs, so you would need to set bot -M and MI very low. I set window size to 1 Mb, with a minimum of 100.

The second issue I encountered was that there are some illegal characters in the geno file, specifically * This is causing phyml to fail, saying that the data is not DNA. You'll need to replace these with N first.

Finally, I see the genotypes are partially, but not completely, phased. Perhaps you have used something like whatshapp but not a probabilistic phaser (which makes sense, because phasing is not reliable for small sample sizes). This means that the phasing will be somewhat arbitrary for many sites. Since each haplotype in a diploid individual represents a separate tip in the phylogeny, using jumbled phase, especially over such a large range as 1 Mb windows, would result in a tree that doesn't represent reality, but rather represents a sort of average position of a combination of the two tips haplotypes in each individual. Personally I'm not a big fan of making trees from unphased data. I think site-based divergence statistics are more meaningful than trying to force bits of teh genome with distinct histories into a single genealogy. Happy to discuss this more over email. Simon

— Reply to this email directly, view it on GitHub https://github.com/simonhmartin/genomics_general/issues/64#issuecomment-1196855498, or unsubscribe https://github.com/notifications/unsubscribe-auth/AYFXUC5T5S2L5VMBZB7CDJ3VWFDPNANCNFSM5H2HJ72Q . You are receiving this because you commented.Message ID: @.***>

-- Kind regards,

Devon Main PhD Candidate

simonhmartin commented 2 years ago

Hi Devon, Sounds like a good plan to explore the data a bit more. Something I like to do to explore the signal with unphased diploid data is to make a NeighborNet network based on a distance matrix. I use my script distMat.py to compute distance matrices between diploids (so you get the average distance between the two pairs of haploid genomes). (Instructions here). These networks give a clear indication if there are discordant phylogenetic signals, as we saw in Figure 1 of this paper. Simon

dcmain commented 2 years ago

Hi Simon, thanks so much! I'll take a look at some networks using your script!

ghost commented 1 year ago

Dear Simon,

I am attempting to use the phyml_sliding_windows.py script in the TWISST pipeline, but I am encountering a strange error. At first I thought that my error was because of missing data so I used -M and -Mi as recommended but the error persisted. So I subsetted the genotype file with the --test flag and included the --log flag to see what was going on with phyml. I found this error . Unknown character state : '|' I am using phyml/3.3.20220715, if that helps. I look forward to hearing from you.

Update: I tested my program versions and python scripts using the phased data provided in the example directory. I think the error lies in the generation of my phased data. I will test this and post an update.

simonhmartin commented 1 year ago

Hi, yes I think you might have genotypes in there that do not match the expected A|T format. For example, indels look like AA|T, and those will likely cause an error.

ghost commented 1 year ago

Dear Simon,

It looks like my bcftools command didn't remove indels from my original run as intended. I will update this comment accordingly after Beagle finishing phasing the new filtered data. Thanks!

Update: Hi Simon, the issue with my data appears to be happening before phyml_sliding_windows.py. The parseVCF is pulling N's for every genotype call. I was wondering if you had any suggestions. I tried multiple filters in an attempt to see if it was something to do with my filtering. I attempted the most minimum filtering and I still got back N's for each genotype. I can open a new thread since this is unrelated to the phyml script. Let me know if you would like to see the data I am working with, I can send it via email.

python parseVCF.py -i noindels.w2000.ov500.vcf.gz --skipIndels --gtf flag=DP min=1

Update2 Hi Simon, are tri-allelic sites unexpected by the parseVCF script?

ghost commented 1 year ago

Dear Simon,

Apologies for starting a second message. So I removed indels and only retained biallelic sites before phasing using beagle with windows of 10k and 1k overlap. The issue that I am experiencing with phyml_sliding_windows is due to the parsing my genotypes. In the phased data, I have genotype calls but the parse script is calling all my genotypes "N's" I am not sure what I could be doing wrong here. I can email you the phased output if you'd like. I could call SNPs again to see if that's the problem, but I think something is going on with the input data after phasing. Any recommendations would be grand.

Update I made a very silly mistake. Beagle does not retain depth so parsevcf was not pulling anything from my phased files. Apologies.

MatteoSebastianelli commented 1 month ago

Hi Martin,

I am also experiencing some issues when running the phyml_sliding_windows.py script. This is the command I am running: python ./phylo/phyml_sliding_windows.py --threads 2 \ -g ./chr_vcfs/camarhynchus_propinqua_twisst_chr1A.geno.vcf.gz \ --prefix camarhynchus_propinqua_twisst_prova1A_minPerInd \ --windType sites -w 1000 --minSites 10 --minPerInd 2 --model GTR --log ./logs --test --optimise n --outgroup outgroups_propinqua.txt

It seems that the script cannot find the phyml command. From the log output I get the following: phyml command: phyml --input /scratch/phyml_tmp8if4v2au/chr1A_3265703_3448601_rk_56fnx.phy --model GTR -o n -b 0 >> ./logs sh: phyml: command not found Sending window 6 to queue. Length: 1000 Window 6 received for analysis, length: 1000

Of course, the output files .data.tsv and .trees have just NAs. Do you know how can I get phyml to work?

outgroups_propinqua.txt

simonhmartin commented 1 month ago

Hi Matteo, all I can suggest is testing whether you can run phyml from the command line. If not, you'll need to check your phyml installation.

simonhmartin / genomics_general

NA output on phyml_sliding_windows.py #64