sungsik-kong / PhyNEST.jl

A Julia package for estimating phylogenetic networks from genomic data
MIT License

Cannot import phylip file #14

Open dylanHco opened 1 month ago

dylanHco commented 1 month ago

Hi Sungsik, thank you for creating this program. I would love to try it out, but I am having some trouble importing my sequential phylip file.

Here is what the first few lines look like.

86 561119 A_hubrichtii_CL7_S37 -----gtacctgagtctttatcttttttttattttttttatttagataacctgagtctttatcatcatgccaatttattaagtcactcccctgatcaataccttttgtcactgtttcttgttgactgcacaatttatgttttgcatgtgtgcttaaatgttcttaaaagttggttttgataatgttttacattgcttggagctggctgctctttattttgcatgctgcttctttgac---------------------------------------ttttga

Thanks! Dylan

sungsik-kong commented 1 month ago

Hi Dylan,

I think the reason is that the sequence is arranged over multiple lines. Was your error message ErrorException("The sequence length indicated in the first line of the file [alignment.txt] does not match with actual length for the taxon number 1 with name [A_hubrichtii_CL7_S37].")?

How about reformatting the phylip file in the following way and retrying?

86 561119
A_hubrichtii_CL7_S37 -----gtacctgagtctttatctttttttt...
Another species name -----gtacctgagtctttatctttttttt...
Another species name -----gtacctgagtctttatctttttttt...
.
.
.

Hope this works!
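The one-record-per-line layout suggested above can be checked before loading the file. The following is a minimal sketch in Python, not part of PhyNEST; the function name check_sequential_phylip is hypothetical, and it assumes taxon names contain no whitespace and that each taxon occupies exactly one line.

```python
def check_sequential_phylip(text):
    """Return a list of problems found in a sequential PHYLIP string.

    Assumption: one taxon per line, written as "<name> <sequence>",
    with no whitespace inside the name.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nsites = (int(x) for x in lines[0].split())
    records = lines[1:]
    problems = []
    if len(records) != ntax:
        problems.append(f"header says {ntax} taxa, found {len(records)} lines")
    for i, ln in enumerate(records, start=1):
        parts = ln.split(None, 1)
        if len(parts) != 2:
            # Typically a wrapped continuation line with no name in front
            problems.append(f"line {i}: no sequence after the name")
            continue
        name, seq = parts
        seq = "".join(seq.split())  # drop any spaces inside the sequence
        if len(seq) != nsites:
            problems.append(
                f"taxon {i} ({name}): length {len(seq)}, expected {nsites}"
            )
    return problems
```

An empty list means the header counts and every sequence length agree. A sequence wrapped over several lines shows up as extra lines and length mismatches.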

dylanHco commented 1 month ago

So, I did not get an error message; the file just was not loading, and the progress bar did not seem to make any progress when I displayed it. It did work when I loaded some example data (Vanderpool2020.phy).

I used ape with this call, write.dna(mydata, "way2.phy", format="interleaved", nbcol=-1,colsep=""), to convert a fasta alignment to sequential phylip. I also saw there were extra spaces after the names in my file; I have since deleted those and replaced them with a tab between name and sequence. The progress bar still does not move at all. Do you think the sequences need to be upper case? How would you go about converting a fasta file to sequential phylip?
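One way to do the fasta-to-sequential-phylip conversion without ape is sketched below in Python. This is an illustration only: fasta_to_sequential_phylip is a hypothetical helper, it keeps just the first word of each fasta header as the taxon name, and it leaves letter case unchanged because the thread does not say whether PhyNEST requires upper case.

```python
def fasta_to_sequential_phylip(fasta_text):
    """Convert an aligned FASTA string to sequential PHYLIP (one line per taxon)."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: the taxon name is the first word of the header
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if name is not None:
        records.append((name, "".join(chunks)))
    if not records:
        raise ValueError("no FASTA records found")

    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; input is not an alignment")
    nsites = lengths.pop()

    # Header line: number of taxa, then number of sites
    out = [f"{len(records)} {nsites}"]
    out += [f"{n} {s}" for n, s in records]
    return "\n".join(out) + "\n"
```

Reading the input with open(path).read() and writing the returned string to a .phy file gives one unwrapped line per taxon, which is the layout suggested earlier in the thread.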

sungsik-kong commented 1 month ago

Oh, I understand. I think the dataset matrix is quite large to process, so it is just taking a long time to read in. Could you consider cutting the number of taxa down to, say, around 20, and see if it works? The network analysis for 20 taxa is already computationally heavy to estimate!

dylanHco commented 1 month ago

Yes, you are right! It's loading, but very slowly... so definitely too large. Do you think reducing the amount of DNA sequence would also work as an alternative to dropping samples?

sungsik-kong commented 1 month ago

Unfortunately, not really. As far as I know, the estimation does not scale well with the number of tips, but having more (informative) sites is actually better for the accuracy.
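Dropping tips while keeping every site, as recommended here, can be done directly on a sequential phylip file. The sketch below is in Python and is not part of PhyNEST; subset_phylip_taxa is a hypothetical helper that assumes one taxon per line and names without whitespace.

```python
def subset_phylip_taxa(text, keep):
    """Keep only the named taxa in a sequential PHYLIP string; sites are untouched."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    nsites = lines[0].split()[1]
    keep = set(keep)

    # The taxon name is the first whitespace-delimited token on each line
    kept = [ln for ln in lines[1:] if ln.split(None, 1)[0] in keep]

    found = {ln.split(None, 1)[0] for ln in kept}
    missing = keep - found
    if missing:
        raise KeyError(f"taxa not found in alignment: {sorted(missing)}")

    # Rewrite the header so the taxon count matches the reduced file
    return "\n".join([f"{len(kept)} {nsites}"] + kept) + "\n"
```

The header's taxon count is rewritten to match the reduced file, while the site count is carried over unchanged.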

dylanHco commented 1 month ago

Makes sense. I have been able to reduce my dataset to just 22 tips, with the same number of characters as before, so now I can load the data. However, I have not been able to complete any runs so far. Do you have any advice on how to submit a slurm job? I have tried to use interactive jobs for parallelization, but nothing has finished yet or I get timed out. I have tried up to 60 cores with just h=1 and only 1 run. I have mostly tried the hill climbing method, but am now trying simulated annealing.

sungsik-kong commented 1 month ago

Oh yes, I think it is still quite heavy to run with 22 tips. The current version parallelizes independent runs, so having 60 cores won't speed things up if you have only 1 run. Considering you expect h=1, could you cut the taxa down even further, removing the species that you are sure won't be involved in the hybridization? If you need all species in the final network, it will just take a long time. To give you a sense, network analysis can easily take days (or weeks!) for a large dataset. (ONLY) as a sanity check, if you want to see whether the run is working at all, maybe you can add an option when running phyne! that will end the run prematurely. This might give you an idea of how long a full run will take. However, do not use a small value in the real analysis, because it will hurt the accuracy!

dylanHco commented 1 month ago

Thank you for the suggestions and advice! I will definitely try to subset the data to get down to a minimum of about 20 tips. I have at least 2 tips per species (except the outgroup), and I am not sure how rampant hybridization is within this group, whether it occurs at all, or whether my data is powerful enough to detect it. I tried Dsuite and HyDe but did not get a strong hybrid signal, even though one of my samples is supposedly a horticultural hybrid. I included more than one tip per species because of the one known hybrid. Do you know of instances where HyDe/ABBA-BABA is unable to detect hybrids and network analyses are? Or do both types of tests tend to pick up on hybrid signal? Thanks for your suggestions and input!

sungsik-kong commented 1 month ago

You're welcome! I believe PhyNEST should detect the signal as well as, if not better than, HyDe, since PhyNEST uses all fifteen quartet site patterns for its composite likelihood computation, whereas HyDe detects hybridization using three quartet site pattern frequencies (although HyDe is one of the most reliable hybrid detection methods available). However, I don't think I have seen instances where HyDe/ABBA-BABA was unable to detect hybridization where network analysis could. Maybe you could also try a global test, https://github.com/rhaque62/pyghdet, to see if network analysis is necessary for your dataset!
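To make "fifteen quartet site patterns" concrete: for four taxa, each alignment column can be classified by which taxa share a base, and there are exactly 15 such classes (the number of ways to partition four items). The Python sketch below counts them. It is an illustration only, not PhyNEST's or HyDe's code; the labels (xxxx, xxyy, ...) are this sketch's own canonical naming, and columns containing gaps or ambiguity codes are skipped as a simplifying assumption.

```python
from collections import Counter


def pattern_class(column):
    """Map a 4-base column such as 'ACCA' to a class label such as 'xyyx'."""
    labels = {}
    out = []
    for base in column:
        if base not in labels:
            # First new base becomes x, the next y, then z, then w
            labels[base] = "xyzw"[len(labels)]
        out.append(labels[base])
    return "".join(out)


def count_quartet_patterns(seqs):
    """Count site-pattern classes for four aligned sequences, in the given order."""
    if len(seqs) != 4:
        raise ValueError("exactly four sequences are required")
    seqs = [s.upper() for s in seqs]
    counts = Counter()
    for column in zip(*seqs):
        # Assumption: skip columns with gaps or ambiguous bases
        if all(b in "ACGT" for b in column):
            counts[pattern_class(column)] += 1
    return counts
```

The class of a column depends on the order in which the four taxa are given, so counts for a quartet are only comparable when that order is held fixed.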