Core gene sweeps module

thr44pw00d commented 4 years ago

Hi, thanks a lot for providing this excellent software, very useful!

I’d have a few questions about the core_gene_sweeps module. Hope it’s ok that I put them together in one issue here.

1) Empty directory project_dir/align/trees/ After running scripts 1-7 I noticed this directory was empty, and I was wondering if maybe some script did not finish? The trees are present in a file project_dir/align/phy_split/output_prefix.phy_phyml_tree.txt though, so not sure if anything is missing.

2) Re-running core_gene_sweeps with changed focus population Under usage it says:

For each population for which you wish to find sweeps, change the focus_population parameter and re-run scripts 3-7.

As script 3 (calculating the trees) was rather time-conusming, I was wondering if it’s necessary to run script 3 again, or if eventually the trees from the previous run could be used?

3) Generating output as given in Figure 5 (B and C) and Figure 6 in the paper Are you planning to include scripts that generate this output (i.e. between population Pi in the sweep regions, Fst values, SNPs in a sliding window and trees for sweep and flanking regions) in the future, or eventually some of this can already be extracted directly from the output but I missed it?

4) Very minor: If I’m not mistaken, script 4 writes phybreak.leafdist_compare.R into the project_dir but then calls it from the working directory (PopCOGenT/src/core_gene_sweeps/).

Thank you! Matthias

philarevalo commented 4 years ago

Hi Matthias,

Thanks for your comments!

1) I will look into this empty directory problem but I think @davevanins may have better insight into that.

2) I think right now it is unfortunately necessary to rerun the script, but removing this inefficiency is definitely on our to-do list.

3) At present we're not planning to release code to replicate those figures. I believe the SNPs in a sliding window are not currently extractable from the output as-is, but the Fst and Pi values should be. I will look into this in the next couple of days.

4) Thanks for pointing this out! This should be an easy fix.

thr44pw00d commented 4 years ago

Hi Phil, thank you for the response!

Regarding the empty "trees" directory, I saw that the following three lines are outcommented in script 4 (lines 153-155).

    # tree_outfile = open(tree_dir+str(tree_num)+".nwk","w")
    # tree_outfile.write(tree_string+"\n")
    # tree_outfile.close()

Maybe the trees could be written to files if wanted but it's not necessary to do so?

davevanins commented 4 years ago

Hi Matthias,

Thanks for pointing that out. During development I created that directory to test different window sizes, overlap lengths, etc.. to see how they affected tree topologies through different regions of the genome by writing them out to individual files. We'll just have to remove the lines that create the directory.
When re-running the scripts you can copy all of the files in the /align/ directory from the previous run as long as you have the same output prefix between runs. Otherwise you could rename those files to have the appropriate output prefix, or just re-run everything.

Like Phil said, we are working on improvements in a new version, particularly for efficiency. So thank you for letting us know that you would also be interested in the functionality of replicating Figure 5 (B and C) and Figure 6 in our paper, we'll look into adding a tool specifically for that in our new version.

Cheers, Dave

thr44pw00d commented 4 years ago

Alright, thank you Dave!

Xiaojun928 commented 1 year ago

Hi Dave and Matthias,

For the Q2,

As script 3 (calculating the trees) was rather time-conusming, I was wondering if it’s necessary to run script 3 again, or if eventually the trees from the previous run could be used?

I think it seems not necessary to run script 3 again, as the parameter focus_population was not used in script 3 phybreak3.MSAsubset_runPhyML.py. Though focus_population was present in lines 14, 47, 48, 91, the goal of script 3 is to run phyML to generate trees for each block, where all members of a certain main cluster were considered. Besides, focus_population has no effect on the final output as it was only used for generate the name of summary_file_name, which is not the output of this step.

Please correct me if I am wrong. Thanks a lot!

Cheers, Xiaojun

davevanins commented 1 year ago

Hello,

No, it’s not necessary to re-run them for each focus population. It’s basically a quirk of how I initially developed/tested the scripts on a standard population. I know it’s, dumb. Due to time constraints, when it came time to run analyses for the paper I opted to copy the necessary files into each focus population’s working directory instead of restructuring how all the directories are organized and troubleshoot it. I have another version that I wrote after finishing my PhD that is a lot better, and I can share that, but it doesn’t have a readme and I won’t have time to help make it work. Is that something you’d like?

Cheers, Dave

On Sun, Jan 22, 2023 at 6:51 PM Xiaojun928 @.***> wrote:

Hi Dave and Matthias,

For the Q2,

As script 3 (calculating the trees) was rather time-conusming, I was wondering if it’s necessary to run script 3 again, or if eventually the trees from the previous run could be used?

I think it seems not necessary to run script 3 again, as the parameter focus_population was not used in script 3 phybreak3.MSAsubset_runPhyML.py. Though focus_population was present in lines 14, 47, 48, 91, the goal of script 3 is to run phyML to generate trees for each block, where all members of a certain main cluster were considered. Besides, focus_population has no effect on the final output as it was only used for generate the name of summary_file_name, which is not the output of this step.

Please correct me if I am wrong. Thanks a lot!

Cheers, Xiaojun

— Reply to this email directly, view it on GitHub https://github.com/philarevalo/PopCOGenT/issues/8#issuecomment-1399644850, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANIVBTRSLEA3UGP4K6LRZETWTXBX3ANCNFSM4JMBJ5KQ . You are receiving this because you were mentioned.Message ID: @.***>

Xiaojun928 commented 1 year ago

Hi Dave,

Thanks for your reply. I think it's ok for me to use the current version.

Another thing I want to make sure of is constructing phylogenetic trees using phyml in phybreak3.MSAsubset_runPhyML.py.

The phy_command of phybreak3.MSAsubset_runPhyML.py in my case is like below:

/home-user/software/PhyML-3.1/PhyML-3.1_linux64 -i /mnt/ivy/core_sweep/align/phy_split/PB_MC1.phy -n 68088 -q -m JC69 -f e -c 2 -a 0.022 > /mnt/ivy/core_sweep/align/phy_split/PB_MC1.phy_phyml_stat.txt

Since the total number (68,088) of trees is too big, I tried mpirun -n 32 with phyml. However, it may take about one year to finish. So I separated the PB_MC1.phy into 200+ files, each with 300 alignments, and using mpirun -n 32 phyml for each separated file, which takes much shoter time. Finally, I merged the output trees (*phy_phyml_tree.txt), output *phy_phyml_stat.txt of each file into one file, respectively.

I was wondering if the merged files are appropriate for the following steps. Your suggestions are well appreciated.

Cheers, Xiaojun

WenxinHe616 commented 5 months ago

Hi Dave, Is there a new version of PopCOGenT? @davevanins Now that I have run through the three modules of this software, however, I have encountered some problems. For the core gene sweeps module, using the genome of Ruminococcus in PopCOGenT-master/src/flexible_genome_sweeps/input/genomes/Ruminococcus/, I was unable to reproduce the results in the article and I could not correspond my output results to the supplementary tables. Of course, I hope to conduct this step of analysis on my own data in the future. I think the core gene sweeps module of this software needs further description, especially the code sharing of the complete process of testing data and the interpretation of the results.

Thanks Wenxin

davevanins commented 5 months ago

Hello,

As far as I'm aware, there isn't a new version of popcogent. As for the reproducibility, without knowing more about what in particular didn't match between your output and the output from the paper it's hard to know what the cause is. One thing is that MAFFT will occasionally need to make decisions (such as where to split alignment blocks) where all options have the same score, so it makes choices based on random numbers. So, without setting a seed value for the random number generator (something I don't think MAFFT provides as an option), the output will be subtly different every time it's run, but the biological meaning of the results will be practically identical. Subtle differences in the final output was something that we ran into ourselves while writing the paper, but the location of the sweeps we detected was very consistent despite the boundaries of alignment blocks and sweeps being somewhat different.

The other thing that is important is that all of the programs that the core sweep detection part of popcogent uses have gone through large version updates since we wrote it, MAFFT in particular. So, it's possible that the version of MAFFT you're using is different enough that the output is no longer subtly different (as it was run-to-run due to random number seed differences), and this is making it harder to reproduce our analysis. I wouldn't recommend downgrading, given the huge improvements to MAFFT, but it's very likely that substantial improvements to the method are possible now.

Best, Dave

On Mon, Mar 25, 2024 at 11:27 PM WenxinHe616 @.***> wrote:

Hi Dave, Is there a new version of PopCOGenT? @davevanins https://github.com/davevanins Now that I have run through the three modules of this software, however, I have encountered some problems. For the core gene sweeps module, using the genome of Ruminococcus, I was unable to reproduce the results in the article https://www.sciencedirect.com/science/article/pii/S0092867419307366?via%3Dihub#app2because I could not correspond my output results to the supplementary tables. Of course, I hope to conduct this step of analysis on my own data in the future. I think the core gene sweeps module of this software needs further description, especially the code sharing of the complete process of testing data and the interpretation of the results.

Thanks Wenxin

— Reply to this email directly, view it on GitHub https://github.com/philarevalo/PopCOGenT/issues/8#issuecomment-2019316509, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANIVBTVDSWS6WZ4K4S6WMALY2DTLXAVCNFSM4JMBJ5K2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBRHEZTCNRVGA4Q . You are receiving this because you were mentioned.Message ID: @.***>

philarevalo / PopCOGenT

Core gene sweeps module #8