lh3 / psmc

Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
Other
146 stars 60 forks source link

combining two plots #26

Open davidaray opened 3 years ago

davidaray commented 3 years ago

I have two independently produced plots for two related species and would like to combine them onto a single set of axes for comparison. I see this in multiple manuscripts but I don't know how to do it for my data.

Does anyone have some clear instructions on how to accomplish this?

I've tried just about every variation of psmc_plot.pl -M "geomys=0.1,thomomys=0.2" -u "geomys=0.0000000022,thomomys=0.0000000022" -g "geomys=3,thomomys=2" twogophers geomys.psmc thomomys.psmc possible with no luck.

David

laiqiang0728 commented 3 years ago

Hi, I'm on the same issue, have you got any methods to solve this ? Thanks!

Qiang

davidaray commented 3 years ago

Hi, I'm on the same issue, have you got any methods to solve this ? Thanks!

Qiang

I didn't get anything from the author but I did get some help from another researcher. It involves taking some output data from psmc and then plotting it with Excel or another software package. It's rather easy and I'd be happy to share it with you if you contact me directly.

davidaray commented 3 years ago

For anyone interested, here is a workaround for plotting via psmc directly:

This is the documentation and commands from the script I ran. If you're familiar with linux, you should be able to work out the steps. If you have trouble, let me know and I'll try to explain what I did in each step.

The first step is to run the analysis just once so that you get a single set of data to serve as the main line in the plot.

fq2psmcfa geomys.fq.gz >geomys.single.psmcfa

psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o geomys.single.psmc geomys.psmcfa

psmc_plot.pl -u 2.2e-9 -g 3.08 -R -p geomys.plotsingle.psmc geomys.psmc

Note the -R option. It produces a .txt file with the data to be plotted.

Afterwards, you will do the bootstrap that is described in the psmc documentation.

Split sequences to perform bootstrapping

splitfa $PREFIX".psmcfa" > $PREFIX".split.psmcfa"

Run PSMC bootstrap , using the default options

psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o $PREFIX".psmc" $PREFIX".psmcfa"

psmc -N25 -t15 -r5 -b -p "4+25*2+4+6" -o round-{}.psmc $PREFIX"2"$MAP2".split.psmcfa" | sh

seq 100 | parallel -I% --max-args 1 $PSMC/psmc -N25 -t15 -r5 -b -p "4+25*2+4+6" -o round-%.psmc $PREFIX".split.psmcfa" | sh

cat $PREFIX".psmc" round-*.psmc > $PREFIX".combined.psmc"

rm round-*.psmc

Generate PSMC plot, using the per-generation mutation rate -u and the generation time in years -g for each bootstrap iteration.

psmc_plot.pl -u 2.2e-9 -g $GT -R -p $PREFIX".plot.psmc" $PREFIX".combined.psmc"

Again note the -R option.

Concatenate the .txt files from the bootstrap analysis, leaving one space between each iteration's data.

Use excel to plot the data as described below

I created an excel file that has the concatenated data in columns B-F for taxon 1 and columns I-M for taxon 2. The single-run data are in columns P-T for taxon 1 and W-AA for taxon 2.

  1. Open your excel file and highlight columns B and C.

  2. Click the 'Insert' toolbar and choose 'Recommended Charts'.

  3. Select a 'Scatter' plot.

  4. Right click the x-axis and choose 'Format Axis' change the Axis Options to 'Logarithmic scale'.

  5. Right click any plotted point and choose 'Format Data Series'.

  6. Click the paint can and choose 'Marker' and then 'Marker Options'. Change to 'None'.

  7. Click 'Line' and choose 'Solid line'. Change the transparency and color to whatever suits you. I generally choose a light color.

  8. Right click anywhere in the plotted area and choose 'Select Data'.

  9. Click Series 1 and select 'Edit'. Change the name to 'bootstrap' or whatever you deem appropriate. Don't change any of the other fields.

  10. Click 'Add'. Name the series, 'main' or whatever is appropriate.

  11. Click the Series X box and then select column P.

  12. Click the Series Y box, delete what's there and then select column Q.

  13. Adjust the color for the new line as needed. i usually choose a darker version of the color that I used for the bootstrap.

  14. To add your second data set,, Right click in the plotted area and choose 'Add'. Name the new series 'boostrap' and select column I for Series X and column J for Series Y.

  15. Add the 'main' line as described in steps 10 - 13.

The excel file I created is located here: https://github.com/davidaray/test/blob/master/Pocket_gopher.xlsx

fredjaya commented 3 years ago

I provided a solution on a previous issue which utilises the multiline mode in psmc_plot.pl - might find it helpful.

JuliaLopezDelgado commented 2 years ago

Thanks for the instructions on the combined bootstrap plot @davidaray !

Just one question - is the estimated effective population size (e.g. column C in your data for Geomys) meant to be multiplied by 1,000 (e^03) or by 10,000 (e^04)? And, is this always the case or is it data-specific? I'm unsure about the units in my output .txt file and I cannot find any official documentation about this.

Thanks in advance, Julia

davidaray commented 2 years ago

Thanks for the instructions on the combined bootstrap plot @davidaray !

Just one question - is the estimated effective population size (e.g. column C in your data for Geomys) meant to be multiplied by 1,000 (e^03) or by 10,000 (e^04)? And, is this always the case or is it data-specific? I'm unsure about the units in my output .txt file and I cannot find any official documentation about this.

Thanks in advance, Julia

Julia, I saw this last week and forgot to respond. I apologize. Did you find an answer? I honestly don't remember and our HPCC, where all this work is housed, is down for maintenance this week. I won't be able to access until Monday.

JuliaLopezDelgado commented 2 years ago

Thanks for your reply @davidaray! No, I still am unsure about the psmc output. Any insight would be really appreciated.

g-pacheco commented 2 years ago

Hello,

I also have the .txt files and I have been trying to find a way to plot these results using ggplot. As it has been said, it would be lovely to understand what are each of the 5 columns in the .txt output, and which one and how should be scaled.

I have tried to scale (X 1,000) the fifth column, but my numbers simply do not match... so I suppose things are a bit more complicated than that. It would be great to be able to plot these results with ggplot though.

Has anyone been luckier?

Many thanks in advance, George.

elizeng commented 2 years ago

Thanks @davidaray for the helpful instructions! In the example excel file you have provided, I note that you only use the first 2 columns of the text output which are years from present (X axis) and effective population (Y axis). Any idea what the remaining 3 columns are?

elizeng commented 2 years ago

Hello,

I also have the .txt files and I have been trying to find a way to plot these results using ggplot. As it has been said, it would be lovely to understand what are each of the 5 columns in the .txt output, and which one and how should be scaled.

I have tried to scale (X 1,000) the fifth column, but my numbers simply do not match... so I suppose things are a bit more complicated than that. It would be great to be able to plot these results with ggplot though.

Has anyone been luckier?

Many thanks in advance, George.

You can probably plot it using the first 2 columns years from present (X axis - 1st column) and effective population (Y axis - 2nd column). Not sure what the rest of the columns refer to

linhai1221 commented 2 years ago

include the "-R" in the cmd, and vim the .gp file as you like then run cat sample_plot.gp | gnuplot