petrelharp / local_pca

Methods for examining PCA locally along the genome.
71 stars 13 forks source link

Displaying results error #25

Closed sjfleck closed 3 years ago

sjfleck commented 3 years ago

I'm using local_pca with 41 closely related species. The reference that everything is mapped to has 32 chromosomes, so I used tabix to divide the vcf file into separate files for each of the largest scaffolds (i.e.) tabix -h selectedSNPswREFSNPOnly.vcf.gz HiC_scaffold_1 > selectedSNPswREFSNPOnly.HiC_scaffold_1.vcf

next I converted the vcf files to bcf (i.e.): bcftools convert -O b selectedSNPswREFSNPOnly.HiC_scaffold_1.vcf > selectedSNPswREFSNPOnly.HiC_scaffold_1.bcf

and finally, I indexed the files (i.e.): bcftools index selectedSNPswREFSNPOnly.HiC_scaffold_1.bcf

I now had 32 bcf and bcf.csi files in my data directory. I also created a sample_info.tsv file with only "ID" and "population" columns and put it in the data directory. As a test, I decided to work with the four smallest bcf files, representing snps from scaffolds 29, 30, 31, and 32.

Previously, I was working with the medicago data and input my data instead, so here are the stats for these four scaffolds:

HiC_scaffold_29 HiC_scaffold_30 nsnps 1,945,193 2,044,694 nbp.HiC_scaffold_1 28266565 27710915 spacing.mean 14 13 spacing.5% 1 1 spacing.25% 2 2 spacing.50% 4 4 spacing.75% 10 10 spacing.95% 43 43 spacing.max 23964 23423

HiC_scaffold_31 HiC_scaffold_32
nsnps 1,660,526 1,496,728 nbp.HiC_scaffold_1 26516423 23710698 spacing.mean 15 15 spacing.5% 1 1 spacing.25% 2 2 spacing.50% 5 5 spacing.75% 12 11 spacing.95% 54 47 spacing.max 13397 16212

In the Medicago example, i think -s of 5,000 was chosen because there was 5 million snps in chromosome 1. I chose an -s of 1,500 because there were roughly 1.5 million snps in my smallest scaffold. Here was my command: /Applications/local_pca/templated/run_lostruct.R -i data1 -t snp -s 1500 -I data1/sample_info_genus.tsv

and to visualize my results, I ran this command: Rscript -e 'templater::render_template("/Applications/local_pca/templated/summarize_run.Rmd",output="lostruct_results/type_snp_size_1500_weights_none_jobid_023972/run_summary.html",change.rootdir=TRUE)'

For some reason, each "plot_corner_pca" pdf only show 3 plots for my results even though I'm working with 4 chromosomes/scaffolds. Additionally, mds_pairplot-1.png is blank and a pdf was not created for that plot. I will share the html file here so you can take a look at my results. Any help would be greatly appreciated. Thanks

file:///Users/albertlab/Desktop/Tembusu/local_PCA/Tembusu/lostruct_results/type_snp_size_1500_weights_none_jobid_023972/run_summary.html

run_summary.html.txt

petrelharp commented 3 years ago

Hm, let's see. Thanks for the report! That sounds like a fun example.

For some reason, each "plot_corner_pca" pdf only show 3 plots for my results even though I'm working with 4 chromosomes/scaffolds

Ah-ha. This is because each plot_corner_pca_X.pdf plot has one plot for each corner, not each chromosome. if you want to make plots separately for your separate chromsomes (which will do MDS separately on each chromosome) then you need to run lostruct separately on each.

Additionally, mds_pairplot-1.png is blank

Ah-ha. This is another problem stemming from Rv3 changing the default value of stringsAsFactors. I've fixed this up so if you re-install (or just make these changes) then it should work fine.

BTW, that same plot you're missing appears as part of the next plot, just not colored by scaffold: Screenshot from 2021-05-20 13-51-54

petrelharp commented 3 years ago

I'll close this issue, but if I haven't actually fixed it then feel free to re-open (or just reply, I'll see it).

sjfleck commented 3 years ago

Thank you for your quick response. Do you have any recommendations for choosing a value for the -s option? As I said in my first comment, I chose my value based on the number of snps in my smallest chromosome when I was running multiple chromosomes together. I'm not sure if it was a good decision.

Lastly, I did want to compare different chromosomes. When it was done with D. melanogaster in the Li & Ralph (2019) paper, they made 1 window for each chromosome. My Chr1 has 3,495,179 snps and Chr32 has 1,496,728 snps. I see now that I have to run each chromosome seperately, so would I just set -s to 3495179 for chr1 and -s 1496728 for chr32? Thanks and I'm sure we can keep this closed

petrelharp commented 3 years ago

Do you have any recommendations for choosing a value for the -s option?

What you did to pick a window size sounds perfect! I'd also recommend doing it with smaller and larger windows (eg. 1/10th as big and 10x as big) to see if you get anything different.

Lastly, I did want to compare different chromosomes. When it was done with D. melanogaster in the Li & Ralph (2019) paper, they made 1 window for each chromosome.

I'm not sure if this is what you mean, but: what we did to make this figure was to run lostruct separately on each chromosome, with a window size of 1000 SNPs: Screenshot from 2021-05-21 08-29-17 Each point in each panel is one window - there's around 2500 windows per chromosome. The reason for doing this is that several of the chromosomes have an inversion, so looking at the chromosome individually helps the inversion show up in the PC plots (without being confused by other inversions). To do this with run_lostruct.R, you need to put the bcf files for each chromosome in separate directories, and run lostruct on each.

sjfleck commented 3 years ago

I apologize for the mixup! I read the figure text wrong. It said: "In all plots, each point represents one window along the genome"

and I somehow read that as : "In all plots, each PLOT represents one window along the genome."

What a mistake to make. Thanks for correcting me and thanks for all the help!

petrelharp commented 3 years ago

No problem! Good luck!

sjfleck commented 3 years ago

I have one final question regarding the output. I'm not seeing the % of variance that's explained by each PC. Am I missing it, or there not an option for it to be calculated? Thanks

petrelharp commented 3 years ago

Hm, good question. Perhaps it's not in the run summary, but the first column in each .pca.csv file is the percent variation explained by the PCs, per window (see the README in the templated/ directory) - it would be nice to add some summary of that to the report.

sjfleck commented 3 years ago

the description isn't very long, so I just want to make sure I understand the details

chr1.pca.csv - the percent variation explained and top two eigenvalues and eigenvectors of the windows, one window per row - output by eigen_windows().

My chromosome 01 has 2330 windows that have 1500 snps each. That's confirmed by the number of rows in my "HiC_scaffold_1.pca.csv" file. That file has a lot of columns: total, lam_1, lam_2, PC1 for all 41 spp, PC2 for all 41 spp

I get that the total column is the percent variation for PC1 and PC2 combined for each of the 2330 windows(rows). This data is used in step 1-4 and some of those windows are pulled out (corner 1, 2, and 3) for some descriptive PCAs in step 5.

The PCAs in step 5 are the ones I'm wondering about. I don't know which of the 2330 windows were used to generate those PCAs and I'm not sure how to report how much variation PC1 and PC2 explain for plot_corner_pca-1_1_2.pdf, for example. I'll attach a slide compiling some of the outputs that I'm showing to my lab.

thanks for any clarification on this

Local_PCA_Tembusu_chr1.pdf

petrelharp commented 3 years ago

The PC plots produced in step 5 (the last ones in the templated report) are produced from all the windows falling in each "corner". So, for instance, those in the "corner 1" column are found by:

a. taking all the windows in corner 1 (the green points, if your figure is labeled right - I've not double-checked) b. doing PCA on the genotype matrix obtained by putting these windows together.

The idea is that this should give you a less noisy idea of what's going on in those windows than looking at a bunch of them separately.

Looking at your plots, it looks to me like the difference between the corners is how the triangles, crosses, and diamonds relate to each other; in the centromere the crosses and diamonds are not very differentiated but the triangles are pretty different; while in the two arms the groups are more different. It's also interesting to note how the red circle changes relative position. I'm not sure if I'd read too much into the position of L. crenulata - it looks like it's always sitting in the middle, which probably means that it's just different from everyone else, but since there's only one of it, that difference isn't enough to drive a separate PC (although maybe you'd see it separating out on some higher PC).

sjfleck commented 3 years ago

Hi, I wanted to check back in with a result that I have. I took your recommendation and choose window size 10x as large and 1/10th the size. I am seeing some differences.

This example is for a different group of species than I was look at before. Last time I was looking at 41 closely related samples/species and this time I'm looking at 263 closely related samples/species. This group has fewer SNPs/chromosome. I use the same logic for choosing window size as before, the chromosome with the smallest number of SNPs had 145K, so I chose a window size of 145 (giving about ~1000 SNPs/window) for all 11 chromosomes. Was keeping the window size the same for all of them the right thing to do? For example, Chromosome 1 has 145K SNPs (the least out of all chromosomes), but Chromosome 6 has 331K SNPs (the most out of all chromosomes). Should I be changing the window size for each chromosome, or should I keep that window the same for all 11?

After looking at those results and noting which chromosomes shows PCA plots that differed from the majority, I ran Local_PCA again with a window size of 15 and a window size of 1450. I know the paper says that shorter windows allow better resolution along the genome, but provide less precise estimates of relatedness. Even still, I'm not sure what to make of these results.

lcl_scaffold_1_window_size_comparison.pdf

Window sizes of 145 and 1450 show similar results- that Chromosome 1 shows expected patterns of relatedness. On the other hand, a window size of 15 shows heavy overlapping between each group in windows 1 and 3. Since smaller windows are less reliable in terms of relatedness, do I ignore this result? what is the purpose of it, or what is it telling me. Any help on this would help greatly. Thank you again.

petrelharp commented 3 years ago

Was keeping the window size the same for all of them the right thing to do?

That makes sense to me - it makes the results on different chromosomes comparable.

On the other hand, a window size of 15 shows heavy overlapping between each group in windows 1 and 3.

Ah, that's a nice puzzle! I suspect that what this is telling you is that in the windows in groups 1 and 3, there is variation within the "orange triangles" but not much within the other samples. It's not surprising that this might happen, since you've got a lot more orange triangles than others, and 15 SNPs is pretty short. Even if some of the other samples - say, the green x's - are more polymorphic than the orange triangles, the fact there are so many more triangles makes it easier for them to grab a PC. (And, I suspect also that the pink circles have less diversity than the orange triangles?)

So, I think yes, it's not telling you anything particularly useful (unless that explanation is surprising to you).

sjfleck commented 3 years ago

Thank you again for the rapid reply and thanks for this tool. I look forward to citing your paper in at least a couple publications.

petrelharp commented 3 years ago

Thanks for showing me some interesting examples!