meyer-lab-cshl / plinkQC

R package for quality control of plink genetic datasets
Other
55 stars 28 forks source link

Shifted PCA plots with respect to reference sample #25

Closed HannahVMeyer closed 4 years ago

HannahVMeyer commented 4 years ago

Hi Just to update you, the relatedness check worked and I think it just needed more time than I had initially thought to complete (It completed in 90 hours). In the end, 1756 individuals were in the fail-IBDs text file.

I'd like to ask another question (sorry it's my first time doing the QC). I generated a PCA plot with the 1000 genomes phase 3 data and did the preprocessing in this link (https://cran.r-project.org/web/packages/plinkQC/vignettes/Genomes1000.pdf) but my plot looks rotated. Can you suggest a reason this may happen? p_ancestry

Many thanks

Originally posted by @rskl92 in https://github.com/meyer-lab-cshl/plinkQC/issues/23#issuecomment-634227553

HannahVMeyer commented 4 years ago

Regarding your question about the rotation of the plot: The 'rotation of the plot' stems from the sign of the genetic principal components. When doing principal component analysis, you do a matrix decomposition of your data, into the principal components and their loadings (roughly speaking, the loadings tell you how much each individual input genetic variant contributed to the respective principal component). Approximately, you can think of this as your data being described as

data = principal components * loadings

Looking at one component, in your case the flipped PC1 component: PC1 loadings = -PC1 -loadings. So, it doesn't matter if PC1 has a positive or negative sign, as the corresponding loadings change accordingly; to make it short, it doesn't matter for the analysis.

If you care for, maybe consistency across different datasets, you could check of PC1 of the EU samples is negative and if so, multiply all of PC1 with -1.

There is a great explanation for the signs of the principal components and their associated loadings in this stack exchange post: https://stats.stackexchange.com/questions/88880/does-the-sign-of-scores-or-of-loadings-in-pca-or-fa-have-a-meaning-may-i-revers?noredirect=1&lq=1

HannahVMeyer commented 4 years ago

As described, the rotation above does not worry me, however, there seems to be a shift in sample and reference populations? Did you colour code your sample the same as the European samples of the reference population ie light blue?

I think this issue might be related to issue #24.

I will include an additional recommended step for the processing of the reference sample in the vignette. Could you try including this step and test your ancestry estimation again?

HannahVMeyer commented 4 years ago

I have added the commands for removing the AT/GC SNPs from reference and sample population on the development branch, the vignette is here, specifically the section Filter reference and study data for non A-T or G-C SNPs.

rskl92 commented 4 years ago

Hi, I did the step but not much has changed. Yes, it is really strange. I coded my european reference population as baby blue but I haven't changed the default for the study population. I then edited the command to change the default colour for the study population to see if anything changes but it remains baby blue (like the european population, despite the default being dark blue)

check <- check_ancestry(qcdir, name=name,prefixMergedDataset=prefixMergedDataset,europeanTh=1.5,refSamplesFile=paste(refdir, "/1000g_ID2POP.txt",sep=""),refColorsFile=paste(refdir, "/1000g_PopColors.txt",sep=""),studyColor="#b62c7b",run.check_ancestry=FALSE,interactive=FALSE,verbose=TRUE,showPlinkOutput=TRUE)

All of the IDs in my data fail the European threshold check.

HannahVMeyer commented 4 years ago

Can you shared the .eigenvec file for the merged dataset?

On 29 May 2020, at 04:56, Roxanna KL notifications@github.com wrote:

 Hi, I did the step but not much has changed. Yes, it is really strange. I coded my european reference population as baby blue but I haven't changed the default for the study population. I then edited the command to change the default colour for the study population to see if anything changes but it remains baby blue (like the european population, despite the default being dark blue)

check <- check_ancestry(qcdir, name=name,prefixMergedDataset=prefixMergedDataset,europeanTh=1.5,refSamplesFile=paste(refdir, "/1000g_ID2POP.txt",sep=""),refColorsFile=paste(refdir, "/1000g_PopColors.txt",sep=""),studyColor="#b62c7b",run.check_ancestry=FALSE,interactive=FALSE,verbose=TRUE,showPlinkOutput=TRUE)

All of the IDs in my data fail the European threshold check.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

HannahVMeyer commented 4 years ago

I fixed the issues with the colour of the study samples:

image

HannahVMeyer commented 4 years ago

There still remains the issue of the shift in PC2.

Can you check how many SNPs were excluded in each of the steps outlined in the Ancestry estimation vignette?

Have you checked the log files for each of these steps for potential issues? all relevant files should be in $qcdir/plink_log. Could you share those with me?

HannahVMeyer commented 4 years ago

Do you have many duplicate IDs in your datasets?

from all_phase3.no_ac_gt_snps.log:

Warning: At least 308294 duplicate IDs in --exclude file.

from ABCD_data_with_sex.pruned.log:

Warning: At least 2775 duplicate IDs in --extract file.

And in ABCD_data_with_sex.merge.all_phase3.log:

Warning: Multiple positions seen for variant '1_Physical.Position_-_T'.
Warning: Multiple positions seen for variant '1_Physical.Position_-_A'.
Warning: Multiple positions seen for variant '1_Physical.Position_G_A'.
Warning: Multiple positions seen for variant '1_Physical.Position_A_G'.
Warning: Multiple positions seen for variant '1_Physical.Position_T_G'.
Warning: Multiple positions seen for variant '1_Physical.Position_A_G'.
rskl92 commented 4 years ago

So, the all_phase3.no_ac_gt_snps.log excludes the SNPs that AC GT (files $refname.ac_gt_snps) . I have checked the .fam file of this and there are 2,504 distinct (i.e. sample of 1000g phase 3) so I am confused as to why it would say 308,294.... ABCD_data_with_sex.pruned.log is the step where we extract the pruned SNPs. There are indeed 2,775 duplicated SNPs. As for the last message - this nomenclature '1Physical.Position-_T' appears in the study dataset. There these SNPs that have these strange names but I didn't think they would affect the analysis anyway as the ones that have complete data are not duplicate (also they are not in 1000 genomes dataset). For example, in the plink log it says '1_Physical.Position_T_G' is duplicated but when I grepped for it in the genetic dataset of ABCD , there was one entry. Do you think it'd be best to exclude these "1_Physical.Position***" IDS?

h <- bfile[grep("1Physical.Position-_T",bfile$V2),] h V1 V2 V3 V4 V5 V6 1: 1 1Physical.Position-_T 0 1737914 T - 2: 11 11Physical.Position-_T 0 2196232 - T 3: 21 21Physical.Position-_T 0 28951038 - T h <- bfile[grep("1_Physical.Position_A_C",bfile$V2),] h V1 V2 V3 V4 V5 V6 1: 1 1_Physical.Position_A_C 0 7928917 A C 2: 11 11_Physical.Position_A_C 0 9749645 C A 3: 21 21_Physical.Position_A_C 0 28568183 A C h <- bfile[grep("1_Physical.Position_A_G",bfile$V2),] h V1 V2 V3 V4 V5 V6 1: 1 1_Physical.Position_A_G 0 3058249 G A 2: 11 11_Physical.Position_A_G 0 309127 G A 3: 21 21_Physical.Position_A_G 0 30948745 G A

HannahVMeyer commented 4 years ago

The duplicate IDs refers to duplicate SNP IDs; It strikes me as odd, that there would be that many duplicate SNPs. How did you generate the reference sample?

Can you check these in both your reference and study samples (replace data with the respective prefixes) by running this on the command line:

cut -f 2 data.bim |sort | uniq | wc -l
sort data.bim | uniq | wc -l
wc -l data.bim
rskl92 commented 4 years ago

The link for the pvar in this vignette (https://cran.r-project.org/web/packages/plinkQC/vignettes/Genomes1000.pdf) could not be found (I think it may have been blocked) so I went to this plink webpage (https://www.cog-genomics.org/plink/2.0/resources) and I downloaded the following: Common sample information file (not for chrY/chrM): original, pedigree-corrected Merged dataset (excludes extra chrM samples, requires 12 GiB RAM to work with): .pgen.zst (2.25 GiB), .pvar.zst (1.26 GiB),

~$$ cut -f 2 ~/ABCD_cleaning/reference/all_phase3.bim" | sort | uniq | wc -l 82305201 $$ sort ~/ABCD_cleaning/reference/all_phase3.bim | uniq | wc -l 84358426 $$ wc -l ~ABCD_cleaning/reference/all_phase3.bim 84358431

Thanks so much for your time!

HannahVMeyer commented 4 years ago

From the output above:

There are a total of 84358431 variants in your file, 5 of which are exact copies of one another (84358431-84358426). Out of these, there are only 82305201 biallelic variant, ie variant for which the rsID describes a variant with only two possible alleles; all other variants either have multiple possible alleles, which will be listed on different lines in the .bim file or possibly even different annotations. This is most likely not inly the case in the reference, but also in your sample, as indicated by one of the warning messages I pasted above. This line appears multiple times:

Warning: Multiple positions seen for variant '1_Physical.Position_A_G'.

For QC relying on individual variants, those do not matter but in analysis across variant like these ones, we run into issues.

Can you run the commands above for your data set as well:

cut -f 2 data.bim |sort | uniq | wc -l
sort data.bim | uniq | wc -l
wc -l data.bim

I would then suggest, before any other filtering or merging, to remove any duplicate variants for ancestry check:

awk 'BEGIN {OFS="\t"}  a[$2]++; (a[$2] == 2) {print $2}' \
    $refdir/$refname.bim  > \
    $qcdir/$refname.duplicate_variants

plink --bfile  $refdir/$refname \
      --exclude $qcdir/$refname.ac_gt_snps \
      --make-bed \
      --out $qcdir/$refname.no_duplicate_variants
mv  $qcdir/$refname.no_duplicate_variants.log $qcdir/plink_log/$refname.no_duplicate_variants.log

Do this for the reference (as shown above) and for your dataset. Then use qcdir/$refname.no_duplicate_variants and qcdir/$yourdata.no_duplicate_variants as input for the next steps and see if that changes anything. If you still run into issues, please send me the new log files.

rskl92 commented 4 years ago

Hello,

Thanks for the suggestions. I did them, however they still seem to fail the European check. I copied the individualQC.R code to update the code with your commit as when I reinstalled plink QC it said nothing had changed since last install. It displayed the following message could not find function "makepath". I think I managed to bypass this loading the plinkQC library and then copy pasting the bits of the code that had changed. However, he data points for the study data seem to be the same colour as the reference population as in the plot last time.

Many thanks

HannahVMeyer commented 4 years ago

I currently only pushed the changes to development branch of this repository. Install with:

devtools::install_github("meyer-lab-cshl/plinkQC", ref="dev")

then reload and rerun your analysis.

From the log files you sent, it seems like the warnings about the duplicate names and positions are gone, which is good.

One last step that we can try to make sure everything is the same in reference and study sample, is removing any variants from your study sample, after the qc and matching of the reference:

cut -f 2 all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim > $qcdir/variants_to_keep
plink --bfile  $qcdir/$name.pruned \
      --extract $qcdir/variants_to_keep \
      --make-bed \
      --out $qcdir/$name.pruned.reference_matched

followed by the merge and pca:

plink --bfile $qcdir/$name.pruned.reference_matched  \
      --bmerge $qcdir/$refname.clean.bed $qcdir/$refname.clean.bim \
         $qcdir/$refname.clean.fam  \
      --make-bed \
      --out $qcdir/$name.merge.$refname

plink --bfile $qcdir/$name.merge.$refname \
      --pca \
      --out $qcdir/$name.$reference

Try running these steps and re-doing the analysis with the newly installed version of plinkQC.

rskl92 commented 4 years ago

Hi,

Thanks. The plot colours worked. I tried the following code but there still seems to be something going on.

name="ABCD_data_with_sex" refname="all_phase3"

cut -f 2 all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim > $qcdir/variants_to_keep plink --bfile $qcdir/$name.no_duplicate_variants.no_ac_gt_snps.pruned \ --extract $qcdir/variants_to_keep \ --make-bed \ --out $qcdir/$name.pruned.$refname_matched

plink --bfile $qcdir/$name.pruned.$refname_matched \ --bmerge $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bed $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim \ $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.fam \ --make-bed \ --out $qcdir/$name.merge.$refname mv $qcdir/$name.no_duplicate_variants.no_ac_gt_snps.merge.$refname.log $qcdir/plink_log

PCA on the merged data

We can now run principal component analysis on the combined dataset using

plink --pca which returns a .eigenvec file with the family and individual ID

in columns 1 and 2, followed by the first 20 principal components.

plink --bfile $qcdir/ABCD_data_with_sex.merge.all_phase3 \ --pca \ --out $qcdir/$name.merge.$refname mv $qcdir/$name.$refname.log $qcdir/plink_log

ancestry

HannahVMeyer commented 4 years ago

glad to see the devtools install worked on the coloring issue is resolved.

I admit, I am running a bit out of ideas why this shift in PC2 is happening - I have not noticed anything like it before.

Have you checked that all your awk scripts have worked? Do the $qcdir/$name.pruned.$refname_matched.bim and $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim have the same amount of variants?

ie

wc -l $qcdir/$name.pruned.$refname_matched.bim
wc -l $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim

are the same? Could you check if there are any weird patterns of missingness between the two datasets? First use the merged dataset to designate your samples as 'cases', the reference samples as controls via:

plink -bfile  $qcdir/ABCD_data_with_sex.merge.all_phase3 --make-pheno $qcdir/$name.pruned.$refname_matched.fam '*'" --make-bed --out $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno

the test for differential missingness between the cases and controls:

plink -bfile $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno --test-missing --out $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno

Then check the output file $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno.missing for singnificant differential missing variants (i.e. p < 0.001 or so).

rskl92 commented 4 years ago

Thank you so much for the recommendations. The differential missingness was p<0.001 for 108,216 SNPs out of the total 189,337 variants in the files (a lot!). Would the PCA analysis work the same if I removed the SNPs with differential missingness (p<0.001) leaving 81,121 SNPs or would it better to use a different reference panel such as the HRC?

HannahVMeyer commented 4 years ago

Can you run the per individual control without the ancestry estimation and then do the per variant qc? How many samples and variants would pass your filters?

On 1 Jun 2020, at 18:01, Roxanna KL notifications@github.com wrote:

 Thank you so much for the recommendations. The differential missingness was p<0.001 for 108,216 SNPs out of the total 189,337 variants in the files (a lot!). Would the PCA analysis work the same if I removed the SNPs with differential missingness (p<0.001) leaving 81,121 SNPs or would it better to use a different reference panel such as the HRC?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

HannahVMeyer commented 4 years ago

How many differential missing ones are there at a lower p-value threshold like p < 0.00001? You can find them straight away by adding the --pfilter 0.00001 flag to the command.

rskl92 commented 4 years ago

Hi I've submitted the per-individual report to run. I've checked for SNPs at p<0.00001 and there are are 74,401 variants.

HannahVMeyer commented 4 years ago

While you are waiting for your study qc, could you also run both qc step on your processed reference sample ($qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean)? This high degree of differential missingness is surprising to me, and I would suspect either of the merged datasets to have some qc issues. The reference data shouldn't have but maybe something went wrong in the conversion.

rskl92 commented 4 years ago

I did worry that I may have downloaded the wrong files but they seem correct. Do you mean to rerun the steps here https://meyer-lab-cshl.github.io/plinkQC/articles/AncestryCheck.html?

HannahVMeyer commented 4 years ago

No, I mean the standard QC, but on the reference data:

  1. perIndiviudalQC without ancestry check:
    
    # set path to plink software
    path2plink="path/to/your/plink/software"

fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name="all_phase3.no_duplicate_variants.no_ac_gt_snps.clean", dont.check_ancestry =TRUE, path2plink=path2plink, verbose=TRUE, interactive=TRUE) overview_individuals <- overviewPerIndividualQC(fail_individuals, interactive=TRUE)


2. 
```r
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name="all_phase3.no_duplicate_variants.no_ac_gt_snps.clean",
                            path2plink=path2plink,
                            verbose=TRUE, interactive=TRUE)
overview_marker <- overviewPerMarkerQC(fail_markers, interactive=TRUE)
rskl92 commented 4 years ago

Quite a few individuals and markers failed QC. overview_markers overview_individuals

HannahVMeyer commented 4 years ago

Is this your data or the reference data set?

On 2 Jun 2020, at 20:00, Roxanna KL notifications@github.com wrote:  Quite a few individuals and markers failed QC.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

rskl92 commented 4 years ago

its the reference dataset.

rskl92 commented 4 years ago

I ran the following, setting dont.check_relatedness=FALSE, as I had already generated this file (which took a long time) and I've attached the output. I thought that would incorporate the already created relatedness fail IBD file but I don't think it has, as shown in the figures. Is there any way of incorporating already created files into the code? This is QC on the original study genetic dataset.

(https://user-images.githubusercontent.com/30495388/83638286-daa7c580-a5a0-11ea-9725-be1514a21ca5.png) ![overview_individuals_ABCD]

fail_individuals <- perIndividualQC(indir=qcdir, qcdir=qcdir, name="ABCD_data_with_sex", dont.check_ancestry =TRUE,dont.check_relatedness=TRUE, path2plink=path2plink, verbose=TRUE, interactive=TRUE)

ggplot2::ggsave("fail_individuals_ABCD.png")

overview_individuals <- overviewPerIndividualQC(fail_individuals, interactive=FALSE) ggplot2::ggsave("overview_individuals_ABCD.png")

fail_markers_ABCD

fail_markers <- perMarkerQC(indir=qcdir, qcdir=qcdir, name="all_phase3.no_duplicate_variants.no_ac_gt_snps.clean", path2plink=path2plink, verbose=TRUE, interactive=FALSE) ggplot2::ggsave("fail_markers_ABCD.png")

overview_marker <- overviewPerMarkerQC(fail_markers, interactive=FALSE)

ggplot2::ggsave("overview_markers_ABCD.png"

HannahVMeyer commented 4 years ago

no, currently there is no option to provide an already computed list of fail-IDs, but that might be something to consider implementing for the future. But in this case, I was also more interested in individuals failing due to missingness and outlying heterozygosity. Could you show the perIndividualQC results plots as well please?

rskl92 commented 4 years ago

overview_individuals_ABCD

HannahVMeyer commented 4 years ago

just as a point of caution: here, you ran perIndividualQC on the whole dataset, but perMarkerQC on the subset used for the merge only. This might have been accidental, but want to caution you to use these results downstream.

I do not know what is going wrong between your samples and the merge with the reference data. I would recommend for now to try and download the HapMap dataset as described in this vignette and try the merge with these and see if the problem persists.

HannahVMeyer commented 4 years ago

Did you manage to solve this issue? If so, could you let us know what worked? Thanks

rskl92 commented 4 years ago

Sorry - I thought I had updated this. It was due to quite different allele frequencies for some SNPs between the study sample and 1000 genomes. Once, I filtered these out, it worked! Thanks for your help!

HannahVMeyer commented 4 years ago

Could you provide the steps you have taken? It would be useful for other people that come across this thread to see what worked for you. I can also add it to the vignette.

rskl92 commented 4 years ago

I used a effect allele frequency diagnostic plot mapping the effect alllele frequencies of my study dataset and reference dataset.

library(data.table)

eaf_study <-fread("study_freq.frq") eaf_allphase3 <- fread("allphase3_freq.frq")

colnames(eaf_study)[5] <- "freq_study" eaf_study$identifier <- "study" eaf_allphase3$identifier <- "allphase3" colnames(eaf_allphase3)[5] <- "freq_allphase3" data_merge <- merge(eaf_study,eaf_allphase3,by="SNP")

tiff("~qcdir/scatterplot.tiff") plot <- plot(x=data_merge$freq_allphase3, y=data_merge$freq_study,xlab="EAF_1000g",ylab="EAF_study") dev.off()

This made me realise that one step in the pre-processing had not worked due to an error carried forward in the naming :( .

HannahVMeyer commented 4 years ago

Thank you for providing the information. Before I close this issue, can I confirm that once you solved this naming error, you arrived at merged reference set suitable for the analysis based on the vignette I provided? If not, what did you have to add in the vignette? Thanks

rskl92 commented 4 years ago

Yes, everything worked great and sorry for taking up your time on this!

HannahVMeyer commented 4 years ago

Glad you got it sorted in the end. Happy to help.