Closed HannahVMeyer closed 4 years ago
Regarding your question about the rotation of the plot: The 'rotation of the plot' stems from the sign of the genetic principal components. When doing principal component analysis, you do a matrix decomposition of your data, into the principal components and their loadings (roughly speaking, the loadings tell you how much each individual input genetic variant contributed to the respective principal component). Approximately, you can think of this as your data being described as
data = principal components * loadings
Looking at one component, in your case the flipped PC1 component: PC1 loadings = -PC1 -loadings. So, it doesn't matter if PC1 has a positive or negative sign, as the corresponding loadings change accordingly; to make it short, it doesn't matter for the analysis.
If you care for, maybe consistency across different datasets, you could check of PC1 of the EU samples is negative and if so, multiply all of PC1 with -1.
There is a great explanation for the signs of the principal components and their associated loadings in this stack exchange post: https://stats.stackexchange.com/questions/88880/does-the-sign-of-scores-or-of-loadings-in-pca-or-fa-have-a-meaning-may-i-revers?noredirect=1&lq=1
As described, the rotation above does not worry me, however, there seems to be a shift in sample and reference populations? Did you colour code your sample the same as the European samples of the reference population ie light blue?
I think this issue might be related to issue #24.
I will include an additional recommended step for the processing of the reference sample in the vignette. Could you try including this step and test your ancestry estimation again?
I have added the commands for removing the AT/GC SNPs from reference and sample population on the development branch, the vignette is here, specifically the section Filter reference and study data for non A-T or G-C SNPs.
Hi, I did the step but not much has changed. Yes, it is really strange. I coded my european reference population as baby blue but I haven't changed the default for the study population. I then edited the command to change the default colour for the study population to see if anything changes but it remains baby blue (like the european population, despite the default being dark blue)
check <- check_ancestry(qcdir, name=name,prefixMergedDataset=prefixMergedDataset,europeanTh=1.5,refSamplesFile=paste(refdir, "/1000g_ID2POP.txt",sep=""),refColorsFile=paste(refdir, "/1000g_PopColors.txt",sep=""),studyColor="#b62c7b",run.check_ancestry=FALSE,interactive=FALSE,verbose=TRUE,showPlinkOutput=TRUE)
All of the IDs in my data fail the European threshold check.
Can you shared the .eigenvec file for the merged dataset?
On 29 May 2020, at 04:56, Roxanna KL notifications@github.com wrote:
Hi, I did the step but not much has changed. Yes, it is really strange. I coded my european reference population as baby blue but I haven't changed the default for the study population. I then edited the command to change the default colour for the study population to see if anything changes but it remains baby blue (like the european population, despite the default being dark blue)
check <- check_ancestry(qcdir, name=name,prefixMergedDataset=prefixMergedDataset,europeanTh=1.5,refSamplesFile=paste(refdir, "/1000g_ID2POP.txt",sep=""),refColorsFile=paste(refdir, "/1000g_PopColors.txt",sep=""),studyColor="#b62c7b",run.check_ancestry=FALSE,interactive=FALSE,verbose=TRUE,showPlinkOutput=TRUE)
All of the IDs in my data fail the European threshold check.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.
I fixed the issues with the colour of the study samples:
There still remains the issue of the shift in PC2.
Can you check how many SNPs were excluded in each of the steps outlined in the Ancestry estimation vignette?
Have you checked the log files for each of these steps for potential issues? all relevant files should be in $qcdir/plink_log
. Could you share those with me?
Do you have many duplicate IDs in your datasets?
from all_phase3.no_ac_gt_snps.log:
Warning: At least 308294 duplicate IDs in --exclude file.
from ABCD_data_with_sex.pruned.log:
Warning: At least 2775 duplicate IDs in --extract file.
And in ABCD_data_with_sex.merge.all_phase3.log:
Warning: Multiple positions seen for variant '1_Physical.Position_-_T'.
Warning: Multiple positions seen for variant '1_Physical.Position_-_A'.
Warning: Multiple positions seen for variant '1_Physical.Position_G_A'.
Warning: Multiple positions seen for variant '1_Physical.Position_A_G'.
Warning: Multiple positions seen for variant '1_Physical.Position_T_G'.
Warning: Multiple positions seen for variant '1_Physical.Position_A_G'.
So, the all_phase3.no_ac_gt_snps.log excludes the SNPs that AC GT (files $refname.ac_gt_snps) . I have checked the .fam file of this and there are 2,504 distinct (i.e. sample of 1000g phase 3) so I am confused as to why it would say 308,294.... ABCD_data_with_sex.pruned.log is the step where we extract the pruned SNPs. There are indeed 2,775 duplicated SNPs. As for the last message - this nomenclature '1Physical.Position-_T' appears in the study dataset. There these SNPs that have these strange names but I didn't think they would affect the analysis anyway as the ones that have complete data are not duplicate (also they are not in 1000 genomes dataset). For example, in the plink log it says '1_Physical.Position_T_G' is duplicated but when I grepped for it in the genetic dataset of ABCD , there was one entry. Do you think it'd be best to exclude these "1_Physical.Position***" IDS?
h <- bfile[grep("1Physical.Position-_T",bfile$V2),] h V1 V2 V3 V4 V5 V6 1: 1 1Physical.Position-_T 0 1737914 T - 2: 11 11Physical.Position-_T 0 2196232 - T 3: 21 21Physical.Position-_T 0 28951038 - T h <- bfile[grep("1_Physical.Position_A_C",bfile$V2),] h V1 V2 V3 V4 V5 V6 1: 1 1_Physical.Position_A_C 0 7928917 A C 2: 11 11_Physical.Position_A_C 0 9749645 C A 3: 21 21_Physical.Position_A_C 0 28568183 A C h <- bfile[grep("1_Physical.Position_A_G",bfile$V2),] h V1 V2 V3 V4 V5 V6 1: 1 1_Physical.Position_A_G 0 3058249 G A 2: 11 11_Physical.Position_A_G 0 309127 G A 3: 21 21_Physical.Position_A_G 0 30948745 G A
The duplicate IDs refers to duplicate SNP IDs; It strikes me as odd, that there would be that many duplicate SNPs. How did you generate the reference sample?
Can you check these in both your reference and study samples (replace data with the respective prefixes) by running this on the command line:
cut -f 2 data.bim |sort | uniq | wc -l
sort data.bim | uniq | wc -l
wc -l data.bim
The link for the pvar in this vignette (https://cran.r-project.org/web/packages/plinkQC/vignettes/Genomes1000.pdf) could not be found (I think it may have been blocked) so I went to this plink webpage (https://www.cog-genomics.org/plink/2.0/resources) and I downloaded the following: Common sample information file (not for chrY/chrM): original, pedigree-corrected Merged dataset (excludes extra chrM samples, requires 12 GiB RAM to work with): .pgen.zst (2.25 GiB), .pvar.zst (1.26 GiB),
~$$ cut -f 2 ~/ABCD_cleaning/reference/all_phase3.bim" | sort | uniq | wc -l 82305201 $$ sort ~/ABCD_cleaning/reference/all_phase3.bim | uniq | wc -l 84358426 $$ wc -l ~ABCD_cleaning/reference/all_phase3.bim 84358431
Thanks so much for your time!
From the output above:
There are a total of 84358431 variants in your file, 5 of which are exact copies of one another (84358431-84358426). Out of these, there are only 82305201 biallelic variant, ie variant for which the rsID describes a variant with only two possible alleles; all other variants either have multiple possible alleles, which will be listed on different lines in the .bim file or possibly even different annotations. This is most likely not inly the case in the reference, but also in your sample, as indicated by one of the warning messages I pasted above. This line appears multiple times:
Warning: Multiple positions seen for variant '1_Physical.Position_A_G'.
For QC relying on individual variants, those do not matter but in analysis across variant like these ones, we run into issues.
Can you run the commands above for your data set as well:
cut -f 2 data.bim |sort | uniq | wc -l
sort data.bim | uniq | wc -l
wc -l data.bim
I would then suggest, before any other filtering or merging, to remove any duplicate variants for ancestry check:
awk 'BEGIN {OFS="\t"} a[$2]++; (a[$2] == 2) {print $2}' \
$refdir/$refname.bim > \
$qcdir/$refname.duplicate_variants
plink --bfile $refdir/$refname \
--exclude $qcdir/$refname.ac_gt_snps \
--make-bed \
--out $qcdir/$refname.no_duplicate_variants
mv $qcdir/$refname.no_duplicate_variants.log $qcdir/plink_log/$refname.no_duplicate_variants.log
Do this for the reference (as shown above) and for your dataset. Then use qcdir/$refname.no_duplicate_variants
and qcdir/$yourdata.no_duplicate_variants
as input for the next steps and see if that changes anything. If you still run into issues, please send me the new log files.
Hello,
Thanks for the suggestions. I did them, however they still seem to fail the European check. I copied the individualQC.R code to update the code with your commit as when I reinstalled plink QC it said nothing had changed since last install. It displayed the following message could not find function "makepath". I think I managed to bypass this loading the plinkQC library and then copy pasting the bits of the code that had changed. However, he data points for the study data seem to be the same colour as the reference population as in the plot last time.
Many thanks
I currently only pushed the changes to development branch of this repository. Install with:
devtools::install_github("meyer-lab-cshl/plinkQC", ref="dev")
then reload and rerun your analysis.
From the log files you sent, it seems like the warnings about the duplicate names and positions are gone, which is good.
One last step that we can try to make sure everything is the same in reference and study sample, is removing any variants from your study sample, after the qc and matching of the reference:
cut -f 2 all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim > $qcdir/variants_to_keep
plink --bfile $qcdir/$name.pruned \
--extract $qcdir/variants_to_keep \
--make-bed \
--out $qcdir/$name.pruned.reference_matched
followed by the merge and pca:
plink --bfile $qcdir/$name.pruned.reference_matched \
--bmerge $qcdir/$refname.clean.bed $qcdir/$refname.clean.bim \
$qcdir/$refname.clean.fam \
--make-bed \
--out $qcdir/$name.merge.$refname
plink --bfile $qcdir/$name.merge.$refname \
--pca \
--out $qcdir/$name.$reference
Try running these steps and re-doing the analysis with the newly installed version of plinkQC.
Hi,
Thanks. The plot colours worked. I tried the following code but there still seems to be something going on.
name="ABCD_data_with_sex" refname="all_phase3"
cut -f 2 all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim > $qcdir/variants_to_keep plink --bfile $qcdir/$name.no_duplicate_variants.no_ac_gt_snps.pruned \ --extract $qcdir/variants_to_keep \ --make-bed \ --out $qcdir/$name.pruned.$refname_matched
plink --bfile $qcdir/$name.pruned.$refname_matched \ --bmerge $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bed $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim \ $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.fam \ --make-bed \ --out $qcdir/$name.merge.$refname mv $qcdir/$name.no_duplicate_variants.no_ac_gt_snps.merge.$refname.log $qcdir/plink_log
plink --bfile $qcdir/ABCD_data_with_sex.merge.all_phase3 \ --pca \ --out $qcdir/$name.merge.$refname mv $qcdir/$name.$refname.log $qcdir/plink_log
glad to see the devtools install worked on the coloring issue is resolved.
I admit, I am running a bit out of ideas why this shift in PC2 is happening - I have not noticed anything like it before.
Have you checked that all your awk scripts have worked? Do the $qcdir/$name.pruned.$refname_matched.bim
and $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim
have the same amount of variants?
ie
wc -l $qcdir/$name.pruned.$refname_matched.bim
wc -l $qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean.bim
are the same? Could you check if there are any weird patterns of missingness between the two datasets? First use the merged dataset to designate your samples as 'cases', the reference samples as controls via:
plink -bfile $qcdir/ABCD_data_with_sex.merge.all_phase3 --make-pheno $qcdir/$name.pruned.$refname_matched.fam '*'" --make-bed --out $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno
the test for differential missingness between the cases and controls:
plink -bfile $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno --test-missing --out $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno
Then check the output file $qcdir/ABCD_data_with_sex.merge.all_phase3.pheno.missing
for singnificant differential missing variants (i.e. p < 0.001 or so).
Thank you so much for the recommendations. The differential missingness was p<0.001 for 108,216 SNPs out of the total 189,337 variants in the files (a lot!). Would the PCA analysis work the same if I removed the SNPs with differential missingness (p<0.001) leaving 81,121 SNPs or would it better to use a different reference panel such as the HRC?
Can you run the per individual control without the ancestry estimation and then do the per variant qc? How many samples and variants would pass your filters?
On 1 Jun 2020, at 18:01, Roxanna KL notifications@github.com wrote:
Thank you so much for the recommendations. The differential missingness was p<0.001 for 108,216 SNPs out of the total 189,337 variants in the files (a lot!). Would the PCA analysis work the same if I removed the SNPs with differential missingness (p<0.001) leaving 81,121 SNPs or would it better to use a different reference panel such as the HRC?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.
How many differential missing ones are there at a lower p-value threshold like p < 0.00001? You can find them straight away by adding the --pfilter 0.00001
flag to the command.
Hi I've submitted the per-individual report to run. I've checked for SNPs at p<0.00001 and there are are 74,401 variants.
While you are waiting for your study qc, could you also run both qc step on your processed reference sample ($qcdir/all_phase3.no_duplicate_variants.no_ac_gt_snps.clean)? This high degree of differential missingness is surprising to me, and I would suspect either of the merged datasets to have some qc issues. The reference data shouldn't have but maybe something went wrong in the conversion.
I did worry that I may have downloaded the wrong files but they seem correct. Do you mean to rerun the steps here https://meyer-lab-cshl.github.io/plinkQC/articles/AncestryCheck.html?
No, I mean the standard QC, but on the reference data:
# set path to plink software
path2plink="path/to/your/plink/software"
fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name="all_phase3.no_duplicate_variants.no_ac_gt_snps.clean", dont.check_ancestry =TRUE, path2plink=path2plink, verbose=TRUE, interactive=TRUE) overview_individuals <- overviewPerIndividualQC(fail_individuals, interactive=TRUE)
2.
```r
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name="all_phase3.no_duplicate_variants.no_ac_gt_snps.clean",
path2plink=path2plink,
verbose=TRUE, interactive=TRUE)
overview_marker <- overviewPerMarkerQC(fail_markers, interactive=TRUE)
Quite a few individuals and markers failed QC.
Is this your data or the reference data set?
On 2 Jun 2020, at 20:00, Roxanna KL notifications@github.com wrote: Quite a few individuals and markers failed QC.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.
its the reference dataset.
I ran the following, setting dont.check_relatedness=FALSE, as I had already generated this file (which took a long time) and I've attached the output. I thought that would incorporate the already created relatedness fail IBD file but I don't think it has, as shown in the figures. Is there any way of incorporating already created files into the code? This is QC on the original study genetic dataset.
(https://user-images.githubusercontent.com/30495388/83638286-daa7c580-a5a0-11ea-9725-be1514a21ca5.png) ![overview_individuals_ABCD]
fail_individuals <- perIndividualQC(indir=qcdir, qcdir=qcdir, name="ABCD_data_with_sex", dont.check_ancestry =TRUE,dont.check_relatedness=TRUE, path2plink=path2plink, verbose=TRUE, interactive=TRUE)
ggplot2::ggsave("fail_individuals_ABCD.png")
overview_individuals <- overviewPerIndividualQC(fail_individuals, interactive=FALSE) ggplot2::ggsave("overview_individuals_ABCD.png")
fail_markers <- perMarkerQC(indir=qcdir, qcdir=qcdir, name="all_phase3.no_duplicate_variants.no_ac_gt_snps.clean", path2plink=path2plink, verbose=TRUE, interactive=FALSE) ggplot2::ggsave("fail_markers_ABCD.png")
overview_marker <- overviewPerMarkerQC(fail_markers, interactive=FALSE)
ggplot2::ggsave("overview_markers_ABCD.png"
no, currently there is no option to provide an already computed list of fail-IDs, but that might be something to consider implementing for the future. But in this case, I was also more interested in individuals failing due to missingness and outlying heterozygosity. Could you show the perIndividualQC
results plots as well please?
just as a point of caution: here, you ran perIndividualQC
on the whole dataset, but perMarkerQC
on the subset used for the merge only. This might have been accidental, but want to caution you to use these results downstream.
I do not know what is going wrong between your samples and the merge with the reference data. I would recommend for now to try and download the HapMap dataset as described in this vignette and try the merge with these and see if the problem persists.
Did you manage to solve this issue? If so, could you let us know what worked? Thanks
Sorry - I thought I had updated this. It was due to quite different allele frequencies for some SNPs between the study sample and 1000 genomes. Once, I filtered these out, it worked! Thanks for your help!
Could you provide the steps you have taken? It would be useful for other people that come across this thread to see what worked for you. I can also add it to the vignette.
I used a effect allele frequency diagnostic plot mapping the effect alllele frequencies of my study dataset and reference dataset.
library(data.table)
eaf_study <-fread("study_freq.frq") eaf_allphase3 <- fread("allphase3_freq.frq")
colnames(eaf_study)[5] <- "freq_study" eaf_study$identifier <- "study" eaf_allphase3$identifier <- "allphase3" colnames(eaf_allphase3)[5] <- "freq_allphase3" data_merge <- merge(eaf_study,eaf_allphase3,by="SNP")
tiff("~qcdir/scatterplot.tiff") plot <- plot(x=data_merge$freq_allphase3, y=data_merge$freq_study,xlab="EAF_1000g",ylab="EAF_study") dev.off()
This made me realise that one step in the pre-processing had not worked due to an error carried forward in the naming :( .
Thank you for providing the information. Before I close this issue, can I confirm that once you solved this naming error, you arrived at merged reference set suitable for the analysis based on the vignette I provided? If not, what did you have to add in the vignette? Thanks
Yes, everything worked great and sorry for taking up your time on this!
Glad you got it sorted in the end. Happy to help.
Hi Just to update you, the relatedness check worked and I think it just needed more time than I had initially thought to complete (It completed in 90 hours). In the end, 1756 individuals were in the fail-IBDs text file.
I'd like to ask another question (sorry it's my first time doing the QC). I generated a PCA plot with the 1000 genomes phase 3 data and did the preprocessing in this link (https://cran.r-project.org/web/packages/plinkQC/vignettes/Genomes1000.pdf) but my plot looks rotated. Can you suggest a reason this may happen?
Many thanks
Originally posted by @rskl92 in https://github.com/meyer-lab-cshl/plinkQC/issues/23#issuecomment-634227553