Open hollygene opened 3 years ago
Do we want to filter by alignment quality or mappability, or both?
Difference is: Alignment quality is the quality of matching between your sample and the reference Mapping quality is the confidence that the read is actually mapped to the right site
Ex: could have a read that aligns perfectly in several locations, this gives it a high alignment score, but you don't have confidence where it actually belongs so you have a low mappability score
Resources: GeM for mappability: https://www.biorxiv.org/content/10.1101/611160v1.full
Gblocks for alignment: http://molevol.cmima.csic.es/castresana/Gblocks.html
Problem running treeWAS on R on server: running into memory issues
Still had issues running treeWAS: I’m pretty sure the reason I can’t get it to work is because we have 500k SNPs. I even cut down the number of SNPs by removing some duplicates, but the matrix is still ~600 x 500000 which makes for 300 million data points… A task R just can’t really handle. I’m thinking if we run Gblocks on the alignment, this might eliminate some places where there were a lot of SNPs (poor alignment = more SNPs?). Then on the newer, smaller alignment, we can use snp-sites again, and hopefully this will give us a more reasonable number of SNPs. Also, treeWAS eliminates any samples that don’t have phenotypic data, so it will be best to get rid of those samples in the alignment as well (for this analysis).
Ran Gblocks on core gene alignment https://github.com/hollygene/CornellPostdoc/blob/c854039b23794a72b4e474a49ee9b5eb61836f4f/hpc_scripts.sh#L58-L60
Running snp-sites on new Gblocks alignment now
output files from Gblocks: Ecoli_subset_snpsGblocks.snp_sites.aln
This file is subsettted to only the samples we have phenotypic data for as well as the one long branch sequence removed
treeWAS output: print(out,sort.by.p=TRUE) treeWAS.combined: pooled set of significant loci identified by any association score treeWAS: a list with the sets of significant loci determined by each association score individually
Some samples in phenotypic dataset that aren't dogs: strain ECOL-17-VL-LA-KS-0009 Bos taurus strain ECOL-17-VL-LA-TX-0003 Canis rufus strain ECOL-17-VL-SD-NC-0028 Felis catus strain ECOL-17-VL-LA-KS-0046 Felis catus strain ECOL-17-VL-LA-KS-0031 Sus scrofa domesticus strain ECOL-18-VL-SD-NC-0026 Equus ferus caballus strain ECOL-18-VL-SD-NC-0001 Equus ferus caballus strain ECOL-18-VL-SD-NC-0003 Felis catus strain ECOL-18-VL-LA-KS-0004 Sus scrofa domesticus strain ECOL-19-VL-LA-KS-0035 Felis catus
(Removed the above samples)
file with samples we want: https://github.com/hollygene/CornellPostdoc/blob/98881dff0653805c747fde0a7e3648f26356de6c/phenDogs_samples_wanted.txt
identified by assemblies
if that doesn't work I can use a different ID (strain, I think)
Run R on AHDC server:
screen export OMP_NUM_THREADS=8 R
Could be beneficial to use ClonalFrameML with treeWAS to correct for recombination
treeWAS has information on how to incorporate the two: https://github.com/caitiecollins/treeWAS/wiki/5.-ClonalFrameML-Integration
Maybe need to run Gblocks or GEM mappability on this dataset as there is that one odd sequence out in the tree