hollygene / CornellPostdoc

Various scripts I used during my tenure as a Postdoctoral Associate at Cornell University
0 stars 0 forks source link

Run treeWAS #3

Open hollygene opened 3 years ago

hollygene commented 3 years ago

Maybe need to run Gblocks or GEM mappability on this dataset as there is that one odd sequence out in the tree

hollygene commented 3 years ago

Do we want to filter by alignment quality or mappability, or both?

Difference is: Alignment quality is the quality of matching between your sample and the reference Mapping quality is the confidence that the read is actually mapped to the right site

Ex: could have a read that aligns perfectly in several locations, this gives it a high alignment score, but you don't have confidence where it actually belongs so you have a low mappability score

hollygene commented 3 years ago

Resources: GeM for mappability: https://www.biorxiv.org/content/10.1101/611160v1.full

Gblocks for alignment: http://molevol.cmima.csic.es/castresana/Gblocks.html

hollygene commented 3 years ago

Problem running treeWAS on R on server: running into memory issues

hollygene commented 3 years ago

Still had issues running treeWAS: I’m pretty sure the reason I can’t get it to work is because we have 500k SNPs. I even cut down the number of SNPs by removing some duplicates, but the matrix is still ~600 x 500000 which makes for 300 million data points… A task R just can’t really handle. I’m thinking if we run Gblocks on the alignment, this might eliminate some places where there were a lot of SNPs (poor alignment = more SNPs?). Then on the newer, smaller alignment, we can use snp-sites again, and hopefully this will give us a more reasonable number of SNPs. Also, treeWAS eliminates any samples that don’t have phenotypic data, so it will be best to get rid of those samples in the alignment as well (for this analysis).

hollygene commented 3 years ago

Ran Gblocks on core gene alignment https://github.com/hollygene/CornellPostdoc/blob/c854039b23794a72b4e474a49ee9b5eb61836f4f/hpc_scripts.sh#L58-L60

Running snp-sites on new Gblocks alignment now

hollygene commented 3 years ago

output files from Gblocks: Ecoli_subset_snpsGblocks.snp_sites.aln

This file is subsettted to only the samples we have phenotypic data for as well as the one long branch sequence removed

hollygene commented 3 years ago

treeWAS output: print(out,sort.by.p=TRUE) treeWAS.combined: pooled set of significant loci identified by any association score treeWAS: a list with the sets of significant loci determined by each association score individually

hollygene commented 3 years ago

Some samples in phenotypic dataset that aren't dogs: strain ECOL-17-VL-LA-KS-0009 Bos taurus strain ECOL-17-VL-LA-TX-0003 Canis rufus strain ECOL-17-VL-SD-NC-0028 Felis catus strain ECOL-17-VL-LA-KS-0046 Felis catus strain ECOL-17-VL-LA-KS-0031 Sus scrofa domesticus strain ECOL-18-VL-SD-NC-0026 Equus ferus caballus strain ECOL-18-VL-SD-NC-0001 Equus ferus caballus strain ECOL-18-VL-SD-NC-0003 Felis catus strain ECOL-18-VL-LA-KS-0004 Sus scrofa domesticus strain ECOL-19-VL-LA-KS-0035 Felis catus

hollygene commented 3 years ago

(Removed the above samples)

hollygene commented 3 years ago

file with samples we want: https://github.com/hollygene/CornellPostdoc/blob/98881dff0653805c747fde0a7e3648f26356de6c/phenDogs_samples_wanted.txt

identified by assemblies

if that doesn't work I can use a different ID (strain, I think)

hollygene commented 3 years ago

Run R on AHDC server:

screen export OMP_NUM_THREADS=8 R

hollygene commented 3 years ago

Could be beneficial to use ClonalFrameML with treeWAS to correct for recombination treeWAS has information on how to incorporate the two: https://github.com/caitiecollins/treeWAS/wiki/5.-ClonalFrameML-Integration