Closed hputnam closed 7 years ago
I was just writing something up, but I can copy it here! As far as MethylExtract, it works, but it seems to output SNP and CG Methylation files for each individual scaffold, resulting in lots of files. I'm working on a way to manage those now, but the active support seems to be a good thing!
The new geoduck data came in today and I wanted to write down the outline for the process I was imagining for the analysis.
Trim Galore. No RRBS option, paired end reads. Default everything else
Bismark Genome Prep. Using the 10k-Scaff file.
Bismark. 1 allowed mismatch (--n 1) (Bismark had better mapping efficiency for 100bp reads than BSMap, and seems to run faster than BSMap, along with being more palatable for downstream analysis)
Methyl Extract with both Methylation and SNP calls. Arguments for SNP calling are minQuality = 30 (minQ =30), Min Nucleotide Frequency = 0.05 (varFraction=0.05), alpha level 0.01 (maxPval=0.01). For Paired end reads I'm using Watson sam flags 99 and 147 (flagW=99,147) and Crick sam flags 83 and 163 (flagC=83,163). Also getting wig and bed files output.
From here, we need to get a relatedness matrix for the MACAU model. I believe we can use either vcftools with the --relatedness argument, or a program called PLINK, but I'm in the midst of reading about this so cannot be 100% for sure.
Once we have the Relatedness matrix, the MACAU model should give us significantly differentially methylated regions by going though and looking at all regions via their model.
After we have DMRs, where do we want to go?
Is it the 70K file for geoduck? https://github.com/laurahspencer/546-Bioinformatics/tree/master/2016-10_Geo-Ann-Project
Also will you be running the lambda phage genome with steps 1-4?
@seanb80 @sr320 could we schedule a meeting to discuss pipeline and model?
How about Wednesday for meeting? On Mon, Jan 9, 2017 at 6:27 PM Hollie Putnam notifications@github.com wrote:
Is it the 70K file for geoduck? https://github.com/laurahspencer/546-Bioinformatics/tree/master/2016-10_Geo-Ann-Project
Also will you be running the lambda phage genome with steps 1-4?
@seanb80 https://github.com/seanb80 @sr320 https://github.com/sr320 could we schedule a meeting to discuss pipeline and model?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/sr320/LabDocs/issues/421#issuecomment-271468964, or mute the thread https://github.com/notifications/unsubscribe-auth/AEPHtxeMHERubdv_zKqvEbwaM3QFbTU4ks5rQuxxgaJpZM4LexpX .
Yes Laura did 70k - I think I started the 10k annotations
On Mon, Jan 9, 2017 at 6:27 PM Hollie Putnam notifications@github.com wrote:
Is it the 70K file for geoduck? https://github.com/laurahspencer/546-Bioinformatics/tree/master/2016-10_Geo-Ann-Project
Also will you be running the lambda phage genome with steps 1-4?
@seanb80 https://github.com/seanb80 @sr320 https://github.com/sr320 could we schedule a meeting to discuss pipeline and model?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/sr320/LabDocs/issues/421#issuecomment-271468964, or mute the thread https://github.com/notifications/unsubscribe-auth/AEPHtxeMHERubdv_zKqvEbwaM3QFbTU4ks5rQuxxgaJpZM4LexpX .
Yes, can do! After lab meeting?
FYI Hollie all the final files are in the results folder on my 546 repo.
Sounds good - will do after lab meeting, On Mon, Jan 9, 2017 at 6:39 PM Hollie Putnam notifications@github.com wrote:
Yes, can do! After lab meeting?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/sr320/LabDocs/issues/421#issuecomment-271470645, or mute the thread https://github.com/notifications/unsubscribe-auth/AEPHtyhSbAdN9IFrR52T6SuhhTkCW5xRks5rQu8lgaJpZM4LexpX .
I was planning on running steps 2-4 with both the lambda and the geoduck. Sorry I didn't specify that!
Post regarding MethylExtract for discussion at the meeting is here
Upshot: Something is weird with Methyl Extract and it's not completing it's run when using an input directory with all sample files, so I'm running it with a single file to see if that works better/differently. Also going to get BisSNP going as well, as that may be a functioning option.
Was the same reference genome used for Bismark and MethylExtract? I noticed some differences in scaffold numbers between the CoGe data and Bismark data, so was thinking different files may have been mapped to...
Yes, the same reference file was used.
Looks like it will work on a single file, just not multiple. I think this is caused by an issue with how they form the hashmap they search over with a multi-fasta file, but my perl isn't super great so it's just a guess.
Output for MethylExtract on EPI-103 found here.
Warning, the .zip file in there contains a .sam file for every scaffold in the reference genome, and has ~12,500 files in it. I copied the summary files to the directory, which may be more useful.
Just showing off what I was getting ready to start. It should bring every file to a point where methylation and SNP calls are done. Will update with the second half soon, just wanted to get the time intensive stuff running.
Let me know if you guys have any comments/questions/concerns!
http://rpubs.com/seanb80/241330
note: all of the system commands are done as print statements, for readability, and so I can upload the thing today.
Thanks Sean!
I can prepare the MACAU Predictor file if it is helpful. Info can also be pulled from here https://github.com/hputnam/project_juvenile_geoduck_OA/blob/master/Setup_Notes/Sample_List.csv
Updated http://rpubs.com/seanb80/241330
I added links to the used genomes,
multiqc is added in (though it's at the end of the entire sequence as each sample is brought through the entire pipeline sequentially, to hopefully save a bit of disk space),
Sample ID list is updated with all 53 samples
I won't add EPI-135WG to the merged SNP file!
If my understanding of the Predictor file from the manual is correct, then it would look something like
Sample | Tank | Treatment | TimePoint |
---|---|---|---|
EPI-41 | 0 | 0 | 0 |
EPI-42 | 0 | 0 | 0 |
EPI-45 | 11 | 1 | 1 |
EPI-47 | 12 | 2 | 1 |
without any column or row names. Tank is the tank number, treatment would be coded 0, 1, 2 for, in this example case, to correspond with Field, Low, and Super Low, and time point is 0, 1, 2 for the specific time point
In the end we would have Treatment codings of 0,1,2,3,4,5,... for Field, Ambient, Low, Super Low, Ambient+Ambient, Ambient+Low, ... for all final combinations of Ambient, Low, and Super Low, correct?
Another thought on the MACAU input files I just realized.
Methyl Extractor outputs Methylated/all sites separated by watson and crick strands, whereas it looks like MACAU only wants total methylated/total site.
Would it be appropriate to sum the two strands together for the final MACAU input files?
Those changes sound good Sean. There is no need for tank number, as this is not a consistent factor across the experiment.
For Macau we will split the analysis into two different models, so we can make two different predictor files. I guess this would need two different kinship matrices... one for days 0,10,135 and one for day 145
The first statistical model will be Treatment x Timepoint for day 0, 10, and 135 samples.
The second statistical model will be Initial X Secondary for day 145 samples
So overnight we've managed to process 4 samples, and are well on our way through a fifth. This is likely going to take a while, but it's working, which is something.
Quick stab at the predictor file. I left the sample names for now to be able to match to the Methylation file later, but coding should be as it should be!
It's a relatedness matrix!
Psuedocode notebook for how this is done here
I will go ahead and close as we are fading off topic... Also might be good to create new issues under Paper repo so we can easily capture conversations?
http://people.csail.mit.edu/dnaase/bissnp2011/ https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-7-r61
http://bioinfo2.ugr.es/MethylExtract/ https://www.ncbi.nlm.nih.gov/pubmed/24627790
It looks like MethylExtract is being actively supported. Do we want to think about this for the big geoduck data set, or is Bis-SNP performing well @seanb80 ?