sr320 / LabDocs

Roberts Lab Documents
http://sr320.github.io/LabDocs/
9 stars 17 forks source link

Bis-SNP vs MethylExtract? #421

Closed hputnam closed 7 years ago

hputnam commented 7 years ago

http://people.csail.mit.edu/dnaase/bissnp2011/ https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-7-r61

http://bioinfo2.ugr.es/MethylExtract/ https://www.ncbi.nlm.nih.gov/pubmed/24627790

It looks like MethylExtract is being actively supported. Do we want to think about this for the big geoduck data set, or is Bis-SNP performing well @seanb80 ?

seanb80 commented 7 years ago

I was just writing something up, but I can copy it here! As far as MethylExtract, it works, but it seems to output SNP and CG Methylation files for each individual scaffold, resulting in lots of files. I'm working on a way to manage those now, but the active support seems to be a good thing!

seanb80 commented 7 years ago

The new geoduck data came in today and I wanted to write down the outline for the process I was imagining for the analysis.

  1. Trim Galore. No RRBS option, paired end reads. Default everything else

  2. Bismark Genome Prep. Using the 10k-Scaff file.

  3. Bismark. 1 allowed mismatch (--n 1) (Bismark had better mapping efficiency for 100bp reads than BSMap, and seems to run faster than BSMap, along with being more palatable for downstream analysis)

  4. Methyl Extract with both Methylation and SNP calls. Arguments for SNP calling are minQuality = 30 (minQ =30), Min Nucleotide Frequency = 0.05 (varFraction=0.05), alpha level 0.01 (maxPval=0.01). For Paired end reads I'm using Watson sam flags 99 and 147 (flagW=99,147) and Crick sam flags 83 and 163 (flagC=83,163). Also getting wig and bed files output.

  5. From here, we need to get a relatedness matrix for the MACAU model. I believe we can use either vcftools with the --relatedness argument, or a program called PLINK, but I'm in the midst of reading about this so cannot be 100% for sure.

  6. Once we have the Relatedness matrix, the MACAU model should give us significantly differentially methylated regions by going though and looking at all regions via their model.

  7. After we have DMRs, where do we want to go?

hputnam commented 7 years ago

Is it the 70K file for geoduck? https://github.com/laurahspencer/546-Bioinformatics/tree/master/2016-10_Geo-Ann-Project

Also will you be running the lambda phage genome with steps 1-4?

@seanb80 @sr320 could we schedule a meeting to discuss pipeline and model?

sr320 commented 7 years ago

How about Wednesday for meeting? On Mon, Jan 9, 2017 at 6:27 PM Hollie Putnam notifications@github.com wrote:

Is it the 70K file for geoduck? https://github.com/laurahspencer/546-Bioinformatics/tree/master/2016-10_Geo-Ann-Project

Also will you be running the lambda phage genome with steps 1-4?

@seanb80 https://github.com/seanb80 @sr320 https://github.com/sr320 could we schedule a meeting to discuss pipeline and model?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/sr320/LabDocs/issues/421#issuecomment-271468964, or mute the thread https://github.com/notifications/unsubscribe-auth/AEPHtxeMHERubdv_zKqvEbwaM3QFbTU4ks5rQuxxgaJpZM4LexpX .

sr320 commented 7 years ago

Yes Laura did 70k - I think I started the 10k annotations

On Mon, Jan 9, 2017 at 6:27 PM Hollie Putnam notifications@github.com wrote:

Is it the 70K file for geoduck? https://github.com/laurahspencer/546-Bioinformatics/tree/master/2016-10_Geo-Ann-Project

Also will you be running the lambda phage genome with steps 1-4?

@seanb80 https://github.com/seanb80 @sr320 https://github.com/sr320 could we schedule a meeting to discuss pipeline and model?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/sr320/LabDocs/issues/421#issuecomment-271468964, or mute the thread https://github.com/notifications/unsubscribe-auth/AEPHtxeMHERubdv_zKqvEbwaM3QFbTU4ks5rQuxxgaJpZM4LexpX .

hputnam commented 7 years ago

Yes, can do! After lab meeting?

laurahspencer commented 7 years ago

FYI Hollie all the final files are in the results folder on my 546 repo.

sr320 commented 7 years ago

Sounds good - will do after lab meeting, On Mon, Jan 9, 2017 at 6:39 PM Hollie Putnam notifications@github.com wrote:

Yes, can do! After lab meeting?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/sr320/LabDocs/issues/421#issuecomment-271470645, or mute the thread https://github.com/notifications/unsubscribe-auth/AEPHtyhSbAdN9IFrR52T6SuhhTkCW5xRks5rQu8lgaJpZM4LexpX .

seanb80 commented 7 years ago

I was planning on running steps 2-4 with both the lambda and the geoduck. Sorry I didn't specify that!

seanb80 commented 7 years ago

Post regarding MethylExtract for discussion at the meeting is here

Upshot: Something is weird with Methyl Extract and it's not completing it's run when using an input directory with all sample files, so I'm running it with a single file to see if that works better/differently. Also going to get BisSNP going as well, as that may be a functioning option.

hputnam commented 7 years ago

Was the same reference genome used for Bismark and MethylExtract? I noticed some differences in scaffold numbers between the CoGe data and Bismark data, so was thinking different files may have been mapped to...

seanb80 commented 7 years ago

Yes, the same reference file was used.

seanb80 commented 7 years ago

methyextractsuccess

Looks like it will work on a single file, just not multiple. I think this is caused by an issue with how they form the hashmap they search over with a multi-fasta file, but my perl isn't super great so it's just a guess.

seanb80 commented 7 years ago

Output for MethylExtract on EPI-103 found here.

Warning, the .zip file in there contains a .sam file for every scaffold in the reference genome, and has ~12,500 files in it. I copied the summary files to the directory, which may be more useful.

seanb80 commented 7 years ago

Just showing off what I was getting ready to start. It should bring every file to a point where methylation and SNP calls are done. Will update with the second half soon, just wanted to get the time intensive stuff running.

Let me know if you guys have any comments/questions/concerns!

http://rpubs.com/seanb80/241330

note: all of the system commands are done as print statements, for readability, and so I can upload the thing today.

hputnam commented 7 years ago

Thanks Sean!

I can prepare the MACAU Predictor file if it is helpful. Info can also be pulled from here https://github.com/hputnam/project_juvenile_geoduck_OA/blob/master/Setup_Notes/Sample_List.csv

seanb80 commented 7 years ago

Updated http://rpubs.com/seanb80/241330

If my understanding of the Predictor file from the manual is correct, then it would look something like

Sample Tank Treatment TimePoint
EPI-41 0 0 0
EPI-42 0 0 0
EPI-45 11 1 1
EPI-47 12 2 1

without any column or row names. Tank is the tank number, treatment would be coded 0, 1, 2 for, in this example case, to correspond with Field, Low, and Super Low, and time point is 0, 1, 2 for the specific time point

In the end we would have Treatment codings of 0,1,2,3,4,5,... for Field, Ambient, Low, Super Low, Ambient+Ambient, Ambient+Low, ... for all final combinations of Ambient, Low, and Super Low, correct?

seanb80 commented 7 years ago

Another thought on the MACAU input files I just realized.

Methyl Extractor outputs Methylated/all sites separated by watson and crick strands, whereas it looks like MACAU only wants total methylated/total site.

Would it be appropriate to sum the two strands together for the final MACAU input files?

hputnam commented 7 years ago

Those changes sound good Sean. There is no need for tank number, as this is not a consistent factor across the experiment.

For Macau we will split the analysis into two different models, so we can make two different predictor files. I guess this would need two different kinship matrices... one for days 0,10,135 and one for day 145

seanb80 commented 7 years ago

So overnight we've managed to process 4 samples, and are well on our way through a fifth. This is likely going to take a while, but it's working, which is something.

seanb80 commented 7 years ago

Predictor.xlsx

Quick stab at the predictor file. I left the sample names for now to be able to match to the Methylation file later, but coding should be as it should be!

seanb80 commented 7 years ago

It's a relatedness matrix!

relatedness

Psuedocode notebook for how this is done here

sr320 commented 7 years ago

I will go ahead and close as we are fading off topic... Also might be good to create new issues under Paper repo so we can easily capture conversations?