Open nmatthews323 opened 6 years ago
Did a test-run of the MCA script, still lots of decisions on various parameters and settings to be made but initially looks interesting.
I'll continue tidying up this script ready for a full run-through once phasing is done.
Right, annotation pipeline has finished and looks OK. Annotated file is here: "/segmentation_2018/LociRun2018_multi200_gap100/gr_fdr0.05_41c2431.RData"
Three things I need to consult you on:
Determining cutoffs for annotations The density plots for the current cut-offs used are in here: [https://github.com/seb-mueller/chlamy_locus_map/tree/master/Plots/DensityPlots]() I think they look OK, but be good to get your thoughts.
Determine annotations to use for the MCA. I've made a table of all the column names that we could use, and I've done a first pass of what should be used. Be good to get your thoughts: [https://github.com/seb-mueller/chlamy_locus_map/blob/master/Annotation2Use.csv]()
Little added thing, I've been going through the countingbiases function just checking everything works properly. Generally seemed ok once I'd worked through it but I did find that the repetitiveness class calculation had been updated to use classCIs in the Arabidopsis code, so I've copied this over for re-running: https://github.com/seb-mueller/chlamy_locus_map/blob/394c1935ece41cfd0fc2b11e9c34b958f3c8db53/Scripts/chlamy_source_code.R#L545-L557
Got sidelined on this so still working on MCA script. Let me know your thoughts on the things above and I'll re-run the annotation pipeline.
I've numbered the 3 issues above to better refer to it:
Totally agree. It used to be looking only at the first and Tom changed it, but would rather go back to the simpler form. Do you want to give it a shot?
I'll have a closer look soon and update this comment
That's a quite extensive selection. Nice. There is still some redundancy, e.g. ratio20vs21
and ratio20vs21Class
etc. Since MCA works on categorical data, we'd have to exclude the numeric ones in favor of the categerical ones (with Class in it's name). That was in fact the whole point on setting thresholds on the density plots. I suppose that's what the PrimaryAnno
column is for? If yes, the TRUE
rows seem set sensibly and we can roll with it.
Did a test-run of the MCA script, still lots of decisions on various parameters and settings to be made but initially looks interesting.
I'll continue tidying up this script ready for a full run-through once phasing is done.
Thanks Seb:
1) I've had another look at the code, I misunderstood it - if it returns "AG" that means both A and G differ significantly from the background levels, so it should probably be kept like that, otherwise it arbitrarily returns only one of the significant bases.
2) Thanks!
3) Yep, PrimaryAnno is intended as a column for selecting which annotations to use, the SupAnno are not used for clustering but the clusters are annotated in the heatmap as to how enriched they are with that annotation.
4) Agree with removing redundancy, but shouldn't we keep the distinction between CDS and the UTRs? As differential targeting of UTRs or CDS could be kind of interesting biologically.
Good catch and agreed, I misunderstood too, let's keep it that way than.
True, let's keep CDS and both UTRs in, but we can drop exons
Great, I'm giving a presentation this morning but will try and send off another annotation run this afternoon.
Good luck! Before you run it again, I'd like to fiddle a bit with the code, have you committed all outstanding changes? Edit: In fact, I'd like to give it a go for running it to better get an for the workflow? ok?
Sorry only saw the "good luck" message and then got sidelined this afternoon. Happy for you to go ahead, all my updates have been committed. The update to the repetitiveness calculation described above and the addition of the phasing annotation are the main new additions so watch out for bugs in that. I'll focus on the MCA script when I get a chance.
LociRun2018_multi200_gap100_8ab6f64
):
Thresholds could be improved imo, for example this one should be more like 0.2 and 0.5:
SmallvsNormal_0.3_0.7.pdf
Also the strandbias could be improved, basically the threshold should be located in the valleys or the densitiy plot which usually constitute populations, which is not the case here:
I'm currently seperating the code to seperate the heavy duty form the threshold setting so it is easier to play around. I'm running it atm and will play around with the thresholds reporting back here soon..
Looks good, agree on first one, second one needs to be symmetrical ideally. - so 0.2 and 0.8 should definitely be cut-offs for high. Then 0.4 and 0.6 might be reasonable. Ideally probably 0.42 (where the kink is) and 0.58 (where the valley is).
Probably best is to get all the graphs together for sRNA size biases, repetitiveness, strand bias, and locus size distribution then have a quick phone call to confirm the cut-offs?
Below are the plots (one raw, one log) for the size classes with the different cut-offs (0,100,400,1500,3000) as vertical lines, not sure they're that great at the moment, but also not sure where I'd put them...:
I noticed there are a few very large loci:
width(gr[width(gr) > 7000]) [1] 11261 7574 7591 7469 9838 8739 9981 11029 11672 15201
Is this a problem?
True. The strand-bias ones need to be symetrical, and yes, let's talk about them quickly.. Just about to test the code.. are free tomorrow? Maybe in the afternoon (anytime after 2pm)? We can then discuss the other bits above.
Yep tomorrow is good. 3pm?
On Mon, 10 Dec 2018, 4:37 pm seb-mueller <notifications@github.com wrote:
True. The strand-bias ones need to be symetrical, and yes, let's talk about them quickly.. Just about to test the code.. are free tomorrow? Maybe in the afternoon (anytime after 2pm)? We can then discuss the other bits above.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/seb-mueller/chlamy_locus_map/issues/13#issuecomment-445881937, or mute the thread https://github.com/notifications/unsubscribe-auth/AkvVh4NQqWyV0WtNMA4dB1ggS6JRhabFks5u3o3jgaJpZM4YxBLd .
Had another thing and found the species size classes too complicated, e.g. hard to interpret as well as dependent on too many parameters.
Can you have a look at the last commit?
https://github.com/seb-mueller/chlamy_locus_map/commit/53849af07dfedfc3a0dfa1fe9a5b0fdeb2d4a96f
Especially around this line:
Basically, this is just looking into which sRNA sizes are predominant in each loci.
Also, I've set the threshold as you suggested for standbias and repetetivness, which now looks good I believe. Shall we try another call maybe tomorrow afternoon?
Looks good, table of predominant 5' base is very informative!
Noticed this comment, not sure what annotation it refers to though: https://github.com/seb-mueller/chlamy_locus_map/blob/53849af07dfedfc3a0dfa1fe9a5b0fdeb2d4a96f/Scripts/chlamy_annotation_pipeline.R#L181
Let's chat quickly, I'm free all afternoon just name a time.
Yes, indeed, got catch, planed to discuss this with you indeed. I'd call you then around 3pm, this time for real :)
Next steps after discussion 12/12/2018: -Make density plot for phasing (Seb) -Decide if removing Loci (Seb) -Run annotation script final time (Seb) -Run MCA scripts - determine settings (Nick)
Made some progress:
phasing density plot phasing_densityLociRun2018_multi200_gap100.pdf I've set the threshold to (0,60, Inf) since it seperates from main peak and the rest: https://github.com/seb-mueller/chlamy_locus_map/blob/3f6b57abf2438e5f4b2b27eb988f473dd0db9523/Scripts/chlamy_annotation_pipeline.R#L182
Run annotation
Done, see results in LociRun2018_multi200_gap100_3f6b57a
corresponding to this commit:
https://github.com/seb-mueller/chlamy_locus_map/commit/3f6b57abf2438e5f4b2b27eb988f473dd0db9523
Thanks, looks great, agree on phasing settings.
Annotation looks good but with one catch, it didn't compute methylation as it's commented out. I think it's just this line that needs running:
Could either re-run the whole thing or just make a small update to the object, not sure how best to do it with the versioning/git fingerprints though? Can I leave this with you?
Ahh, that's what I forgot to ask... I did indeed take it out since I couldn't find the orginal raw data for it, do you have an idea where to find this? Also, did we already know if the methylation got back useful results in the first place, i.e. did you have a look at this before and found associations?
There were two kinds of methylation data originally. The first one was a non-specific methIP which we decided to exclude early on as it was of unclear origin and old. The other data is bisulfite sequencing derived data from which Tom called enriched loci. It's of pretty low coverage by the looks of it and the CHG and CHH have very low numbers of loci but CG looked to get some good associations last time around.
The files are in: "/projects/nick_matthews/resources/meth_data". The methylation function should (hopefully) work with this.
Around for a call if you need more info.
All right! Just had a look, this folder only contains the processed gff files. To you know where the original fastq files are? Or how those gff were created anyway?
Having a look for more info now!
Ok, got more information. It's from bisulphite sequencing data produced by Andy (Bassett?). Digging in Tom's file system you can find the folder under: "/home_old/tjh48/Code/Andy_methylation"
In there it looks like Tom used yama to align the bisulphite sequencing data.
He then used segmentSeq to call loci ("processMeth_CG.R") and selected loci with FDR=0.1 ("#lociSelect.R#"). Looks like he just used the first, heuristics based functionality of segmentSeq and didn't do the baysian step (not sure it'd work with this data anyway).
Buttt I can't find the raw data for this anywhere - the fastq files referred to in "yama.R" are no longer in the folder.
I've found the raw data:
sm934@node13.plantsci.internal.cam.ac.uk/projects/tjh48/Andy_methylation/fastq☺➤ls *.fastq [INS]
SLX-2767-R_1.fastq SLX-2767-R_2.fastq SLX-2768-R_1.fastq SLX-2768-R_2.fastq SLX-2769-R_1.fastq SLX-2769-R_2.fastq SLX-2770-R_1.fastq SLX-2770-R_2.fastq SLX-2771-R_1.fastq SLX-2771-R_2.fastq SLX-2772-R_1.fastq SLX-2772-R_2.fastq
However no information on what the conditions (meta-data) are and how they relate to our output. Do we have anything written down about it in your thesis or elsewere?
This folder also contains scripts, but no meta-data:
/home_old/tjh48/Code/Andy_methylation
Niceeee, so we know that there are what looks like 6 pairs of replicates, which were then aligned and processed through segmentSeq into loci.
I don't have any specific info on conditions besides that it was bisulphite sequencing data from Chlamy. We have what looks like codes for the pipeline, not sure if that helps us at all?
Ok, since it would be easy to take methylation out and we seem to have the raw data, I'll rerun it including it. We can discuss it whilst further analysing / writing up later and chuck it out if necessary.
I'll let you know when it's finished!
Also, just had a look at one of the large loci (the first one below). Looks like a genuine one to me!
> range(gr[width(gr) > 7000])
GRanges object with 7 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chromosome_7 [4783679, 4794939] *
[2] chromosome_8 [5026259, 5033832] *
[3] chromosome_9 [2999280, 3128306] *
[4] chromosome_10 [5243631, 5253468] *
[5] chromosome_12 [1298857, 1307595] *
[6] chromosome_14 [4142920, 4152900] *
[7] chromosome_17 [4976880, 5549258] *
Another one...:
Ok sounds good, and genome browser files look good so we can keep those. Let me know when the annotation is done and I'll get to work on the MCA.
Annotation pipeline including methylation is finished, results are in LociRun2018_multi200_gap100_90c7213
.
All my changes are commited and I won't touch until the new year, so feel free to have a go with the MCA :) I'll explore IGV a bit in the meanwhile thought to get a sense on our results.
Amazing, thanks Seb. I'll work through the MCA and we can catch-up after the break.
Have a great Christmas and New Year!!
Just been playing around, it looks the methylation Data is actually usefull. There is a strong enrichment of CG-methoylation with loci around the centromer. It can be easily seen below (compare loci track with CG track):
Zoomed in on chromosome 6:
Have a good xmas as well :)
Wow, that's great. There's some code at the end of the MCA script to make "chromosome tracks" so that should show-up well on there.
New issue so we can keep a track of running of scripts.
I'm currently test-running the annotation pipeline on the cluster, without PhaseMatch2. I've edited the size classes based on the density plot to more reasonable cut-offs. CountingBiases will also output a load of graphs from which we can modify the cut-offs to ensure they are appropriate. Been running for 20 minutes so no super obvious errors hopefully.
Once this is sorted and any bugs sorted we can run it one final time with the phasing output, then it's just clustering to do!
Which directory would you like me to run condor submissions from? Cautious of clogging up directories with logs and outputs.
Now off to the pub... have a lovely weekend!