seb-mueller / chlamy_locus_map

Small RNA Locus Map for Chlamydomonas reinhardtii
GNU General Public License v3.0
1 stars 0 forks source link

Running Annotation #13

Open nmatthews323 opened 5 years ago

nmatthews323 commented 5 years ago

New issue so we can keep a track of running of scripts.

I'm currently test-running the annotation pipeline on the cluster, without PhaseMatch2. I've edited the size classes based on the density plot to more reasonable cut-offs. CountingBiases will also output a load of graphs from which we can modify the cut-offs to ensure they are appropriate. Been running for 20 minutes so no super obvious errors hopefully.

Once this is sorted and any bugs sorted we can run it one final time with the phasing output, then it's just clustering to do!

Which directory would you like me to run condor submissions from? Cautious of clogging up directories with logs and outputs.

Now off to the pub... have a lovely weekend!

nmatthews323 commented 5 years ago

Did a test-run of the MCA script, still lots of decisions on various parameters and settings to be made but initially looks interesting.

I'll continue tidying up this script ready for a full run-through once phasing is done.

nmatthews323 commented 5 years ago

Right, annotation pipeline has finished and looks OK. Annotated file is here: "/segmentation_2018/LociRun2018_multi200_gap100/gr_fdr0.05_41c2431.RData"

Three things I need to consult you on:

  1. Annotation query I want to rerun it as it didn't annotate the TE classes, but that correction has now been made. One quick question before we do - currently, inherited from the arabidopsis code, the predominant 5' letter is calculated as follows: https://github.com/seb-mueller/chlamy_locus_map/blob/287820d5ef585e00b0f52be41a56597b272a2f46/Scripts/chlamy_source_code.R#L460-L471 This doesn't just calculate the first base, but also continues adding predominant bases if they vary significantly from the assumed binomial distribution (or at least I think that's how it works!). You end up with a mix of annotations, some just "A" or "C" etc and others "ACG", "GA" etc. I'm inclined to say it'd be simpler just to let it calculate the predominant first base to keep it simple unless there's a specific reason to keep it like this?
  1. Determining cutoffs for annotations The density plots for the current cut-offs used are in here: [https://github.com/seb-mueller/chlamy_locus_map/tree/master/Plots/DensityPlots]() I think they look OK, but be good to get your thoughts.

  2. Determine annotations to use for the MCA. I've made a table of all the column names that we could use, and I've done a first pass of what should be used. Be good to get your thoughts: [https://github.com/seb-mueller/chlamy_locus_map/blob/master/Annotation2Use.csv]()

nmatthews323 commented 5 years ago

Little added thing, I've been going through the countingbiases function just checking everything works properly. Generally seemed ok once I'd worked through it but I did find that the repetitiveness class calculation had been updated to use classCIs in the Arabidopsis code, so I've copied this over for re-running: https://github.com/seb-mueller/chlamy_locus_map/blob/394c1935ece41cfd0fc2b11e9c34b958f3c8db53/Scripts/chlamy_source_code.R#L545-L557

Got sidelined on this so still working on MCA script. Let me know your thoughts on the things above and I'll re-run the annotation pipeline.

seb-mueller commented 5 years ago

I've numbered the 3 issues above to better refer to it:

  1. Totally agree. It used to be looking only at the first and Tom changed it, but would rather go back to the simpler form. Do you want to give it a shot?

  2. I'll have a closer look soon and update this comment

  3. That's a quite extensive selection. Nice. There is still some redundancy, e.g. ratio20vs21 and ratio20vs21Class etc. Since MCA works on categorical data, we'd have to exclude the numeric ones in favor of the categerical ones (with Class in it's name). That was in fact the whole point on setting thresholds on the density plots. I suppose that's what the PrimaryAnno column is for? If yes, the TRUE rows seem set sensibly and we can roll with it.

seb-mueller commented 5 years ago

Did a test-run of the MCA script, still lots of decisions on various parameters and settings to be made but initially looks interesting.

I'll continue tidying up this script ready for a full run-through once phasing is done.

  1. This looks really structured! A lot better than the arabidopsis one, nice! Also we can use this as confirmation to chuck out a few redundant features as for point 3. above. For example CDS, 5'UTR, 3'UTR and exons are highly correlated, we only need "exons" I'd say.
nmatthews323 commented 5 years ago

Thanks Seb:

1) I've had another look at the code, I misunderstood it - if it returns "AG" that means both A and G differ significantly from the background levels, so it should probably be kept like that, otherwise it arbitrarily returns only one of the significant bases.

2) Thanks!

3) Yep, PrimaryAnno is intended as a column for selecting which annotations to use, the SupAnno are not used for clustering but the clusters are annotated in the heatmap as to how enriched they are with that annotation.

4) Agree with removing redundancy, but shouldn't we keep the distinction between CDS and the UTRs? As differential targeting of UTRs or CDS could be kind of interesting biologically.

seb-mueller commented 5 years ago
  1. Good catch and agreed, I misunderstood too, let's keep it that way than.

  2. True, let's keep CDS and both UTRs in, but we can drop exons

nmatthews323 commented 5 years ago

Great, I'm giving a presentation this morning but will try and send off another annotation run this afternoon.

seb-mueller commented 5 years ago

Good luck! Before you run it again, I'd like to fiddle a bit with the code, have you committed all outstanding changes? Edit: In fact, I'd like to give it a go for running it to better get an for the workflow? ok?

nmatthews323 commented 5 years ago

Sorry only saw the "good luck" message and then got sidelined this afternoon. Happy for you to go ahead, all my updates have been committed. The update to the repetitiveness calculation described above and the addition of the phasing annotation are the main new additions so watch out for bugs in that. I'll focus on the MCA script when I get a chance.

seb-mueller commented 5 years ago
  1. Determining cutoffs for annotations: I've gone through the density plots (in LociRun2018_multi200_gap100_8ab6f64): Thresholds could be improved imo, for example this one should be more like 0.2 and 0.5: SmallvsNormal_0.3_0.7.pdf image

Also the strandbias could be improved, basically the threshold should be located in the valleys or the densitiy plot which usually constitute populations, which is not the case here: image

I'm currently seperating the code to seperate the heavy duty form the threshold setting so it is easier to play around. I'm running it atm and will play around with the thresholds reporting back here soon..

nmatthews323 commented 5 years ago

Looks good, agree on first one, second one needs to be symmetrical ideally. - so 0.2 and 0.8 should definitely be cut-offs for high. Then 0.4 and 0.6 might be reasonable. Ideally probably 0.42 (where the kink is) and 0.58 (where the valley is).

Probably best is to get all the graphs together for sRNA size biases, repetitiveness, strand bias, and locus size distribution then have a quick phone call to confirm the cut-offs?

Below are the plots (one raw, one log) for the size classes with the different cut-offs (0,100,400,1500,3000) as vertical lines, not sure they're that great at the moment, but also not sure where I'd put them...: image

image

I noticed there are a few very large loci:

width(gr[width(gr) > 7000]) [1] 11261 7574 7591 7469 9838 8739 9981 11029 11672 15201

Is this a problem?

seb-mueller commented 5 years ago

True. The strand-bias ones need to be symetrical, and yes, let's talk about them quickly.. Just about to test the code.. are free tomorrow? Maybe in the afternoon (anytime after 2pm)? We can then discuss the other bits above.

nmatthews323 commented 5 years ago

Yep tomorrow is good. 3pm?

On Mon, 10 Dec 2018, 4:37 pm seb-mueller <notifications@github.com wrote:

True. The strand-bias ones need to be symetrical, and yes, let's talk about them quickly.. Just about to test the code.. are free tomorrow? Maybe in the afternoon (anytime after 2pm)? We can then discuss the other bits above.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/seb-mueller/chlamy_locus_map/issues/13#issuecomment-445881937, or mute the thread https://github.com/notifications/unsubscribe-auth/AkvVh4NQqWyV0WtNMA4dB1ggS6JRhabFks5u3o3jgaJpZM4YxBLd .

seb-mueller commented 5 years ago

Had another thing and found the species size classes too complicated, e.g. hard to interpret as well as dependent on too many parameters.

Can you have a look at the last commit?
https://github.com/seb-mueller/chlamy_locus_map/commit/53849af07dfedfc3a0dfa1fe9a5b0fdeb2d4a96f

Especially around this line:

https://github.com/seb-mueller/chlamy_locus_map/blob/53849af07dfedfc3a0dfa1fe9a5b0fdeb2d4a96f/Scripts/chlamy_annotation_pipeline.R#L154

Basically, this is just looking into which sRNA sizes are predominant in each loci.

Also, I've set the threshold as you suggested for standbias and repetetivness, which now looks good I believe. Shall we try another call maybe tomorrow afternoon?

nmatthews323 commented 5 years ago

Looks good, table of predominant 5' base is very informative!

Noticed this comment, not sure what annotation it refers to though: https://github.com/seb-mueller/chlamy_locus_map/blob/53849af07dfedfc3a0dfa1fe9a5b0fdeb2d4a96f/Scripts/chlamy_annotation_pipeline.R#L181

Let's chat quickly, I'm free all afternoon just name a time.

seb-mueller commented 5 years ago

Yes, indeed, got catch, planed to discuss this with you indeed. I'd call you then around 3pm, this time for real :)

nmatthews323 commented 5 years ago

Next steps after discussion 12/12/2018: -Make density plot for phasing (Seb) -Decide if removing Loci (Seb) -Run annotation script final time (Seb) -Run MCA scripts - determine settings (Nick)

seb-mueller commented 5 years ago

Made some progress:

nmatthews323 commented 5 years ago

Thanks, looks great, agree on phasing settings.

Annotation looks good but with one catch, it didn't compute methylation as it's commented out. I think it's just this line that needs running:

https://github.com/seb-mueller/chlamy_locus_map/blob/3f6b57abf2438e5f4b2b27eb988f473dd0db9523/Scripts/chlamy_annotation_pipeline.R#L130

Could either re-run the whole thing or just make a small update to the object, not sure how best to do it with the versioning/git fingerprints though? Can I leave this with you?

seb-mueller commented 5 years ago

Ahh, that's what I forgot to ask... I did indeed take it out since I couldn't find the orginal raw data for it, do you have an idea where to find this? Also, did we already know if the methylation got back useful results in the first place, i.e. did you have a look at this before and found associations?

nmatthews323 commented 5 years ago

There were two kinds of methylation data originally. The first one was a non-specific methIP which we decided to exclude early on as it was of unclear origin and old. The other data is bisulfite sequencing derived data from which Tom called enriched loci. It's of pretty low coverage by the looks of it and the CHG and CHH have very low numbers of loci but CG looked to get some good associations last time around.

The files are in: "/projects/nick_matthews/resources/meth_data". The methylation function should (hopefully) work with this.

Around for a call if you need more info.

seb-mueller commented 5 years ago

All right! Just had a look, this folder only contains the processed gff files. To you know where the original fastq files are? Or how those gff were created anyway?

nmatthews323 commented 5 years ago

Having a look for more info now!

nmatthews323 commented 5 years ago

Ok, got more information. It's from bisulphite sequencing data produced by Andy (Bassett?). Digging in Tom's file system you can find the folder under: "/home_old/tjh48/Code/Andy_methylation"

In there it looks like Tom used yama to align the bisulphite sequencing data.

He then used segmentSeq to call loci ("processMeth_CG.R") and selected loci with FDR=0.1 ("#lociSelect.R#"). Looks like he just used the first, heuristics based functionality of segmentSeq and didn't do the baysian step (not sure it'd work with this data anyway).

Buttt I can't find the raw data for this anywhere - the fastq files referred to in "yama.R" are no longer in the folder.

seb-mueller commented 5 years ago

I've found the raw data:

sm934@node13.plantsci.internal.cam.ac.uk/projects/tjh48/Andy_methylation/fastq☺➤ls *.fastq                                                                                                                                                                              [INS]
SLX-2767-R_1.fastq  SLX-2767-R_2.fastq  SLX-2768-R_1.fastq  SLX-2768-R_2.fastq  SLX-2769-R_1.fastq  SLX-2769-R_2.fastq  SLX-2770-R_1.fastq  SLX-2770-R_2.fastq  SLX-2771-R_1.fastq  SLX-2771-R_2.fastq  SLX-2772-R_1.fastq  SLX-2772-R_2.fastq

However no information on what the conditions (meta-data) are and how they relate to our output. Do we have anything written down about it in your thesis or elsewere?

This folder also contains scripts, but no meta-data: /home_old/tjh48/Code/Andy_methylation

nmatthews323 commented 5 years ago

Niceeee, so we know that there are what looks like 6 pairs of replicates, which were then aligned and processed through segmentSeq into loci.

I don't have any specific info on conditions besides that it was bisulphite sequencing data from Chlamy. We have what looks like codes for the pipeline, not sure if that helps us at all?

seb-mueller commented 5 years ago

Ok, since it would be easy to take methylation out and we seem to have the raw data, I'll rerun it including it. We can discuss it whilst further analysing / writing up later and chuck it out if necessary.

I'll let you know when it's finished!

seb-mueller commented 5 years ago

chlamy_large_loci

Also, just had a look at one of the large loci (the first one below). Looks like a genuine one to me!

> range(gr[width(gr) > 7000])
GRanges object with 7 ranges and 0 metadata columns:
           seqnames             ranges strand
              <Rle>          <IRanges>  <Rle>
  [1]  chromosome_7 [4783679, 4794939]      *
  [2]  chromosome_8 [5026259, 5033832]      *
  [3]  chromosome_9 [2999280, 3128306]      *
  [4] chromosome_10 [5243631, 5253468]      *
  [5] chromosome_12 [1298857, 1307595]      *
  [6] chromosome_14 [4142920, 4152900]      *
  [7] chromosome_17 [4976880, 5549258]      *
seb-mueller commented 5 years ago

Another one...: chlamy_large_loci2

nmatthews323 commented 5 years ago

Ok sounds good, and genome browser files look good so we can keep those. Let me know when the annotation is done and I'll get to work on the MCA.

seb-mueller commented 5 years ago

Annotation pipeline including methylation is finished, results are in LociRun2018_multi200_gap100_90c7213. All my changes are commited and I won't touch until the new year, so feel free to have a go with the MCA :) I'll explore IGV a bit in the meanwhile thought to get a sense on our results.

nmatthews323 commented 5 years ago

Amazing, thanks Seb. I'll work through the MCA and we can catch-up after the break.

Have a great Christmas and New Year!!

seb-mueller commented 5 years ago

Just been playing around, it looks the methylation Data is actually usefull. There is a strong enrichment of CG-methoylation with loci around the centromer. It can be easily seen below (compare loci track with CG track): igv_snapshot_chlamy_cgmet_wholegenome

Zoomed in on chromosome 6: igv_snapshot_chlamy_cgmet_centromer2

Have a good xmas as well :)

nmatthews323 commented 5 years ago

Wow, that's great. There's some code at the end of the MCA script to make "chromosome tracks" so that should show-up well on there.