rasilab / github_demo

Demo GitHub repo for laboratory research
MIT License
5 stars 1 forks source link

Analyze yeast cytosolic 8x dicodon screen data #3

Open kychen37 opened 2 years ago

kychen37 commented 2 years ago

GanttStart: 2022-03-29

Background

Transformed at high efficiency using Bloom lab's liquid recovery protocol: https://github.com/rasilab/rqc_aggregation_aging/issues/98 Library prepped: https://github.com/rasilab/rqc_aggregation_aging/issues/104

Ideas TODO

Analysis Links

Brief conclusion

kychen37 commented 2 years ago

@rasi

rasi commented 2 years ago

@kychen37 Ok. I remember you showing some data in the short updates, but I don't remember ever us discussing this. Can you update the issue whenever you finish some analysis (I think you used to do this previously?) I can give more useful feedback if I have some time to think about your results than just looking at them for the first time during group meeting.

kychen37 commented 2 years ago

@rasi alignment_statistics

Code: https://github.com/rasilab/rqc_aggregation_aging/blob/master/analysis/deepseq/20220329_exp51_cyto_8x_dicodon/scripts/plot_alignment_statistics.ipynb

kychen37 commented 2 years ago

Prelim analysis, just making a heatmap and starting to look at inserts that are still missing despite good representation:

Code: https://github.com/rasilab/rqc_aggregation_aging/blob/master/analysis/deepseq/20220329_exp51_cyto_8x_dicodon/scripts/plot_insert_mrna_levels.ipynb

kychen37 commented 2 years ago

Note to self: my branches are kind of messed up right now, my most recent analyses were done in master (I had switched to doing these analyses in deepseq and forgot, then continued in master). Pulling from master didn't update the deepseq branch, I may need to 'sync' or something, for now the master branch is the most up to date for the scripts of this analysis only

kychen37 commented 2 years ago

@rasi Update I will revisit this data next week to see what is pertinent for my committee meeting

rasi commented 2 years ago

@kychen37 Post the figures as Issue comments so that we can come back to them easily (the commit links above are enough to figure out where the figure is).

Why is the data not centered around 0 unlike here: https://github.com/rasilab/lab_analysis_code/blob/master/rasi/analysis/deepseq/20210703_pb_8xdicodon_resequence_grna_mrna/scripts/plot_human_8xdicodon_effects_files/figure-gfm/unnamed-chunk-6-1.png?

kychen37 commented 2 years ago

@rasi I've been trying to figure that out but I'm not sure. Could it have to do with the fact that the mean of all the dipeptide lfcs is slightly negative?

rasi commented 2 years ago

@kychen37 The mean and median cannot be this different. First, calculate simple log fold change without bootstrapping and make sure you understand what is going on before doing the bootstrap. The bootstrap is just for error bars.

kychen37 commented 2 years ago

@rasi sorry I forgot that my lfc column had already been median normalized when I calculated the median and said it was zero. The median is actually -1.5 and mean is -1.7

kychen37 commented 2 years ago

After median-normalization, the median = 0 and mean = -0.19

rasi commented 2 years ago

Either way, the above plot should be approximately centered around 0 if everything is done correctly.

rasi commented 2 years ago

Can you also add horizontal lines for each amino acid similar to what is in Phil's paper?

kychen37 commented 2 years ago

@rasi I replotted the dipeptide heatmap using your code because I realized I wasn't looking at missing dipeptides correctly, turns out there is a group of dipeptides that are missing (in grey):

kychen37 commented 2 years ago

@rasi Does bootstrapping involve normalizing lfcs to the mean (I thought it didn't)? My data appears to skew negative (unnormalized lfcs range from -5 to 0.1), and the bootstrap function takes from these. Should I have it take from median-normalized lfcs instead (which are centered on zero)? It didn't look like you had done that in Phil's data so I didn't do it for mine

rasi commented 2 years ago

@kychen37 Ok, your explanation makes sense. It looks like I did not median normalize because my mRNA and gRNA had almost equal counts. But it will be good to do it for yours since your read counts are skewed.

rasi commented 2 years ago

@kychen37 Most of the missing dipeptides are hydrophobic or bulky. Maybe these are just toxic as you noticed in your plasmid library. Can you calculate the bootstrap error bars for the remaining dipeptides, so that you can highlight ones that are atleast 2xSD below median value?

Also, can you make the above plot for all codons? I assume Arg is so destabilizing primarily because it has two rare codons CGG and CGA, but this will be good to see.

kychen37 commented 2 years ago

@rasi I went with a mean-normalization since the bootstrap used mean so I figured I would keep it consistent:

code

rasi commented 2 years ago

The codon plot looks good! Will be useful to spread out the Y-axis a bit.

rasi commented 2 years ago

I recommend making frame plots similar to 1E in Phil's paper and a schematic that mirrors 1A as closely as possible.

kychen37 commented 2 years ago

@kychen37 Most of the missing dipeptides are hydrophobic or bulky. Maybe these are just toxic as you noticed in your plasmid library.

@rasi I'm trying to figure out what the effect of slice(1) is in the human code shown here:

Screen Shot 2022-06-06 at 10 28 06 PM

The way I had plotted dipeptides before (which did not include slice(1) and associated grouping) I get less missing dipeptides than when I include it, I directly compared the two methods here. I think I'm just not really understanding what the slice function is doing to the data and if I need it?

rasi commented 2 years ago

@kychen37: slice(1) takes the first element of each group (https://dplyr.tidyverse.org/reference/slice.html).

It should not drop any dipeptides since you are grouping by dipeptides in the previous step.

kychen37 commented 2 years ago

@rasi

code

rasi commented 2 years ago

Looks good and better than what I might have guessed :-) It will be useful to carefully look at the off-diagonal elements of the above plot to see whether it is noise or something interesting.

Try the same frame plot for codon pairs as well.

Look through the language in Phil's paper and see whether you can place the observations above in the context of what is known about translation in yeast (lot more than in human).

See some of my TODO's at the top comment. Add to them if you think of additional experiments to do (this can be a useful checklist to come back to as you dig deeper).

kychen37 commented 2 years ago

@rasi bootstrapped error bars for dipeptide levels of just the VK/FK dipeptides + some controls. I'm going to show this one because the full dipeptide plot is too big

rasi commented 2 years ago

Sounds good. Add end caps to the error bars to make it a bit easier to see. I would still start showing the full heat map from https://github.com/rasilab/github_demo/issues/3 after you show the single codon and single aa effects.

kychen37 commented 2 years ago

Breakdown of dicodons that are missing from this sequencing run, code

library_missing_from n_dicodons_missing possible reason for the dicodon being missing
gdna only 0
mrna only 108 since it's present in gdna and linkageseq, these mrnas may be severely destabilized, or mrna library was bottlenecked
gdna and mrna only 141 since it's present in linkageseq but not gdna or mrna, this implies either toxicity or bottlenecking
gdna, mrna, and linkage 180 something to do with oligo pool synthesis or cloning of the plasmid library, maybe bottlenecked at e.coli step?
TOTAL 429

For the 180 dicodons that are missing from linkage sequencing, this is the dipeptide breakdown (what proportion of inserts for each dipeptide were missing in linkage sequencing, top ~20):

Screen Shot 2022-06-10 at 3 29 56 PM

For the 180 dicodons that are missing from linkage sequencing of the integrating plasmid library, 130 of these dicodons are also missing in the dual-barcode 2um plasmid library. The dipeptide breakdown of these 130 common missing dicodons (top ~20):

Screen Shot 2022-06-10 at 3 33 04 PM

@rasi

kychen37 commented 2 years ago

Update

@rasi

Next week I will look into endogenous inserts from this sequencing run and plan remaining experiments + paper

kychen37 commented 2 years ago

CODE

Remake Fig1A from Presnyak 2015 to make sure the CSCs I calculated matched there's.

Correlating the Coller lab CSCs with my LFCs

Plotting all data together

kychen37 commented 2 years ago

8x dicodon median-normalized codon average LFCs color-coded by Coller designations of 'unstable' vs 'stable' (aka CSC < 0 or > 0)

Correlating AASC with booststrapped average amino acid LFC

kychen37 commented 2 years ago

@rasi there is a moderate correlation between Coller lab CSCs and my average codon LFCs, there is basically no correlation between Coller lab AASCs and my average amino acid effects though, see comments above

phiburke commented 2 years ago

@Katharine Just wanted to say, that colored codon plot looks really good.

I wouldn't worry too much about the amino acid level effects not matching up well; AA-level effects in your library are probably driven more by localized repeats, whereas they're looking at transcriptome-wide aggregate effect measurements, so it's a very different thing.

Also, if you're doing literature comparisons you should totally compare your data to Gamble2016, Adjacent Codons Act in Concert to Modulate Translation Efficiency in Yeast from Stan Fields' group. That's probably the closest published experimental measurement to what you've done. Also, they only look at protein levels, so even if things don't match up well it might still be interesting (eg. some dicodons effect on mRNA but not protein level).

kychen37 commented 2 years ago

Thanks @phiburke ! My memory of Gamble2016 is that they just found a bunch of rare codons but it's been a while so definitely worth a re-look!

kychen37 commented 2 years ago

Update @rasi In addition to plots above, I am still interested in trying to correlate LFCs with tAI, CAI, and ribosome pause scores.

kychen37 commented 2 years ago

Update & TODO

@rasi

kychen37 commented 2 years ago

@rasi Update

kychen37 commented 2 years ago
kychen37 commented 2 years ago

Update

@rasi

kychen37 commented 2 years ago

Update

kychen37 commented 1 year ago

Summary of different types of data filtering

Read cutoffs based on insert & barcode-level plots

Read cutoffs using Burke2022 code

Screenshot 2022-12-22 at 1 33 59 PM

Minimal read cutoffs -> bootstrap by dipeptide -> filter by SD

Screenshot 2022-12-22 at 1 34 13 PM

calc_lfc_bootstrap <- function(data, indices) { d <- data[indices,] log2(sum(d$barcode_count_mrna)) - log2(sum(d$barcode_count_gdna)) }

wt_diaa_bootstrap_diaa <- wt_dicodon %>% filter(sample_name == 'wt') %>% group_by(diaa) %>% nest() %>% mutate(lfc_boot = map(data, function(df) boot::boot(data=df, statistic=calc_lfc_bootstrap, R=100)$t)) %>% select(-data) %>% mutate(lfc = map_dbl(lfc_boot, mean)) %>% mutate(lfc_sd = map_dbl(lfc_boot, sd)) %>% select(-lfc_boot) %>% ungroup() %>% mutate(lfc_mean = lfc - mean(lfc)) %>% mutate(lfc_med = lfc - median(lfc)) %>% mutate(lfc_max = lfc - max(lfc)) %>% mutate(strain = 'wt') %>% filter(lfc_sd <= 0.25)



@rasi let me know if you have any thoughts on the proper/best way to plot this
kychen37 commented 1 year ago

Destabilized dipeptides vs frameshifts

kychen37 commented 1 year ago

@rasi

code

  • median-normalized frame effects if we want it zero-centered:
  • not normalized: