jbloomlab / SARS2-mut-fitness

Observed substitution counts of SARS-CoV-2 compared to those expected under the mutation rates
MIT License

ORF9b addition #21

Closed zach-hensel closed 1 year ago

zach-hensel commented 1 year ago

Great work! I made a small local hack to output fitness estimates for ORF9b as a quick check of XBB.1.16, which has a few ORF9b mutations. ORF9b might be an interesting addition to the pipeline.

covSpectrum query: https://cov-spectrum.org/explore/World/AllSamples/AllTimes/variants?nucMutations=T12730A%2CT28297C%2CA28447G&

Changes made:

  1. Modified the GTF file with the following lines, and replaced the download step in the relevant rule with a copy of the locally modified file:
    NC_045512v2 ncbiGenes.genePred  transcript  28284   28577   .   +   .   gene_id "ORF9b.1"; transcript_id "ORF9b.1"; 
    NC_045512v2 ncbiGenes.genePred  exon    28284   28577   .   +   .   gene_id "ORF9b.1"; transcript_id "ORF9b.1"; exon_number "1"; exon_id "ORF9b.1";
    NC_045512v2 ncbiGenes.genePred  CDS 28284   28577   .   +   0   gene_id "ORF9b"; transcript_id "1"; exon_number "1"; exon_id "ORF9b.1";
    NC_045512v2 ncbiGenes.genePred  start_codon 28284   28286   .   +   0   gene_id "ORF9b"; transcript_id "ORF9b.1"; exon_number "1"; exon_id "ORF9b.1";
    NC_045512v2 ncbiGenes.genePred  stop_codon  28575   28577   .   +   0   gene_id "ORF9b"; transcript_id "ORF9b.1"; exon_number "1"; exon_id "ORF9b.1";
  2. Modified the first cell in aamut_fitness.py.ipynb so that overlapping N and ORF9b mutations are not removed: .query("not is_overlapping or gene=='N;ORF9b'")
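For anyone reproducing step 2, here is a minimal sketch of what that filter keeps. The column names `is_overlapping` and `gene` are taken from the query above; the toy rows are invented for illustration and are not pipeline data.

```python
import pandas as pd

# Toy stand-in for the mutation table; the rows are made up for illustration.
df = pd.DataFrame(
    {
        "gene": ["N", "N;ORF9b", "ORF1a;ORF1ab", "S"],
        "is_overlapping": [False, True, True, False],
    }
)

# Keep non-overlapping rows, plus overlapping rows whose gene is exactly 'N;ORF9b'.
kept = df.query("not is_overlapping or gene == 'N;ORF9b'")
print(kept["gene"].tolist())
```

Rows in other overlapping regions (here the hypothetical `ORF1a;ORF1ab` row) are still dropped; only the N/ORF9b overlap is let through.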

Result: Only examined briefly. ORF9b I5T in XBB.1.16 ranks highly.


jbloom commented 1 year ago

Great, thanks @zach-hensel. Marc Johnson had also been asking about ORF9b.

I am going to make a pull request that adds this into the pipeline. Just adding a few more notes re ORF9b mostly for myself while doing this:

My pull request will automate the GTF modification that you did manually above, so that it fits into the workflow of the larger pipeline.
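As a rough sketch of what automating that step could look like (this is not the code in the pull request): the coordinates are the ones posted above, while the function names and the simplified, uniform `gene_id`/`transcript_id` values are assumptions made for illustration.

```python
def orf9b_gtf_lines(start=28284, end=28577, seqname="NC_045512v2",
                    source="ncbiGenes.genePred", name="ORF9b"):
    """Build tab-delimited GTF records for ORF9b (hypothetical helper).

    Coordinates default to those posted in this issue; the attribute
    values are a simplification, not necessarily what the pipeline uses.
    """
    attrs = f'gene_id "{name}"; transcript_id "{name}";'
    exon_attrs = f'{attrs} exon_number "1"; exon_id "{name}.1";'
    rows = [
        ("transcript", start, end, ".", attrs),
        ("exon", start, end, ".", exon_attrs),
        ("CDS", start, end, "0", exon_attrs),
        ("start_codon", start, start + 2, "0", exon_attrs),
        ("stop_codon", end - 2, end, "0", exon_attrs),
    ]
    return [
        "\t".join([seqname, source, feature, str(s), str(e), ".", "+", frame, a])
        for feature, s, e, frame, a in rows
    ]


def add_orf9b(gtf_text):
    """Append the ORF9b records to GTF text unless they are already present."""
    if 'gene_id "ORF9b' in gtf_text:
        return gtf_text
    if gtf_text and not gtf_text.endswith("\n"):
        gtf_text += "\n"
    return gtf_text + "\n".join(orf9b_gtf_lines()) + "\n"
```

Making the append idempotent means the rule can be rerun on an already modified file without duplicating records.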

zach-hensel commented 1 year ago

Awesome and thank you for double checking. The nsp numbers here also come indirectly from me hastily copying some things and I overlooked the frameshift.

I am not working with this anymore, so here are a few other observations. First, nsp6 L37 would be interesting to unmask. I suppose it's masked because of a combination of occurring early and artifacts in Orf1a sequencing. Second, I made a quick script to rank mutations from the CSV one can export from a cov-spectrum query and looked at XBB.1.16. A reversion of one of the three nucleotides mutated in BA.2 for Orf6 61 popped up with a major fitness increase, and you may want to mask this site as it appears to be an artifact. Lastly, it might be interesting to look at predicted fitness of RecCA and particularly for U-to-C mutations in that direction. Orf8 L84S is also interesting in that respect.

jbloom commented 1 year ago

Great, thanks. I will look at the ORF6 site. For the other masked mutations, including L84S, I had to mask all mutations to the Wuhan-Hu-1 reference, as nearly all of them have unrealistically high counts that likely indicate some sort of bioinformatics issue, such as calling uncovered sites as reference.

jbloom commented 1 year ago

OK, I added estimates and summarized results here: https://twitter.com/jbloom_lab/status/1636470443493449728