hollygene / CornellPostdoc

Various scripts I used during my tenure as a Postdoctoral Associate at Cornell University
0 stars 0 forks source link

Microbial GWAS - want to correlate resistance phenotypes with genotypes/SNPs/other mutations #9

Open hollygene opened 3 years ago

hollygene commented 3 years ago

With help from Kristina: Roary Panaroo was ran on the 601 sequences we have phenotypic data for Scoary was then run on the Roary Panaroo output + a tree generated by IQTree

Scoary outputs a separate .csv file for each antibiotic (phenotypic class)

hollygene commented 3 years ago

Screen Shot 2021-04-21 at 4 39 32 PM Example of what this .csv file looks like

hollygene commented 3 years ago

So what do we want to know, and how can we answer these questions?

  1. Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa?
  2. How many genes are significantly associated with antibiotic resistance, per antibiotic class tested?
  3. How many genes are associated with more than one antibiotic/what are they?
  4. What are the functional annotations of the significantly associated genes?
  5. *this will likely need to be combined with treeWAS results - Allele frequency assoc with each resistance gene?
hollygene commented 3 years ago

For question 1:

Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa? Need: scoary output + AMRFinder results + maybe other databases?

- using our AMRFinder results as a "reference," grep the column Non_unique_gene_name and get non-matches

Get list of gene IDs, and search AMRFinder/other databases for these IDs

hollygene commented 3 years ago

Need to get gene IDs of accessory genes:

  1. find all protein sequences for acc genes, select one representative for each gene
  2. write multi fasta file for the protein sequences
  3. use blast-p to assign gene IDs to each protein seq
  4. use blast2go or PANTHER for functional annotation
hollygene commented 3 years ago

odds ratio: whether it is correlated with 1 (resistant) or 0 (susceptible) greater than 1: significantly associated with resistance less than 1: significantly associated with susceptibility

hollygene commented 3 years ago
  1. get protein sequences for all accessory genes
  2. write multi fasta file for the protein sequences

Used script from Kristina: https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/panaroo_protein_fasta_out_kristina.R#L1

Input:

"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_data.csv"
"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_presence_absence_roary.csv"

Output: /Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/dogEcoli_acc_proteins_out.fasta

hollygene commented 3 years ago
  1. use blast-p to assign gene IDs to each protein sequence

https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/blastp.sh#L1

Input: dogEcoli_acc_proteins_out.fasta (from R script above) Output: dog_verified_host_prots_tab.out (in tab-delimited format)

hollygene commented 3 years ago

blastp specific command: https://github.com/hollygene/CornellPostdoc/blob/41c55b7b4f4e5378a4465d70d1dd5ba0bf2e836f/blastp.sh#L11-L12

https://github.com/hollygene/CornellPostdoc/blob/41c55b7b4f4e5378a4465d70d1dd5ba0bf2e836f/blastp.sh#L14-L29

Then filter in bash

hollygene commented 3 years ago

Bash filtering: Wanted to get the best hit for each unique gene sort by field one and field two (numeric, reverse) so that min for each key will be top of the group, pick the first for each key by the second sort. https://github.com/hollygene/CornellPostdoc/blob/41c55b7b4f4e5378a4465d70d1dd5ba0bf2e836f/blastp.sh#L33

hollygene commented 3 years ago

From this file, I took the first two columns (qseqid and sseqid) and pasted them into Excel. I used text to columns to separate sseqid by _, then deleted everything except gene ID and species. I then renamed the columns "PanGene" "GeneID" and "Org"

hollygene commented 3 years ago

I actually redid this a different way because I wasn't confident that the first way I did it was correct

so I did this:

sort -k1,1 -k15,15nr -k14,14n dog_verified_host_prots_tab_more.out > test1.txt
sort -u -k1,1 test1.txt > test.txt

The first sort orders the blast output by query name then by the 12th column in descending order (bit score - I think), then by 11th column ascending (evalue I think). The second sort picks the first line from each query. Obviously you can skip the first sort if the output is already sorted in the 'correct' order.

hollygene commented 3 years ago

To analyze Scoary output, I'm using R

I first loaded in all of the Scoary output .csv files into one list in R https://github.com/hollygene/CornellPostdoc/blob/c1e8c447b6991f891b1b452b12679c91ceecc062/scoaryViz.Rmd#L22

I then filtered by Empirical p value (indicating the gene is significantly associated with something) cutoff <0.05 https://github.com/hollygene/CornellPostdoc/blob/c1e8c447b6991f891b1b452b12679c91ceecc062/scoaryViz.Rmd#L54

Then I filtered based on Odds ratio > 1 (indicating the gene is associated with resistance) https://github.com/hollygene/CornellPostdoc/blob/c1e8c447b6991f891b1b452b12679c91ceecc062/scoaryViz.Rmd#L55

hollygene commented 3 years ago

I created a function to take an antibiotic as input and spit out a fasta file of all of the nucleotide sequences of the genes that are significantly associated with resistance

https://github.com/hollygene/CornellPostdoc/blob/c1e8c447b6991f891b1b452b12679c91ceecc062/scoaryViz.Rmd#L148-L152

hollygene commented 3 years ago

We found that several of the antibiotics tested had no significantly associated genes from Scoary and the reason behind this was the sample size was too low.

We quantified the sample sizes for each antibiotic: <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

Antibiotic | Phenotypic Datapoints | # Sig Assoc Genes from Scoary -- | -- | -- Oxacillin.INT | 1 | - Polymyxin.B.INT | 1 | - Amoxicillin.INT | 9 | 0 Penicillin.G.INT | 9 | 0 Oxacillin...2..NaCl.INT | 10 | 0 Penicillin.INT | 11 | 0 Neomycin.INT | 16 | 0 Piperacillin.INT | 16 | 117 Tobramycin.INT | 17 | 34 Nitrofurantoin.INT | 25 | 0 Clindamycin.INT | 27 | 0 Erythromycin.INT | 34 | 0 Ceftiofur.INT | 37 | 278 Cephalexin.INT | 37 | 2 Ticarcillin.Clavulanic.Acid.INT | 48 | 220 Ticarcillin.INT | 51 | 160 Cephalothin.INT | 63 | 76 Cefoxitin.INT | 96 | 216 Pradofloxacin.INT | 150 | 427 Cefovecin.INT | 272 | 680 Cefalexin.INT | 277 | 619 Ceftazidime.INT | 323 | 424 Piperacillin.Tazobactam.INT | 346 | 157 Orbifloxacin.INT | 387 | 649 Doxycycline.INT | 416 | 895 Marbofloxacin.INT | 445 | 781 Cefpodoxime.INT | 448 | 780 Chloramphenicol.INT | 452 | 434 Cefazolin.INT | 461 | 816 Imipenem.INT | 474 | 235 Amikacin | 497 | 312 Enrofloxacin.INT | 503 | 610 Ampicillin.INT | 509 | 894 Tetracycline.INT | 527 | 1117 Amoxicillin.Clavulanic.Acid.INT | 531 | 750 Gentamicin.INT | 560 | 573 Trimethoprim.Sulfamethoxazole.INT | 586 | 704

hollygene commented 3 years ago

From this, we decided that our cutoff would be 100 samples per antibiotic minimum, that leaves us with 19 antibiotics

hollygene commented 3 years ago

Cefalexin and Cephalexin are both included in the antibiotics, but actually are the same thing just different spellings, so I combined them

hollygene commented 3 years ago

Want to find what common genes were found between Scoary and AMRFinder, so using the AMRFinder output from the Bioprojects

4

But, several samples were missing from the original output of that (57 samples) So created a list of those accession numbers and pulled them from NCBI Ran AMRFinder on those

Now concatenating that output with the original AMRFinder output (the 554 samples we did have) to get a final list of AMRFinder output for comparison with Scoary outputs

hollygene commented 3 years ago

One sample was left off because it didn't have an accession number This seems to be because it had been through several sequencing rounds

The sample:

6 3 PA-Ryan 2018-07-09 5135313   HGAP v. 4.0   Complete Genome   562 NA NA NA NA 562         4857938   3.8.4 PRJNA324573 COMBINED 2020-06-11.1 PDG000000004.2233 capU=COMPLETE,espX1=COMPLETE,fdeC=COMPLETE,iss=COMPLETE,sslE=HMM,ybtP=COMPLETE,ybtQ=COMPLETE Escherichia coli NA Canis lupus familiaris aac(3)-IId=COMPLETE,aac(6')-Ib-cr5=COMPLETE,aadA22=COMPLETE,aadA2=COMPLETE,aadA5=COMPLETE,blaCTX-M-15=COMPLETE,blaEC=COMPLETE,blaNDM-5=COMPLETE,blaOXA-1=COMPLETE,blaTEM-1=COMPLETE,ble=COMPLETE,catB3=PARTIAL,dfrA12=COMPLETE,dfrA17=COMPLETE,floR=COMPLETE,gyrA_D87N=POINT,gyrA_S83L=POINT,mph(A)=COMPLETE,parC_S80I=POINT,parE_S458A=POINT,sul1=COMPLETE,tet(A)=COMPLETE ECOL-18-VL-LA-PA-Ryan-0026 NA PDT000545912.1 2019-07-16T12:52:12Z USA:PA trach wash environmental/other PDS000046501.12 2 11 SAMN11230749 GCA_007012305.1 aac(3)-IId=COMPLETE,aac(6')-Ib-cr5=COMPLETE,aadA22=COMPLETE,aadA2=COMPLETE,aadA5=COMPLETE,acrF=COMPLETE,blaCTX-M-15=COMPLETE,blaEC=COMPLETE,blaNDM-5=COMPLETE,blaOXA-1=COMPLETE,blaTEM-1=COMPLETE,ble=COMPLETE,catB3=PARTIAL,dfrA12=COMPLETE,dfrA17=COMPLETE,floR=COMPLETE,gyrA_D87N=POINT,gyrA_S83L=POINT,mdtM=COMPLETE,mph(A)=COMPLETE,parC_S80I=POINT,parE_S458A=POINT,sul1=COMPLETE,tet(A)=COMPLETE GCA_007012305.1_ASM701230v1_genomic_prokka

The associated links to Bioprojects and accessions: https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=11230749

The one I used: https://www.ncbi.nlm.nih.gov/sra/SRX5557610[accn]

I chose this one because it was a MiSeq run, not PacBio

hollygene commented 3 years ago

I manually added a column to the .csv file with the Assembly (GCA_007012305.1) so that I could match it to the other file I already had in R

hollygene commented 3 years ago

Need to convert between IDs that panaroo assigns to more useful Gene IDs that can be used in a reference database Probably want more than one type of identifier - maybe gene symbol and refseq acc

hollygene commented 3 years ago

For getting gene identifiers:

  1. use Kristina's script to get a fasta file of all protein sequences from panaroo output https://github.com/hollygene/CornellPostdoc/blob/e242d67cdf70e598e254d8fe1c9a6e88eb8d8d34/panaroo_protein_fasta_out_kristina.R
  2. take this fasta file and throw it into blastp with the following parameters: blastp -outfmt "6 qseqid sseqid sacc sgi qaccver pident ssciname length mismatch gapopen qstart qend sstart send evalue bitscore" \ -query /workdir/hcm59/Ecoli/SNPs/dog_verified_host/ecoli_all_proteins_out.fasta -db nr -out ./ecoli_all_prot_sci_Name.out -num_threads 36 -max_target_seqs 5
  3. make sure the output is sorted by best matches for each protein
  4. filter out the first option for each protein: awk -F"\t" '!_[$1]++' ecoli_acc_proteins_blastp_accVer.out > ecoli_acc_proteins_blastp_accVer_unique.out
  5. use this output as a gene key in R