Open hollygene opened 3 years ago
Example of what this .csv file looks like
So what do we want to know, and how can we answer these questions?
For question 1:
Which genes are known already to be associated with resistance, and which are novel? Do we have any genes that are associated with one antibiotic in our dataset but a different antibiotic in the databases, or vice versa? Need: scoary output + AMRFinder results + maybe other databases?
- Using our AMRFinder results as a "reference," grep the column `Non_unique_gene_name` and get non-matches
Get list of gene IDs, and search AMRFinder/other databases for these IDs
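The comparison above could also be sketched in Python as a set lookup. This is a minimal sketch, not the workflow actually used: the function name `split_known_novel` and the example gene names are illustrative assumptions, and it assumes the gene names have already been pulled out of the Scoary `Non_unique_gene_name` column and the AMRFinder results as plain strings.

```python
def split_known_novel(scoary_genes, amr_genes):
    """Split Scoary gene names into those found in a reference list and those not.

    Matching is case-insensitive and exact (no partial matches), which is
    stricter than a plain grep. Both arguments are iterables of strings.
    """
    reference = {g.strip().lower() for g in amr_genes if g.strip()}
    known, novel = [], []
    for gene in scoary_genes:
        name = gene.strip()
        if not name:
            continue  # skip blank names
        (known if name.lower() in reference else novel).append(name)
    return known, novel


# Illustrative gene names only, not results from this dataset.
known, novel = split_known_novel(["tetA", "group_123", "blaTEM"], ["teta", "blaTEM"])
print(known)  # ['tetA', 'blaTEM']
print(novel)  # ['group_123']
```

Exact matching will miss names that differ only by an allele suffix or similar, so a looser match may be needed depending on how the reference names are written.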
Need to get gene IDs of accessory genes:
Odds ratio: whether the gene is correlated with 1 (resistant) or 0 (susceptible). Significance itself comes from the p-value, not the odds ratio.
- greater than 1: associated with resistance
- less than 1: associated with susceptibility
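For reference, the standard odds ratio can be computed from the four presence/absence counts for a gene. This is a small sketch of the textbook definition with made-up counts; I have not checked how Scoary itself handles zero counts, and the function name `odds_ratio` is just illustrative.

```python
def odds_ratio(pos_present, neg_present, pos_absent, neg_absent):
    """Odds ratio for a gene from a 2x2 table of isolate counts.

    pos = resistant (1), neg = susceptible (0); present/absent = gene presence.
    Returns inf when only the denominator is zero, nan when both are zero.
    """
    numerator = pos_present * neg_absent
    denominator = neg_present * pos_absent
    if denominator == 0:
        return float("inf") if numerator > 0 else float("nan")
    return numerator / denominator


# Made-up counts: gene present in 30 resistant and 10 susceptible isolates,
# absent from 20 resistant and 40 susceptible isolates.
print(odds_ratio(30, 10, 20, 40))  # 6.0
```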
Used script from Kristina: https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/panaroo_protein_fasta_out_kristina.R#L1
Input:
"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_data.csv"
"/Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/gene_presence_absence_roary.csv"
Output: /Users/hcm59/Box/Goodman\ Lab/Projects/bacterial\ genomics/Ecoli_dog_AMR_results/dog_verified_host/dogEcoli_acc_proteins_out.fasta
Input: dogEcoli_acc_proteins_out.fasta (from R script above)
Output: dog_verified_host_prots_tab.out (in tab-delimited format)
Bash filtering: I wanted the best hit for each unique gene. Sort by field one, then by field two (numeric, reverse) so that the best hit for each key is at the top of its group, then pick the first line for each key with a second sort. https://github.com/hollygene/CornellPostdoc/blob/41c55b7b4f4e5378a4465d70d1dd5ba0bf2e836f/blastp.sh#L33
From this file, I took the first two columns (qseqid and sseqid) and pasted them into Excel. I used text to columns to separate sseqid by _, then deleted everything except gene ID and species. I then renamed the columns "PanGene" "GeneID" and "Org"
I actually redid this a different way because I wasn't confident that the first way was correct, so I did this:
sort -k1,1 -k15,15nr -k14,14n dog_verified_host_prots_tab_more.out > test1.txt
sort -u -k1,1 test1.txt > test.txt
The first sort orders the BLAST output by query name, then by the 15th column in descending order (bit score, I think), then by the 14th column ascending (e-value, I think). The second sort keeps the first line for each query. Obviously you can skip the first sort if the output is already sorted in the 'correct' order.
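As a cross-check on the two-step sort, the same "best hit per query" selection could be sketched in Python. This is only a sketch under assumptions: the function name and the column positions are illustrative (the defaults mirror sort keys 15 and 14, which I am assuming hold bit score and e-value), and the demo rows are made up. One reason to parse the numbers explicitly is that `sort -n` does not interpret exponent notation such as 1e-50 (GNU sort has `-g` for that), so an e-value tiebreak done with `-n` may not order as expected.

```python
import csv
import io


def best_hit_per_query(rows, query_col=0, bitscore_col=14, evalue_col=13):
    """Keep one row per query: highest bit score, ties broken by lowest e-value.

    rows is an iterable of lists of strings (one list per BLAST tabular line).
    Column indices are 0-based, so 14 and 13 correspond to sort keys 15 and 14.
    """
    best = {}
    for row in rows:
        key = row[query_col]
        # float() parses scientific notation such as 1e-50 correctly.
        score = (float(row[bitscore_col]), -float(row[evalue_col]))
        if key not in best or score > best[key][0]:
            best[key] = (score, row)
    return [row for _, row in best.values()]


# Made-up four-column demo: query, subject, bit score, e-value.
demo = "geneA\thit1\t50.0\t1e-10\ngeneA\thit2\t80.0\t1e-20\ngeneB\thit3\t40.0\t1e-5\n"
rows = csv.reader(io.StringIO(demo), delimiter="\t")
for row in best_hit_per_query(rows, bitscore_col=2, evalue_col=3):
    print(row[0], row[1])
# geneA hit2
# geneB hit3
```

Comparing the number of rows returned here with the line count of test.txt would be a quick sanity check that both approaches keep one hit per query.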
To analyze Scoary output, I'm using R
I first loaded in all of the Scoary output .csv files into one list in R https://github.com/hollygene/CornellPostdoc/blob/c1e8c447b6991f891b1b452b12679c91ceecc062/scoaryViz.Rmd#L22
I then filtered by Empirical p value (indicating the gene is significantly associated with something) cutoff <0.05 https://github.com/hollygene/CornellPostdoc/blob/c1e8c447b6991f891b1b452b12679c91ceecc062/scoaryViz.Rmd#L54
Then I filtered based on Odds ratio > 1 (indicating the gene is associated with resistance) https://github.com/hollygene/CornellPostdoc/blob/c1e8c447b6991f891b1b452b12679c91ceecc062/scoaryViz.Rmd#L55
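The two filters above (empirical p-value below 0.05, odds ratio above 1) can be sketched outside R as well. This is a minimal Python sketch, not the linked R code: the function name is illustrative, and the column names `Empirical_p` and `Odds_ratio` are assumptions that should be checked against the header of the actual Scoary .csv files. The demo rows are made up.

```python
import csv
import io


def resistance_associated(rows, p_col="Empirical_p", or_col="Odds_ratio",
                          p_cutoff=0.05):
    """Return rows with p < p_cutoff and odds ratio > 1.

    rows is an iterable of dicts (for example from csv.DictReader).
    Rows whose values are missing or cannot be parsed as numbers are skipped.
    """
    kept = []
    for row in rows:
        try:
            p_value = float(row[p_col])
            odds = float(row[or_col])
        except (KeyError, ValueError, TypeError):
            continue
        if p_value < p_cutoff and odds > 1:
            kept.append(row)
    return kept


# Made-up rows; a real Scoary file has many more columns.
demo = io.StringIO(
    "Gene,Odds_ratio,Empirical_p\n"
    "geneA,4.2,0.01\n"
    "geneB,0.3,0.01\n"
    "geneC,5.0,0.20\n"
    "geneD,inf,0.001\n"
)
hits = resistance_associated(csv.DictReader(demo))
print([row["Gene"] for row in hits])  # ['geneA', 'geneD']
```

Using the csv module rather than splitting on commas matters here because annotation fields can themselves contain commas.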
I created a function to take an antibiotic as input and spit out a fasta file of all of the nucleotide sequences of the genes that are significantly associated with resistance
We found that several of the antibiotics tested had no significantly associated genes from Scoary because their sample sizes were too low.
We quantified the sample sizes for each antibiotic:

Antibiotic | Phenotypic Datapoints | # Sig Assoc Genes from Scoary
-- | -- | --
Oxacillin.INT | 1 | -
Polymyxin.B.INT | 1 | -
Amoxicillin.INT | 9 | 0
Penicillin.G.INT | 9 | 0
Oxacillin...2..NaCl.INT | 10 | 0
Penicillin.INT | 11 | 0
Neomycin.INT | 16 | 0
Piperacillin.INT | 16 | 117
Tobramycin.INT | 17 | 34
Nitrofurantoin.INT | 25 | 0
Clindamycin.INT | 27 | 0
Erythromycin.INT | 34 | 0
Ceftiofur.INT | 37 | 278
Cephalexin.INT | 37 | 2
Ticarcillin.Clavulanic.Acid.INT | 48 | 220
Ticarcillin.INT | 51 | 160
Cephalothin.INT | 63 | 76
Cefoxitin.INT | 96 | 216
Pradofloxacin.INT | 150 | 427
Cefovecin.INT | 272 | 680
Cefalexin.INT | 277 | 619
Ceftazidime.INT | 323 | 424
Piperacillin.Tazobactam.INT | 346 | 157
Orbifloxacin.INT | 387 | 649
Doxycycline.INT | 416 | 895
Marbofloxacin.INT | 445 | 781
Cefpodoxime.INT | 448 | 780
Chloramphenicol.INT | 452 | 434
Cefazolin.INT | 461 | 816
Imipenem.INT | 474 | 235
Amikacin | 497 | 312
Enrofloxacin.INT | 503 | 610
Ampicillin.INT | 509 | 894
Tetracycline.INT | 527 | 1117
Amoxicillin.Clavulanic.Acid.INT | 531 | 750
Gentamicin.INT | 560 | 573
Trimethoprim.Sulfamethoxazole.INT | 586 | 704
With help from Kristina:
- Panaroo was run on the 601 sequences we have phenotypic data for
- Scoary was then run on the Panaroo output + a tree generated by IQTree
- Scoary outputs a separate .csv file for each antibiotic (phenotypic class)