Open hollygene opened 3 years ago
Step 1: Produce a file of accession numbers for all of the sequences we have phenotypic data for
Step 2: Download the sequences from NCBI
Step 3: Create unmapped BAMs from FASTQs
Step 4: Mark Illumina adapters
Step 6: Convert from SAM to FASTQ
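A rough sketch of steps 2, 4 and 6 as shell functions, in case it helps to see them side by side. Assumptions that are not in the notes above: fasterq-dump (SRA Toolkit) as the download tool, the output file names, and the DRY_RUN switch and run helper, which are hypothetical conveniences for previewing commands. The Picard arguments follow the usual GATK uBAM preprocessing pattern.

```shell
#!/usr/bin/env bash
# Sketch only: tool choice for downloading, file names and DRY_RUN are assumptions.
PICARD=${PICARD:-/programs/picard-tools-2.19.2/picard.jar}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 2: download paired FASTQs for every accession (one per line in $1).
# fasterq-dump --split-files writes <accession>_1.fastq and <accession>_2.fastq.
download_accessions() {
  local list=$1 outdir=$2 acc
  while IFS= read -r acc || [ -n "$acc" ]; do
    [ -z "$acc" ] && continue          # skip blank lines
    run fasterq-dump --split-files -O "$outdir" "$acc"
  done < "$list"
}

# Step 4: mark adapter sequence in an unmapped BAM (adds the XT tag).
mark_adapters() {
  local sample=$1 dir=$2
  run java -jar "$PICARD" MarkIlluminaAdapters \
    I="${dir}/${sample}_fastqtosam.bam" \
    O="${dir}/${sample}_markilluminaadapters.bam" \
    M="${dir}/${sample}_markilluminaadapters_metrics.txt"
}

# Step 6: convert back to interleaved FASTQ, clipping the marked adapters.
sam_to_fastq() {
  local sample=$1 dir=$2
  run java -jar "$PICARD" SamToFastq \
    I="${dir}/${sample}_markilluminaadapters.bam" \
    FASTQ="${dir}/${sample}_interleaved.fastq" \
    CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 \
    INTERLEAVE=true NON_PF=true
}

# Example (preview only):
#   DRY_RUN=1 download_accessions accessions.txt "$raw_data"
#   DRY_RUN=1 mark_adapters SAMPLE "$unmapped_bams"
```

Running with DRY_RUN=1 first is a cheap way to check paths before launching a long download or conversion.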
Need a reference genome for the next step. Asking Stanhope/lab Slack for recommendations on which version/strain to use
Stanhope recommends using the PANTHER database reference genome
Escherichia coli | E. coli | ECOLI | EnsemblGenome | Reference Proteome 2020_04
https://www.ebi.ac.uk/reference_proteomes/
the E coli reference:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/Bacteria/UP000000625_83333.fasta.gz
^ that file is actually a proteome (protein sequences, not nucleotide), so GATK didn't work with it
Need a GENOME: ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/
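To avoid repeating the proteome mix-up, a quick sanity check on a downloaded FASTA might look like the sketch below. The function name fasta_type and the heuristic itself (any letter outside the IUPAC nucleotide alphabet in the first 200 sequence lines means protein) are my own assumptions, not something GATK provides.

```shell
#!/usr/bin/env bash
# Guess whether a FASTA file (plain or .gz) holds nucleotide or protein sequence.
# Heuristic only: protein records contain letters such as E, F, I, L, P, Q
# that are not valid IUPAC nucleotide codes.
fasta_type() {
  local fasta=$1 reader=cat
  case "$fasta" in
    *.gz) reader="gzip -dc" ;;         # read compressed files transparently
  esac
  if $reader "$fasta" | tr -d '\r' | grep -v '^>' | head -n 200 \
      | grep -qi '[^ACGTUNRYKMSWBDHV*-]'; then
    echo protein
  else
    echo nucleotide
  fi
}

# Example:
#   fasta_type UP000000625_83333.fasta.gz
```

Only the first 200 sequence lines are inspected, so this stays fast on large files but could in principle misjudge an unusual file.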
How to pick the best reference genome?
We want to find SNPs that are associated with particular resistance phenotypes, so ideally the reference would not be resistant to any antibiotics (abx): a true "wild type" genome.
However, we could also build a consensus sequence from our samples and call SNPs against that.
Pros: no need for a possibly very diverged reference sequence
Cons: our dataset is biased because a lot of the samples are resistant
WDL notes
WDL: script that describes the workflow
Cromwell: Java-based job scheduler that can use various backend environments
Run mode & server mode
Dockstore info page
Descriptor file: script in WDL that tells the program what to do, basically
Tools: more info on each task + the Docker container it is using for that task
Test parameters file: file with the input files (can make this manually)
Launch tab: actual commands for running the file
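As a sketch of how the pieces above fit together: the test parameters file is just a JSON file of inputs, and Cromwell's two modes are two subcommands of the same jar. The workflow name MyWorkflow, its input keys, and the function names below are hypothetical placeholders; the real keys come from the descriptor (WDL) file.

```shell
#!/usr/bin/env bash
# Write a test parameters (inputs) file by hand.
# "MyWorkflow" and its input names are hypothetical; use the names from your WDL.
write_inputs() {
  local out=$1 ref=$2 bam=$3
  cat > "$out" <<EOF
{
  "MyWorkflow.ref_fasta": "${ref}",
  "MyWorkflow.input_bam": "${bam}"
}
EOF
}

# Run mode: execute a single workflow and exit.
cromwell_run() {
  local jar=$1 wdl=$2 inputs=$3
  java -jar "$jar" run "$wdl" --inputs "$inputs"
}

# Server mode: start Cromwell as a long-running service that accepts submissions.
cromwell_server() {
  local jar=$1
  java -jar "$jar" server
}

# Example:
#   write_inputs inputs.json ref.fasta sample.bam
#   cromwell_run cromwell.jar workflow.wdl inputs.json
```

Run mode is probably the simpler choice for one-off tests; server mode makes more sense once many samples are being submitted.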
Decided to choose a reference genome from an isolate with a canine host
Genome chosen: https://www.ncbi.nlm.nih.gov/assembly/GCA_002310695.1#/def
From: https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/167/ (filtered by host organism + complete genome)
Strain: 1428
1 chromosome, 4 plasmids
AMR genotypes:
complete: acrF, blaCMY-2, blaEC, mdtM, tet(B)
point: cyaA_S352T
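Once the reference FASTA is downloaded, it needs the usual companion files before alignment and GATK will accept it. A minimal sketch, assuming samtools and bwa are installed (neither is mentioned above) and that the FASTA has already been saved locally; the run helper and DRY_RUN switch are hypothetical conveniences.

```shell
#!/usr/bin/env bash
# Sketch only: tool availability and file names are assumptions.
PICARD=${PICARD:-/programs/picard-tools-2.19.2/picard.jar}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build the index files that BWA and GATK expect next to the reference FASTA.
prepare_reference() {
  local ref=$1
  run samtools faidx "$ref"            # writes <ref>.fai
  run bwa index "$ref"                 # writes the BWA index files
  run java -jar "$PICARD" CreateSequenceDictionary \
    R="$ref" O="${ref%.*}.dict"        # writes the sequence dictionary
}

# Example (preview only):
#   DRY_RUN=1 prepare_reference reference.fasta
```

The dictionary is named by swapping the FASTA extension for .dict, which is the naming GATK looks for alongside the reference.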
Creating unmapped bams from fastq files
# Loop over every forward-read file; the mate is expected at <sample>_2.fastq
for file in "${raw_data}"/*_1.fastq
do
  # Strip the directory and the _1.fastq suffix to get the sample name
  BASE=$(basename "$file" _1.fastq)
  java -jar /programs/picard-tools-2.19.2/picard.jar FastqToSam \
    FASTQ="${raw_data}/${BASE}_1.fastq" \
    FASTQ2="${raw_data}/${BASE}_2.fastq" \
    OUTPUT="${unmapped_bams}/${BASE}_fastqtosam.bam" \
    READ_GROUP_NAME="${BASE}" \
    SAMPLE_NAME="${BASE}"
done
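Because the loop above assumes every _1.fastq has a matching _2.fastq, it may be worth checking for incomplete downloads first. A small sketch; the function name missing_mates is hypothetical.

```shell
#!/usr/bin/env bash
# List sample names whose forward-read file has no reverse-read mate in the directory.
missing_mates() {
  local dir=$1 r1 base
  for r1 in "$dir"/*_1.fastq; do
    [ -e "$r1" ] || continue           # glob matched nothing
    base=$(basename "$r1" _1.fastq)
    [ -e "${dir}/${base}_2.fastq" ] || echo "$base"
  done
}

# Example:
#   missing_mates "$raw_data"
```

An empty result means every sample is paired; any name printed would need to be downloaded again (or handled as single-end) before running FastqToSam.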
Using GATK Best Practices