hollygene / CornellPostdoc

Various scripts I used during my tenure as a Postdoctoral Associate at Cornell University
0 stars 0 forks source link

SNP calling #11

Open hollygene opened 3 years ago

hollygene commented 3 years ago

Using GATK Best Practices

hollygene commented 3 years ago

https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/variantGATK.sh

hollygene commented 3 years ago

Step 1: Produce file of accession numbers from all of the sequences we have phenotypic data for

https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/misc_scripts.sh#L11

hollygene commented 3 years ago

Step 2: Download the sequences from NCBI

Using: https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/get_SRR_data.sh#L4-L9

hollygene commented 3 years ago

Step 3: Create unmapped bams from fastqs

Using: https://github.com/hollygene/CornellPostdoc/blob/9dfb02780e0c640dcc6758e847f5b5265adaea64/variantGATK.sh#L12-L28

hollygene commented 3 years ago

Step 4: Mark Illumina adapters

https://github.com/hollygene/CornellPostdoc/blob/41c55b7b4f4e5378a4465d70d1dd5ba0bf2e836f/variantGATK.sh#L36-L53

hollygene commented 3 years ago

Step 5: Validate Sam File

https://github.com/hollygene/CornellPostdoc/blob/41c55b7b4f4e5378a4465d70d1dd5ba0bf2e836f/variantGATK.sh#L61-L72

hollygene commented 3 years ago

Step 6: Convert from Sam to Fastq

https://github.com/hollygene/CornellPostdoc/blob/41c55b7b4f4e5378a4465d70d1dd5ba0bf2e836f/variantGATK.sh#L78-L94

hollygene commented 3 years ago

Need a reference genome for next step - asking Stanhope/lab slack for recommendations on which version/strain to use

hollygene commented 3 years ago

Stanhope recommends using the PANTHER database reference genome

Escherichia coli | E. coli | ECOLI | EnsemblGenome | Reference Proteome 2020_04

https://www.ebi.ac.uk/reference_proteomes/

the E coli reference:

ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/Bacteria/UP000000625_83333.fasta.gz

hollygene commented 3 years ago

^ that is actually a proteome so gatk didn't work

Need a GENOME: ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/

hollygene commented 3 years ago

How to pick the best reference genome?

We are wanting to find SNPs that are associated with particular resistance phenotypes So ideally the reference would not be resistant to any abx A true "wild type" genome

However, we could also use a consensus sequence and call SNPs in samples from that Pros: no need for a possibly very diverged reference sequence Cons: Our dataset is biased because a lot of them are resistant

hollygene commented 3 years ago

WDL notes WDL: script that describes the workflow Cromwell: Java-based job scheduler that can use various backend environments Run mode & server mode

hollygene commented 3 years ago

Dockstore info page Descriptor file: script in wdl that tells the program what to do, basically tools: more info on each task + Docker container it is using for that task test parameters file: file with the input files (can make this manually) Launch tab: actual commands for running the file

hollygene commented 3 years ago

Decided to choose a reference genome that is from a canine

Genome chosen: https://www.ncbi.nlm.nih.gov/assembly/GCA_002310695.1#/def From: https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/167/ (filtered by host organism + complete genome) Strain: 1428 1 chromosome, 4 plasmids

AMR Genotypes: complete: acrF, blaCMY-2, blaEC,mdtM,tet(B) point: cyA_S352T

hollygene commented 3 years ago

Creating unmapped bams from fastq files

for file in ${raw_data}/*_1.fastq

do

FBASE=$(basename $file _1.fastq)
BASE=${FBASE%_1.fastq}
java -jar /programs/picard-tools-2.19.2/picard.jar FastqToSam \
    FASTQ=${raw_data}/${BASE}_1.fastq \
    FASTQ2=${raw_data}/${BASE}_2.fastq  \
    OUTPUT=${unmapped_bams}/${BASE}_fastqtosam.bam \
    READ_GROUP_NAME=${BASE} \
    SAMPLE_NAME=${BASE}

done