Open rx32940 opened 5 years ago
1251 WGS SRA records were extracted from NCBI; 1037 of them are unique isolates.
For isolates with a collection date: of the 240 SRA records extracted from NCBI, only 182 unique isolates were identified.
By combining the unique isolates found only in the SRA records, only in PATRIC, and in both databases, I was able to identify 388 unique isolates with a collection date.
/scratch/rx32940/assembled_51/{biosample_id}/scaffolds.fasta
Note: 313 isolates from the Picardeau lab's 2019 paper were also included in the 644 isolates
4 PATRIC assemblies had no assembly records on NCBI, leaving 633 PATRIC assemblies (w/ and w/o collection date):
all PATRIC assemblies were downloaded from NCBI's Assembly database at their latest version (as of 1/22/20) into the folder:
/scratch/rx32940/PATRIC_assemblies_633/ncbi-genomes-2020-01-22
because all assembly files were named by their assembly accession, I renamed every file to its biosample accession to be consistent (the assembly-to-biosample dict was taken from the PATRIC metadata table): script for conversion
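The renaming step could look roughly like the sketch below. This is not the linked conversion script. It assumes a hypothetical two-column, tab-separated mapping file (assembly accession, biosample accession) exported from the PATRIC metadata table, and assumes the downloaded files are named `<assembly_accession>_<anything>.fna.gz`.

```shell
# Sketch only: rename assembly-accession files to biosample-accession names.
# Assumptions (not from the original notes): the mapping file layout and the
# "<assembly_accession>_*.fna.gz" naming of the downloaded files.
rename_to_biosample() (
    map="$1"    # hypothetical TSV: assembly_accession<TAB>biosample_accession
    dir="$2"    # folder holding the downloaded assemblies
    tab="$(printf '\t')"
    while IFS="$tab" read -r assembly biosample || [ -n "$assembly" ]; do
        # skip blank or incomplete rows
        [ -n "$assembly" ] && [ -n "$biosample" ] || continue
        found=0
        # the trailing "_" keeps GCA_x.1 from also matching GCA_x.10
        for f in "$dir/${assembly}_"*.fna.gz; do
            [ -e "$f" ] || continue
            # -n: never overwrite an already renamed file
            mv -n -- "$f" "$dir/$biosample.fna.gz"
            found=1
        done
        [ "$found" -eq 1 ] || echo "no file for $assembly" >&2
    done < "$map"
)

# Example call (the mapping file name is hypothetical):
# rename_to_biosample assembly_to_biosample.tsv /scratch/rx32940/PATRIC_assemblies_633/ncbi-genomes-2020-01-22
```

Accessions with no matching file are reported on stderr, so any PATRIC records without a downloaded assembly show up rather than being skipped silently.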
/scratch/rx32940/picardeau/SRA_seq/
-- we are left with only 528 isolates without a date (215 excluding the Picardeau isolates)
PRJEB2095 (104 isolates) from the Wellcome Sanger Institute is pre-publication and probably needs approval for use: https://www.sanger.ac.uk/resources/downloads/bacteria/leptospira-interrogans.html
Skipped (too short):
24901017 bp (1.1764%).
115285 reads (39.8951%).
gatekeeperCreate did NOT finish successfully; too many short reads. Check your reads!
gatekeeperCreate finished successfully.
currently, 525 isolates without collection date metadata (212 w/o picardeau isolates)
12 runs stored in ERX006638 were "Illumina sequencing of barcoded sample library "L interogans 3" for the "Discovery of sequence diversity in Leptospira interrogans ST34" study"
sample SAMN04090012 has 4 technical replicates (SRR2423277, SRR2423278, SRR2423279, SRR2423280). The four replicates for the biosample were concatenated into one file per read direction as follows:
cat SRR24232*_1.fastq.gz > SAMN04090012_1.fastq.gz
cat SRR24232*_2.fastq.gz > SAMN04090012_2.fastq.gz
some of the paired-end fastq files sequenced by the Wellcome Sanger Institute (SC) have the reverse file labeled "_3" or "_4" instead of "_2" at the end of the file name (before the extensions).
the runs whose reverse fastq is labeled "_4" also have a file labeled "_3"; however, those "_3" files contain only single-nucleotide reads, so they will not be used for assembly.
runs with "_3" and "_4" labels for the reverse file can be found in
/scratch/rx32940/rest_sra_216/PairEnd_3(4).txt
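One way to pick the reverse file automatically, following my reading of the labeling notes above, is sketched below: use "_2" when it exists, otherwise "_4", otherwise "_3". Because "_4" is checked before "_3", the single-nucleotide "_3" file that accompanies a "_4" run is never selected. The function name and the `.fastq.gz` extension are assumptions.

```shell
# Sketch only: print the reverse-read file for a run, or fail if none exists.
# Priority _2, then _4, then _3 (a "_3" next to a "_4" is the unusable
# single-nucleotide file, so "_4" must win over it).
reverse_fastq() (
    run="$1"    # run accession, e.g. ERR017078
    dir="$2"    # folder holding the fastq files
    for suffix in 2 4 3; do
        f="$dir/${run}_${suffix}.fastq.gz"
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            exit 0
        fi
    done
    echo "no reverse file for $run" >&2
    exit 1
)

# Example call (the directory is hypothetical):
# rev=$(reverse_fastq ERR017078 /path/to/fastq_dir)
```

The returned path can be passed to the assembler as the reverse-read input without renaming the original files.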
4 of the 212 no-date isolates could not be assembled:
pipeline didn't finish, with the error below: ERR017078 (SAMEA864154), ERR017080 (SAMEA864155), ERR017165 (SAMEA864093) - coverage too low
== Error == system call for: "['/usr/local/apps/gb/spades/3.12.0-k_245/bin/spades-core', '/scratch/rx32940/rest_sra_216/assemblies/ERR017165/K21/configs/config.info']" finished abnormally, err code: -6
pipeline finished with only contigs, but no scaffolds: SAMN04090012
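A quick way to list such failures is sketched below. It assumes one sub-directory per isolate directly under an assembly root, each holding the SPAdes outputs `scaffolds.fasta` and `contigs.fasta`; the function name and status labels are made up for illustration.

```shell
# Sketch only: report isolates whose assembly folder has no usable scaffolds.
# Assumed layout: <root>/<isolate_id>/scaffolds.fasta (and contigs.fasta).
missing_scaffolds() (
    root="$1"
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        d="${d%/}"
        name=$(basename "$d")
        # -s: file exists and is not empty
        if [ ! -s "$d/scaffolds.fasta" ]; then
            if [ -s "$d/contigs.fasta" ]; then
                status="contigs_only"
            else
                status="no_assembly"
            fi
            printf '%s\t%s\n' "$name" "$status"
        fi
    done
)

# Example call (the "assemblies" sub-dir name is an assumption):
# missing_scaffolds /scratch/rx32940/rest_sra_216/assemblies
```

Each output line is an isolate ID and a status, separated by a tab, which makes it easy to compare against the list of failed runs.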
all assemblies finished, with 1209 isolates in total (excluding 12 assemblies from barcoded runs found in the rest_sra_216 folder, including the 4 that failed to assemble)
/scratch/rx32940/All_Lepto_Assemblies/all_assemblies_1209.txt
assemblies are in the assemblies sub-dir, in the scaffolds.fasta file (the 4 isolates that couldn't be assembled have no scaffolds.fasta), except for PATRIC_assemblies_633, which has assemblies stored in a ${biosample_acc}.fna.gz file
/scratch/rx32940/All_Lepto_Assemblies/dated_assembled_51(picardeau_313/rest_sra_216/PATRIC_assemblies_633)
quast_assemblies folder from the dir above
/scratch/rx32940/reference
/Users/rx32940/Dropbox/5.Rachel-projects/Phylogeography/Organism_fullname_1209.txt