Open kmartinez834 opened 4 days ago
Here is the script ---> /software/glygen/filter-refseq.py, also showing parts of it below. As you can see below, the downloaded files (excluding ["9606", "10090", "10116", "9823"]) cannot have tax_id in them.
def main():
species_obj = {}
in_file = "/data/projects/glygen/generated/misc/species_info.csv"
load_species_info(species_obj, in_file)
skip_list = ["9606", "10090", "10116", "9823"]
taxid2downloadfile = {
"559292":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_fungi.gpff",
"7227":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_insects.gpff",
"9031":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_vertebrates.gpff",
"3702":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_plants.gpff",
"44689":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_amoebozoa.gpff"
}
for k in species_obj:
if species_obj[k]["is_reference"] != "yes" or k.isdigit() == False:
continue
tax_id = str(species_obj[k]["tax_id"])
long_name = species_obj[k]["long_name"]
if tax_id in skip_list:
continue
in_file = taxid2downloadfile[tax_id]
if os.path.isfile(in_file) == False:
continue
out_file = "/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_%s.gpff" % (tax_id)
FW = open(out_file, "w")
record_count = 0
for record in SeqIO.parse(in_file, "genbank"):
organism = record.annotations['organism']
if organism.lower().find(long_name.lower()) != -1:
SeqIO.write(record, FW, "gb")
record_count += 1
FW.close()
cmd = "chmod 775 %s" % (out_file)
x = subprocess.getoutput(cmd)
print ("Created file: %s" % (out_file))
I've added the script above to the download_refseq.sh file.
@rykahsay can you let me know when you're done with all of the refseq source files? I'd like to do a test run of the download, but don't want to mess up current link if you're still using it.
I have done some minor moidifications to the script, please update.
For this release, I have already done the sorting as shown below.
-rwxrwxr-x. 1 k.warner1 glygen 660199875 Jun 24 09:10 downloads/ncbi/refseq/current/refseq_protein_all_10090.gpff
-rwxrwxr-x. 1 k.warner1 glygen 430973543 Jun 24 09:11 downloads/ncbi/refseq/current/refseq_protein_all_10116.gpff
-rwxrwxr-x. 1 rykahsay glygen 352229135 Jul 1 22:36 downloads/ncbi/refseq/current/refseq_protein_all_3702.gpff
-rwxrwxr-x. 1 rykahsay glygen 95995620 Jul 1 20:42 downloads/ncbi/refseq/current/refseq_protein_all_44689.gpff
-rwxrwxr-x. 1 rykahsay glygen 39798105 Jul 1 20:30 downloads/ncbi/refseq/current/refseq_protein_all_559292.gpff
-rwxrwxr-x. 1 rykahsay glygen 480816836 Jul 1 19:25 downloads/ncbi/refseq/current/refseq_protein_all_7227.gpff
-rwxrwxr-x. 1 rykahsay glygen 615098329 Jul 1 17:28 downloads/ncbi/refseq/current/refseq_protein_all_9031.gpff
-rwxrwxr-x. 1 k.warner1 glygen 1322064964 Jun 24 09:10 downloads/ncbi/refseq/current/refseq_protein_all_9606.gpff
-rwxrwxr-x. 1 k.warner1 glygen 279404223 Jun 24 09:11 downloads/ncbi/refseq/current/refseq_protein_all_9823.gpff
-rwxrwxr-x. 1 k.warner1 glygen 4538659530 Jun 24 09:14 downloads/ncbi/refseq/current/refseq_protein_all_amoebozoa.gpff
-rwxrwxr-x. 1 k.warner1 glygen 24820500219 Jun 24 09:13 downloads/ncbi/refseq/current/refseq_protein_all_fungi.gpff
-rwxrwxr-x. 1 k.warner1 glygen 41928681959 Jun 24 09:12 downloads/ncbi/refseq/current/refseq_protein_all_insects.gpff
-rwxrwxr-x. 1 k.warner1 glygen 33236712734 Jun 24 09:18 downloads/ncbi/refseq/current/refseq_protein_all_plants.gpff
-rwxrwxr-x. 1 k.warner1 glygen 73568323306 Jun 24 09:16 downloads/ncbi/refseq/current/refseq_protein_all_vertebrates.gpff
-rwxrwxr-x. 1 k.warner1 glygen 2020643689 Jun 24 09:11 downloads/ncbi/refseq/current/refseq_protein_all_viral.gpff
Create script to separate records from refseq_proteinall*.gpff files. The script should accept an organism argument so it returns only records from the desired organism.
FYI @katewarner