Create refseq sorter script

kmartinez834 commented 4 days ago

Create script to separate records from refseq_proteinall*.gpff files. The script should accept an organism argument so it returns only records from the desired organism.

FYI @katewarner

rykahsay commented 4 days ago

Here is the script ---> /software/glygen/filter-refseq.py, also showing parts of it below. As you can see below, the downloaded files (excluding ["9606", "10090", "10116", "9823"]) cannot have tax_id in them.

def main():

    species_obj = {}
    in_file = "/data/projects/glygen/generated/misc/species_info.csv"
    load_species_info(species_obj, in_file)

    skip_list = ["9606", "10090", "10116", "9823"]
    taxid2downloadfile = {
        "559292":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_fungi.gpff",
        "7227":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_insects.gpff",
        "9031":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_vertebrates.gpff",
        "3702":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_plants.gpff",
        "44689":"/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_amoebozoa.gpff"
    }

    for k in species_obj:
        if species_obj[k]["is_reference"] != "yes" or k.isdigit() == False:
            continue
        tax_id = str(species_obj[k]["tax_id"]) 
        long_name = species_obj[k]["long_name"]
        if tax_id in skip_list:
            continue
        in_file = taxid2downloadfile[tax_id]
        if os.path.isfile(in_file) == False:
            continue
        out_file = "/data/projects/glygen/downloads/ncbi/refseq/current/refseq_protein_all_%s.gpff" % (tax_id)
        FW = open(out_file, "w")
        record_count = 0
        for record in SeqIO.parse(in_file, "genbank"):
            organism = record.annotations['organism']
            if organism.lower().find(long_name.lower()) != -1:
                SeqIO.write(record, FW, "gb")
                record_count += 1
        FW.close()
        cmd = "chmod 775 %s"  % (out_file)
        x = subprocess.getoutput(cmd)
        print ("Created file: %s" % (out_file))

kmartinez834 commented 3 days ago

I've added the script above to the download_refseq.sh file.

@rykahsay can you let me know when you're done with all of the refseq source files? I'd like to do a test run of the download, but don't want to mess up current link if you're still using it.

rykahsay commented 3 days ago

I have done some minor moidifications to the script, please update.

For this release, I have already done the sorting as shown below.

-rwxrwxr-x. 1 k.warner1 glygen   660199875 Jun 24 09:10 downloads/ncbi/refseq/current/refseq_protein_all_10090.gpff
-rwxrwxr-x. 1 k.warner1 glygen   430973543 Jun 24 09:11 downloads/ncbi/refseq/current/refseq_protein_all_10116.gpff
-rwxrwxr-x. 1 rykahsay  glygen   352229135 Jul  1 22:36 downloads/ncbi/refseq/current/refseq_protein_all_3702.gpff
-rwxrwxr-x. 1 rykahsay  glygen    95995620 Jul  1 20:42 downloads/ncbi/refseq/current/refseq_protein_all_44689.gpff
-rwxrwxr-x. 1 rykahsay  glygen    39798105 Jul  1 20:30 downloads/ncbi/refseq/current/refseq_protein_all_559292.gpff
-rwxrwxr-x. 1 rykahsay  glygen   480816836 Jul  1 19:25 downloads/ncbi/refseq/current/refseq_protein_all_7227.gpff
-rwxrwxr-x. 1 rykahsay  glygen   615098329 Jul  1 17:28 downloads/ncbi/refseq/current/refseq_protein_all_9031.gpff
-rwxrwxr-x. 1 k.warner1 glygen  1322064964 Jun 24 09:10 downloads/ncbi/refseq/current/refseq_protein_all_9606.gpff
-rwxrwxr-x. 1 k.warner1 glygen   279404223 Jun 24 09:11 downloads/ncbi/refseq/current/refseq_protein_all_9823.gpff
-rwxrwxr-x. 1 k.warner1 glygen  4538659530 Jun 24 09:14 downloads/ncbi/refseq/current/refseq_protein_all_amoebozoa.gpff
-rwxrwxr-x. 1 k.warner1 glygen 24820500219 Jun 24 09:13 downloads/ncbi/refseq/current/refseq_protein_all_fungi.gpff
-rwxrwxr-x. 1 k.warner1 glygen 41928681959 Jun 24 09:12 downloads/ncbi/refseq/current/refseq_protein_all_insects.gpff
-rwxrwxr-x. 1 k.warner1 glygen 33236712734 Jun 24 09:18 downloads/ncbi/refseq/current/refseq_protein_all_plants.gpff
-rwxrwxr-x. 1 k.warner1 glygen 73568323306 Jun 24 09:16 downloads/ncbi/refseq/current/refseq_protein_all_vertebrates.gpff
-rwxrwxr-x. 1 k.warner1 glygen  2020643689 Jun 24 09:11 downloads/ncbi/refseq/current/refseq_protein_all_viral.gpff

glygener / glygen-issues

Create refseq sorter script #1491