transcript / COG

Scripts and steps for making a consolidated COG database file for tools like DIAMOND.

Code update request to merge COG2020 files #2

Open vinuesa opened 3 years ago

vinuesa commented 3 years ago

Hi Sam, thanks for sharing your code on GitHub. Your scripts work smoothly. I was just wondering whether you are planning to update your merge.py script to process the file formats of the updated COG2020 database.

References: https://academic.oup.com/nar/article/49/D1/D274/5964069 and ftp://ftp.ncbi.nih.gov/pub/COG/COG2020/data

SolayMane commented 3 years ago

Here is the script, modified to work with COG2020: https://github.com/SolayMane/FOA_scripts/blob/main/merge_cog202.py

kkpenn commented 3 years ago

Wrote a second modification for Python 3 and to find the COG IDs for all sequences: https://github.com/kkpenn/merger_COG2020/blob/main/merger_2.py

vinuesa commented 3 years ago

Hi Sam, thanks for notifying the update. This is great news. Take care

Rmmendoza commented 3 years ago

Hello Sam,

Thank you very much for your Python scripts for COG annotation. I'm wondering whether the script DIAMOND_COG_Analysis_Counter.py is compatible with the recent COG2020 database update. I followed your documentation at https://github.com/transcript/COG, but when I reached the section "Analyzing the results of a DIAMOND search against the COG database", I encountered this error:

    python DIAMOND_COG_analysis_counter.py -I LbSa_Pan.cogs -O LbSaResults.cog -D merged_cogs.fa

    Analysis of LbSa_Pan.cogs complete.
    Number of total lines: 67960
    Number of unique sequences: 3181
    Time elapsed: 0.32341003418 seconds.

    Starting database analysis now.
    Traceback (most recent call last):
      File "DIAMOND_COG_analysis_counter.py", line 125, in <module>
        db_id = str(splitline[0] + "|" + splitline[1] + "|" + splitline[2] + "|" + splitline[3] + "|")[1:]
    IndexError: list index out of range

The version of DIAMOND_COG_Analysis_Counter.py that I used is the one that has been updated for time.clock().

Again thank you very much, I appreciate any help. Stay safe!

Malokidz commented 2 years ago

Hello @Rmmendoza. You can work around this problem by deleting the pipe character (|) in the script.

I have a question of my own: how can I deal with the following problem?

    Analysis of test.cogs complete.
    Number of total lines: 1
    Number of unique sequences: 1
    Time elapsed: 0.000323057174683 seconds.

    Starting database analysis now.
    1000000 lines processed so far in 7.30374193192 seconds.
    2000000 lines processed so far in 14.0681209564 seconds.
    3000000 lines processed so far in 20.5442709923 seconds.

    Success!
    Time elapsed: 21.8866050243 seconds.
    Number of lines: 3213025
    Number of errors: 0
    Traceback (most recent call last):
      File "/home/abdelmalek/DIAMOND_COG_analysis_counter1.py", line 158, in <module>
        org = db_hier_dictionary[entry]
    KeyError: 'WP_101828536_1'

SolayMane commented 2 years ago

The protein with key WP_101828536_1 is absent from your dictionary db_hier_dictionary.
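One way to see how many hits are affected, rather than stopping at the first missing key, is to look each ID up with `dict.get` and collect the misses. The following is a minimal sketch, not the script's actual code: the function name `lookup_hierarchies` and the toy data are hypothetical, and only `db_hier_dictionary` and the two IDs come from this thread.

```python
def lookup_hierarchies(hit_ids, db_hier_dictionary):
    """Split hit IDs into those found in the dictionary and those missing."""
    found = {}
    missing = []
    for hit_id in hit_ids:
        # dict.get returns None for an absent key instead of raising KeyError
        hierarchy = db_hier_dictionary.get(hit_id)
        if hierarchy is None:
            missing.append(hit_id)
        else:
            found[hit_id] = hierarchy
    return found, missing


# Toy data for illustration only (hypothetical dictionary contents).
example_db = {"WP_003663563.1": "COG0209 | F"}
example_hits = ["WP_003663563.1", "WP_101828536_1"]
found, missing = lookup_hierarchies(example_hits, example_db)
print("found:", sorted(found))
print("missing:", missing)
```

If most or all hits end up in the missing list, the dictionary keys are probably being built in a different shape from the IDs that DIAMOND reports.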

Malokidz commented 2 years ago

Dear @SolayMane, could you please help me resolve this issue?

    Analysis of 1pantoea.cogs complete.
    Number of total lines: 1
    Number of unique sequences: 1
    Time elapsed: 0.0009813308715820312 seconds.

    Starting database analysis now.
    Traceback (most recent call last):
      File "DIAMOND_COG_analysis_counter1.py", line 125, in <module>
        db_id = str(splitline[0] + "|" + splitline[1] + "|" + splitline[2] + "|" + splitline[3] + "|")[1:]
    IndexError: list index out of range

SolayMane commented 2 years ago

@Malokidz, can you paste your Python code here?

Malokidz commented 2 years ago

Dear @SolayMane, I am using the DIAMOND_COG_analysis_counter.py script. My problem is in the step that builds a dictionary of the reference database:

    # building a dictionary of the reference database
    db_hier_dictionary = {}
    db_line_counter = 0
    db_error_counter = 0

    for line in db:
        if line.startswith(">") == True:
            db_line_counter += 1
            splitline = line.split("|")

            # ID, the hit returned in DIAMOND results
            db_id = str(splitline[0] + "|" + splitline[1] + "|" + splitline[2] + "|" + splitline[3] + "|")[1:]

            # name and functional description
            if "NO COG FOUND" in splitline[1]:
                db_hier = "NO HIERARCHY"
            else:
                hier_split = line.split("|")
                db_hier = hier_split[5] + " | " + hier_split[6].strip()

            # add to dictionaries
            db_hier_dictionary[db_id] = db_hier

            # line counter to show progress
            if db_line_counter % 1000000 == 0:  # each million
                t95 = time.time()
                print(str(db_line_counter) + " lines processed so far in " + str(t95 - t2) + " seconds.")

    t3 = time.time()

    print("\nSuccess!")
    print("Time elapsed: " + str(t3 - t2) + " seconds.")
    print("Number of lines: " + str(db_line_counter))
    print("Number of errors: " + str(db_error_counter))
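The `IndexError` at `splitline[3]` is what Python raises when a header line splits into fewer than four `|`-separated fields. The merged COG2020 header quoted elsewhere in this thread has only three (`ID description [organism] | COGxxxx | category`). Below is a minimal sketch of a more tolerant parser, assuming that header layout and assuming the DIAMOND hit ID is the first whitespace-delimited token of the header. The function name `parse_merged_header` is hypothetical and this is not the fix used in any of the linked scripts.

```python
def parse_merged_header(line):
    """Return (protein_id, hierarchy) for a FASTA header line, else None.

    Assumed layout (taken from the example header in this thread):
        >ID description [organism] | COGxxxx | category
    """
    if not line.startswith(">"):
        return None  # sequence line, not a header
    fields = [field.strip() for field in line[1:].split("|")]
    tokens = fields[0].split()
    if not tokens:
        return None  # empty header, nothing to use as a key
    protein_id = tokens[0]  # assumption: the hit ID is the first token
    if len(fields) >= 3:
        hierarchy = fields[1] + " | " + fields[2]
    else:
        hierarchy = "NO HIERARCHY"  # header without COG annotation
    return protein_id, hierarchy


header = (">WP_003663563.1 ribonucleoside-diphosphate reductase subunit alpha "
          "[Moraxella catarrhalis] | COG0209 | F\n")
print(parse_merged_header(header))
```

Checking the number of fields before indexing means a header in an unexpected format is recorded as "NO HIERARCHY" instead of aborting the whole run.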

Thewhitewolf8 commented 2 years ago

Dear @SolayMane, I am using the DIAMOND_COG_analysis_counter.py script for one of my analyses, but I get the following error even though the key is present in the database (see the grep output below). Could you please help me resolve this issue?


    python Diamond.py -I test.cogs -O result.cogs -D merged_cogs.fa

    Analysis of test.cogs complete.
    Number of total lines: 25
    Number of unique sequences: 1
    Time elapsed: 7.796287536621094e-05 seconds.

    Starting database analysis now.
    1000000 lines processed so far in 2.3694207668304443 seconds.
    2000000 lines processed so far in 4.80712628364563 seconds.
    3000000 lines processed so far in 7.313858270645142 seconds.

    Success!
    Time elapsed: 7.811172246932983 seconds.
    Number of lines: 3213025
    Number of errors: 0
    Traceback (most recent call last):
      File "/home/srmap/Desktop/sharayu/tools/COG/COG-master/Diamond.py", line 158, in <module>
        org = db_hier_dictionary[entry]
    KeyError: 'WP_003663563.1'

/Desktop/sharayu/tools/COG/COG-master$ grep "WP_003663563.1" merged_cogs.fa

WP_003663563.1 ribonucleoside-diphosphate reductase subunit alpha [Moraxella catarrhalis] | COG0209 | F
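Note that `grep` reports any line containing the ID as a substring, whereas a dictionary lookup succeeds only when the key is exactly equal to the ID. A quick way to check whether the keys were stored in a longer form is to search for keys that contain the hit ID without being equal to it. This is a minimal diagnostic sketch: the function name `find_near_matches` and the example key are hypothetical, chosen only to illustrate one way the mismatch could arise.

```python
def find_near_matches(hit_id, dictionary_keys, limit=5):
    """Return up to `limit` keys that contain hit_id but are not equal to it."""
    matches = []
    for key in dictionary_keys:
        if key != hit_id and hit_id in key:
            matches.append(key)
            if len(matches) == limit:
                break
    return matches


# Hypothetical situation: the key was stored with the description attached.
example_keys = [
    "WP_003663563.1 ribonucleoside-diphosphate reductase subunit alpha "
    "[Moraxella catarrhalis] ",
]
print(find_near_matches("WP_003663563.1", example_keys))
```

If this returns any keys, the code that builds `db_hier_dictionary` is keeping more of the header than the ID that DIAMOND writes to its output, and the key construction needs to be trimmed to match.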

Proelmocan23 commented 2 years ago

I am new to coding. These are the fixes I made to get it working. Please let me know if there is something fundamentally wrong about how I approached these issues.

https://github.com/Proelmocan23/DIAMOND_COG2020_analysis_counter/blob/main/DIAMOND_COG2020_analysis_counter.py

Thewhitewolf8 commented 1 year ago

Hi

The code that you shared runs nicely. Thank you for it. Could you please tell me how I can use it for multiple genomes in one go?


Proelmocan23 commented 1 year ago

Hi @Thewhitewolf8,

I'm not sure what you mean by multiple genomes in one go? Could you give me more details about the aim of your project?

I assume you are running this on Linux? If so, would manually creating a .sh file be useful to you?

The contents of the file would be as follows:

#!/bin/bash

python3 DIAMOND_COG2020_analysis_counter.py -I your_genome_1.cogs -O result_1.cogs -D merged_cogs.fa

python3 DIAMOND_COG2020_analysis_counter.py -I your_genome_2.cogs -O result_2.cogs -D merged_cogs.fa

python3 DIAMOND_COG2020_analysis_counter.py -I your_genome_3.cogs -O result_3.cogs -D merged_cogs.fa
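For a large number of input files, the same idea can be expressed as a loop instead of one hand-written line per genome. Here is a minimal sketch in Python, assuming all DIAMOND outputs sit in one directory with a `.cogs` extension. The `-I`, `-O` and `-D` flags are the ones used above, while the function names and the `.cog_summary` output suffix are hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(input_dir, database="merged_cogs.fa",
                   script="DIAMOND_COG2020_analysis_counter.py"):
    """Build one command line per *.cogs file found in input_dir."""
    commands = []
    for cogs_file in sorted(Path(input_dir).glob("*.cogs")):
        # Hypothetical output name: same stem, different suffix, so that
        # results are not picked up as inputs on a second run.
        output_file = cogs_file.with_suffix(".cog_summary")
        commands.append(["python3", script,
                         "-I", str(cogs_file),
                         "-O", str(output_file),
                         "-D", database])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Example usage (the directory name is a placeholder):
# run_all(build_commands("my_cogs_directory"))
```

Building the command list separately from running it makes it easy to print and inspect the commands before launching anything.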

Thewhitewolf8 commented 1 year ago

Thanks for your response. I have around 300 bacteriophage genomes and I want to annotate them all together, so that the output shows the COG categorisation for all of the roughly 1 lakh (100,000) ORFs. That is what I am not getting.


ShwetaaPandey commented 1 year ago

Hi @kkpenn,

I am using your script to update the COG files, but I am getting this error:

      File "merger_2.py", line 16
        print(f"\nCog file read. Time elapsed: {t1-t0} seconds.")
                                                                ^
    SyntaxError: invalid syntax
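A `SyntaxError` pointing at an f-string usually means the script was run with an interpreter older than Python 3.6, the release that introduced f-strings (for example, `python` resolving to Python 2). Running the script with `python3` is the simplest remedy. As a hedged alternative, the same message can be built with `str.format`, which also works on older interpreters. The helper names below are hypothetical and only the message text comes from the traceback.

```python
import sys


def supports_f_strings(version_info=sys.version_info):
    """f-strings were introduced in Python 3.6."""
    return tuple(version_info[:2]) >= (3, 6)


def elapsed_message(t0, t1):
    """Build the progress message without relying on f-strings."""
    return "\nCog file read. Time elapsed: {} seconds.".format(t1 - t0)


if not supports_f_strings():
    print("This interpreter predates f-strings; try running with python3.")
print(elapsed_message(1.0, 3.5))
```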

ShwetaaPandey commented 1 year ago

Hi @SolayMane ,

Your script is unavailable; the linked web page cannot be found.

SolayMane commented 1 year ago

Hi @ShwetaaPandey , you can find it here https://github.com/SolayMane/MyToolBox/blob/main/merge_cog202.py

ShwetaaPandey commented 1 year ago

Hi @SolayMane, I am getting an error while running your script:

    Cog file read. Time elapsed: 3.680607 seconds.
    Traceback (most recent call last):
      File "merge_cog202.py", line 30, in <module>
        trans_cog_db_file.write(str(trans_cog_db))
    NameError: name 'trans_cog_db_file' is not defined
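A `NameError` of this kind means that the name `trans_cog_db_file` was never assigned before line 30 was reached, for instance because the line that opens the output file is missing, commented out, or inside a branch that did not run. The contents of merge_cog202.py are not shown in this thread, so the following is only a generic sketch of the pattern, with a hypothetical function name and output path: open the file and bind the handle before writing to it.

```python
def write_merged_database(trans_cog_db, output_path):
    """Open the output file first, then write the merged data to it."""
    # The `with` statement binds trans_cog_db_file before any write call
    # and closes the file automatically afterwards.
    with open(output_path, "w") as trans_cog_db_file:
        trans_cog_db_file.write(str(trans_cog_db))


# Example usage (the path and data are placeholders):
# write_merged_database(">example header\nMKV\n", "merged_cogs.fa")
```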