pachterlab / gget

🧬 gget enables efficient querying of genomic reference databases
https://gget.bio
BSD 2-Clause "Simplified" License

Is it possible to get all ELMs using gget? #119

Closed Abhishaike closed 10 months ago

Abhishaike commented 10 months ago

Request type

Extension of existing module

Request description

Can I get all ELMs using just gget?

Example command

_, regex_df = gget.elm('')

Example return value

regex_df would contain all ELMs

lauraluebbert commented 10 months ago

Hi Abhishaike, thank you for reaching out! This is roughly what gget.setup('elm') does: by default, it saves all ELMs as TSV files in a folder called 'elm_files' inside the gget installation folder. I just added a new out argument to gget.setup that lets you specify an alternative download folder. This option will be part of the next gget release (v0.28.3), but you can already install it from the dev branch and try it out:

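# Option 1: install gget from the dev branch (uncomment both lines)
# Note: the leading "!" is Jupyter/Colab notebook syntax for running a shell command
#!pip install -q mysql-connector-python==8.0.29
#!pip install -q git+https://github.com/pachterlab/gget.git@dev

# Option 2: install the latest gget release
!pip install -q gget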
import gget

# Save all ELMs in the current directory
gget.setup("elm", out="./")

# Open ELM files using pandas
import pandas as pd
# Load all ELM instances
df_instances = pd.read_csv("elm_instances.tsv", sep="\t", skiprows=5)
# Load additional information about ELMs (description, functional site, etc.)
df_classes = pd.read_csv("elms_classes.tsv", sep="\t", skiprows=5)
# Load additional information about interaction domains
df_intdomains = pd.read_csv("elm_interaction_domains.tsv", sep="\t")
# Rename columns in interaction domains file to match other files
df_intdomains = df_intdomains.rename(
    columns={
        "ELM identifier": "ELMIdentifier",
        "Interaction Domain Id": "InteractionDomainId",
        "Interaction Domain Description": "InteractionDomainDescription",
        "Interaction Domain Name": "InteractionDomainName",
    }
)

# Merge information about all ELMs into a single data frame
df_elm = df_instances.merge(df_classes, how="left", on="ELMIdentifier")
df_elm = df_elm.merge(df_intdomains, how="left", on="ELMIdentifier")
df_elm
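
The skiprows=5 arguments above suggest that the instances and classes files start with a few metadata lines before the real header row. That layout is an assumption about the downloaded files, not something gget guarantees. A minimal sketch for checking which columns each file actually provides, using only the standard library, could look like this. Here tsv_header is a hypothetical helper for illustration and is not part of gget, and the "#" comment prefix is likewise assumed:

```python
import csv
import os


def tsv_header(path, comment_prefix="#"):
    """Return the column names of a TSV file, skipping leading comment lines.

    Hypothetical helper (not part of gget). Assumes metadata lines, if any,
    start with `comment_prefix` and come before the header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and lines that look like metadata comments
            if not row or row[0].startswith(comment_prefix):
                continue
            # The first remaining row is treated as the header
            return row
    # Empty file or comments only: no header found
    return []


if __name__ == "__main__":
    # File names taken from the snippet above; files that are missing are skipped
    for name in (
        "elm_instances.tsv",
        "elms_classes.tsv",
        "elm_interaction_domains.tsv",
    ):
        if os.path.exists(name):
            print(name, tsv_header(name))
```

If the printed header of a file does not line up with what pd.read_csv returns, the skiprows value for that file is the first thing to check.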

Please note that the dev branch is currently undergoing active development, and there might be breaking changes. Does this solve your request?

Edit: I also added information about interaction domains.

Edit: v0.28.3 will be released today, so moving forward, there is no need to install gget from the dev branch for this.

lauraluebbert commented 10 months ago

I'm going to go ahead and close this issue, but please let me know if the proposed solution does not work.

Abhishaike commented 9 months ago

Just saw this, thank you!

I am noticing that a lot of the interaction domain columns are missing: df_intdomains only contains 4 columns. Is there a way to add in the affinity and the start/stop ELM bits?

Abhishaike commented 9 months ago

Poking at this again!