glasgowcompbio / pyMultiOmics

Python toolbox for multi-omics data mapping and analysis

Improve chebi mapping #5

Closed joewandy closed 3 years ago

joewandy commented 3 years ago

Related to https://github.com/glasgowcompbio/WebOmics/issues/70

@kmcluskey has implemented some code to pull related chebi ids from the csv. We should incorporate it into this package (as an option when mapping) to get more hits.

joewandy commented 3 years ago

I'm thinking of https://github.com/glasgowcompbio/pyMultiOmics/blob/main/pyMultiOmics/mapping.py#L42-L45: add a parameter related=True. If True, run the code to retrieve related compounds and set the result into self.compound_df in place of the original df.
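Something along these lines, maybe (just a sketch; _load_compound_df and retrieve_related_compounds are placeholders for illustration, not existing functions):

class Mapper:
    def __init__(self, species, related=False):
        # placeholder loader: however compound_df is currently built
        self.compound_df = self._load_compound_df(species)
        if related:
            # replace the original df with one that also contains
            # the related ChEBI IDs pulled from the csv
            self.compound_df = retrieve_related_compounds(self.compound_df)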

kmcluskey commented 3 years ago

I'm not sure how this should be implemented. If the user gives KEGG IDs for the compounds, do you want to find the ChEBI ID for each KEGG ID and then all of the related ChEBI IDs, or do you just want this to run when ChEBI IDs are provided?

joewandy commented 3 years ago

Working on the related_chebi branch, I added the parameter include_related_chebi to the constructor of the Mapper class: https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/pyMultiOmics/mapping.py#L18.

Eventually that flag is passed down to the Reactome mapping function. If it is set to True, we should find all the related ChEBI IDs that are linked to the input ChEBI IDs: https://github.com/glasgowcompbio/pyMultiOmics/blob/e5f64e30b8eb81f1b319ecc770fa6a14929ae48b/pyMultiOmics/functions.py#L72-L74.

Could you help add the implementation inside the get_related_chebi method please? Thanks. It's here: https://github.com/glasgowcompbio/pyMultiOmics/blob/e5f64e30b8eb81f1b319ecc770fa6a14929ae48b/pyMultiOmics/functions.py#L308-L310
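I imagine the hook looking roughly like this (a sketch only; the signatures are assumed for illustration, not copied from the branch):

def reactome_mapping(compound_df, include_related_chebi=False):
    if include_related_chebi:
        # expand the input with related ChEBI IDs before mapping
        compound_df = get_related_chebi(compound_df)
    # ... continue with the usual Reactome mapping ...
    return compound_df

def get_related_chebi(compound_df):
    # to be implemented: return compound_df expanded with a row for
    # every ChEBI ID related to the input ChEBI IDs
    raise NotImplementedError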

kmcluskey commented 3 years ago

The code currently finds all rows whose chebi_ids have related chebi_ids. For each such chebi_id, all matching rows are selected and copied, and the Identifier is replaced with the related ChEBI ID. The code then checks whether these new rows are already in the original DF; if they are not, the rows are appended to the original DF (same intensity rows, new related-ChEBI identifiers). For a large DF such as the one in FlyMet, which has duplicate chebi_id rows (same compound, different peak data) and 30,000 rows, this is a slow process. The procedure would not actually be run on the FlyMet DF, since all related chebi_ids have already been added there, but it makes a useful test case. If this method is okay, I will implement it.
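Roughly, what the code does is this (a sketch, assuming the relation lookup is a dict mapping a ChEBI ID to its related ChEBI IDs):

import pandas as pd

def add_related_rows(df, chebi_rel_dict):
    # df is indexed by the ChEBI 'Identifier'; chebi_rel_dict maps a
    # ChEBI ID to its related ChEBI IDs (an assumption for this sketch)
    existing = set(df.index)
    new_rows = []
    for chebi_id in existing & set(chebi_rel_dict):
        # select and copy all rows for this chebi_id (there may be several)
        matched = df.loc[[chebi_id]]
        for related_id in chebi_rel_dict[chebi_id]:
            # only append if the related id is not already in the original df
            if related_id not in existing:
                copies = matched.copy()
                copies.index = [related_id] * len(copies)
                new_rows.append(copies)
    return pd.concat([df] + new_rows) if new_rows else df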

joewandy commented 3 years ago

I added a notebook here: https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/notebooks/analysis_zebrafish-related_chebi.ipynb

I wrote an alternative version, get_related_chebi_data_v2. What do you think? I guess functionally it's the same as yours: it loops through each row and checks whether there are related ChEBI IDs. If there are, they are added to the results, but only if they don't already exist in the original dataframe.

import logging

import pandas as pd

logger = logging.getLogger(__name__)

# get_chebi_relation_dict is defined elsewhere in this package; it returns
# a dict whose keys are ChEBI ID strings and values are lists of related
# ChEBI IDs

def get_related_chebi_data_v2(cmpd_data):
    cmpd_data = cmpd_data.copy()

    # ensure the index type is string, since get_chebi_relation_dict also
    # returns strings as the keys
    cmpd_data.index = cmpd_data.index.map(str)

    # assumes the index is named 'Identifier', so reset_index() exposes it
    # as the 'Identifier' column
    cmpd_data = cmpd_data.reset_index()
    original_cmpds = set(cmpd_data['Identifier'])  # used for checking later

    # construct the related chebi dict
    chebi_rel_dict = get_chebi_relation_dict()

    # loop through each row in cmpd_data
    with_related_data = []
    for ix, row in cmpd_data.iterrows():

        # add the current row we're looping over
        current_identifier = row['Identifier']
        with_related_data.append(row)

        # check if there are related compounds to add
        if current_identifier in chebi_rel_dict:

            # if yes, get the related compounds
            chebi_list = chebi_rel_dict[current_identifier]
            for c in chebi_list:

                # add the related chebi, but only if it isn't already present
                # in the original compound data
                if c not in original_cmpds:
                    current_row = row.copy()
                    current_row['Identifier'] = c
                    with_related_data.append(current_row)

    # combine all the rows (each one a Series) into a single dataframe
    df = pd.concat(with_related_data, axis=1).T
    df = df.set_index('Identifier')
    logger.info('Inserted %d related compounds' % (len(df) - len(cmpd_data)))
    return df

I put the duplicate check into a separate function that can be called after the above (or even on its own if we want). It groups rows in the dataframe by the 'Identifier' column. If there are multiple rows in the same group (same compound, different peak data), we keep the row with the largest sum of intensities across the entire row and delete the rest. I'm not sure whether selecting on this criterion is the best thing to do, but let's go with it for now.

def remove_dupes(df):
    df = df.reset_index()

    # group df by the 'Identifier' column
    to_delete = []
    grouped = df.groupby('Identifier')
    for identifier, group_df in grouped:

        # if there are multiple rows sharing the same identifier
        if len(group_df) > 1:

            # drop the 'Identifier' column from the group since it can't be summed
            group_df = group_df.drop('Identifier', axis=1)

            # find the row with the largest total intensity in the group
            idxmax = group_df.sum(axis=1).idxmax()

            # mark all the rows in the group for deletion, except the one
            # with the largest sum
            temp = group_df.index.tolist()
            temp.remove(idxmax)
            to_delete.extend(temp)

    # actually do the deletion here
    logger.info('Removing %d rows with duplicate identifiers' % len(to_delete))
    df = df.drop(to_delete)
    df = df.set_index('Identifier')
    return df
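Here's how I'd expect the two functions to be chained (toy data just for illustration):

import pandas as pd

# toy compound dataframe: intensities indexed by ChEBI Identifier
cmpd_data = pd.DataFrame(
    {'sample1': [10.0, 5.0], 'sample2': [20.0, 2.0]},
    index=pd.Index(['15756', '17002'], name='Identifier'))

expanded = get_related_chebi_data_v2(cmpd_data)  # add related ChEBI rows
cleaned = remove_dupes(expanded)                 # keep one row per identifier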

kmcluskey commented 3 years ago

I'm confused about why you're removing these duplicates. PALS was set up to use all of the peaks that could potentially be a compound. Choosing the one with the largest sum of intensities doesn't make sense to me at all: why would that be the most likely to be the actual compound?

joewandy commented 3 years ago

Users are expected to provide clean input (one compound per row). The duplicate-removal check is only there as a fallback for when they fail to do that.

joewandy commented 3 years ago

Done by @kmcluskey