Closed · joewandy closed this issue 3 years ago
I'm thinking of https://github.com/glasgowcompbio/pyMultiOmics/blob/main/pyMultiOmics/mapping.py#L42-L45: add a parameter `related=True`. If True, run the code to retrieve related compounds and set the result into `self.compound_df` in place of the original df.
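A minimal sketch of what that flag could look like, assuming the constructor shape from `mapping.py`; the `get_related_chebi_data` helper below is a placeholder stub standing in for the real lookup, not the actual implementation:

```python
import pandas as pd

def get_related_chebi_data(compound_df):
    # Placeholder stub: the real version would append rows whose
    # Identifier is a related ChEBI id of an existing row.
    return compound_df

class Mapper:
    def __init__(self, compound_df, related=False):
        # when related=True, replace the original df with the expanded one
        if related:
            compound_df = get_related_chebi_data(compound_df)
        self.compound_df = compound_df

m = Mapper(pd.DataFrame({'Identifier': ['15377']}), related=True)
print(len(m.compound_df))
```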
I'm not sure how this should be implemented. If the user gives KEGG IDs for the compounds, do you want to find the ChEBI ID for each KEGG ID and then all the related ChEBI IDs, or should this only run when ChEBI IDs are provided?
Working on the related_chebi branch, I added this parameter, `include_related_chebi`, in the constructor of the Mapper class: https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/pyMultiOmics/mapping.py#L18. Eventually that flag is passed to the Reactome mapping function. If it is set to True, we should find all the related ChEBI IDs that are linked to the input ChEBI IDs: https://github.com/glasgowcompbio/pyMultiOmics/blob/e5f64e30b8eb81f1b319ecc770fa6a14929ae48b/pyMultiOmics/functions.py#L72-L74.
Could you help add the implementation inside the `get_related_chebi` method please? Thanks. It's here: https://github.com/glasgowcompbio/pyMultiOmics/blob/e5f64e30b8eb81f1b319ecc770fa6a14929ae48b/pyMultiOmics/functions.py#L308-L310
The code currently finds all rows whose ChEBI IDs have related ChEBI IDs. For each such ID, all matching rows are selected and copied, and the `Identifier` is replaced with the related ChEBI ID. The code then checks whether these new rows already exist in the original DF; if not, they are appended to it (same intensity values, new related ChEBI identifiers). For a large DF such as the one in FlyMet, which has duplicate ChEBI ID rows (same compound, different peak data) and 30,000 rows, this is a slow process. This procedure would not normally be run on the FlyMet DF, since all related ChEBI IDs have already been added there, but it makes a useful test case. If this method is okay, I will implement it.
Added the notebook to https://github.com/glasgowcompbio/pyMultiOmics/blob/related_chebi/notebooks/analysis_zebrafish-related_chebi.ipynb
I did an alternative version, `get_related_chebi_data_v2`. What do you think? I guess it's functionally the same as yours: it loops through each row and checks if there are related ChEBI IDs. If yes, it adds them to the results, but only if they don't already exist in the original dataframe.
```python
def get_related_chebi_data_v2(cmpd_data):
    cmpd_data = cmpd_data.copy()

    # ensure the index type is string, since get_chebi_relation_dict
    # also returns strings as the keys
    cmpd_data.index = cmpd_data.index.map(str)
    cmpd_data = cmpd_data.reset_index()
    original_cmpds = set(cmpd_data['Identifier'])  # used for checking later

    # construct the related chebi dict
    chebi_rel_dict = get_chebi_relation_dict()

    # loop through each row in cmpd_data
    with_related_data = []
    for ix, row in cmpd_data.iterrows():
        # add the current row we're looping over
        current_identifier = row['Identifier']
        with_related_data.append(row)

        # check if there are related compounds to add
        if current_identifier in chebi_rel_dict:
            # if yes, get the related compounds
            chebi_list = chebi_rel_dict[current_identifier]
            for c in chebi_list:
                # add the related chebi, but only if it's not already
                # present in the original compounds
                if c not in original_cmpds:
                    current_row = row.copy()
                    current_row['Identifier'] = c
                    with_related_data.append(current_row)

    # combine all the rows into a single dataframe
    df = pd.concat(with_related_data, axis=1).T
    df = df.set_index('Identifier')
    logger.info('Inserted %d related compounds' % (len(df) - len(cmpd_data)))
    return df
```
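To sanity-check the expansion logic on a toy frame, the core of the loop can be exercised standalone. The identifiers and intensities below are fabricated, and the relation dict is a stub standing in for what `get_chebi_relation_dict()` returns:

```python
import pandas as pd

# Toy compound matrix indexed by ChEBI Identifier (values are fabricated)
cmpd_data = pd.DataFrame(
    {'sample1': [100.0, 50.0], 'sample2': [80.0, 60.0]},
    index=pd.Index(['15377', '16236'], name='Identifier'))

# Stub relation dict; the real one comes from get_chebi_relation_dict()
chebi_rel_dict = {'15377': ['29356', '44819']}

original = set(cmpd_data.index)
rows = []
for identifier, row in cmpd_data.iterrows():
    rows.append(row)
    for related in chebi_rel_dict.get(identifier, []):
        if related not in original:
            copied = row.copy()
            copied.name = related  # same intensities, new related identifier
            rows.append(copied)

expanded = pd.DataFrame(rows)
expanded.index.name = 'Identifier'
print(len(expanded))  # 2 original rows + 2 related rows = 4
```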
I added the duplicate check as a separate function that can be called after the above (or even separately if we want). It groups the rows in the dataframe by the 'Identifier' column. If there are multiple rows in the same group (same compound, different peak data), we keep the row with the largest sum of intensities across the entire row and delete the rest. I'm not sure selecting on this criterion is the best thing to do, but let's go with it for now.
```python
def remove_dupes(df):
    df = df.reset_index()

    # group df by the 'Identifier' column
    to_delete = []
    grouped = df.groupby(df['Identifier'])
    for identifier, group_df in grouped:
        # if there are multiple rows sharing the same identifier
        if len(group_df) > 1:
            # remove the 'Identifier' column from the grouped df
            # since it can't be summed
            group_df = group_df.drop('Identifier', axis=1)

            # find the row with the largest sum across the row in the group
            idxmax = group_df.sum(axis=1).idxmax()

            # mark all the rows in the group for deletion,
            # except the one with the largest sum
            temp = group_df.index.tolist()
            temp.remove(idxmax)
            to_delete.extend(temp)

    # actually do the deletion here
    logger.info('Removing %d rows with duplicate identifiers' % len(to_delete))
    df = df.drop(to_delete)
    df = df.set_index('Identifier')
    return df
```
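The same keep-the-largest-sum rule can be written more compactly with a grouped `idxmax`, which may also be faster than looping over groups on a 30,000-row frame. A sketch on toy data (fabricated values):

```python
import pandas as pd

# Two rows share the same identifier (same compound, different peaks)
df = pd.DataFrame(
    {'Identifier': ['15377', '15377', '16236'],
     'sample1': [10.0, 100.0, 50.0],
     'sample2': [20.0, 90.0, 60.0]})

# For each identifier, keep only the row with the largest total intensity,
# which is the same rule remove_dupes applies above
sums = df.drop('Identifier', axis=1).sum(axis=1)
keep = sums.groupby(df['Identifier']).idxmax()
deduped = df.loc[keep].set_index('Identifier')
print(len(deduped))  # 2
```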
I'm confused why you are removing these duplicates. PALS was set up to use all of the peaks that could potentially be a compound. Choosing the one with the largest sum of intensities doesn't make sense to me at all - why is that most likely to be the actual compound?
Users are expected to provide a clean input (with one compound per row). The check to remove duplicates is only there when the users fail to do that.
Done by @kmcluskey
Related to https://github.com/glasgowcompbio/WebOmics/issues/70
@kmcluskey has implemented some code to pull related ChEBI IDs from the CSV. We should incorporate it into this package (as an option when mapping) to get more hits.