twitter / AnomalyDetection

Anomaly Detection with R
GNU General Public License v3.0
3.57k stars 779 forks source link

Fastest way to combine two tables on common columns (based on nearest timestamp and same ID) #115

Closed hscj87 closed 3 years ago

hscj87 commented 3 years ago

I am trying to combine two dataframes (~10 Million rows) on column timestamp and ID. Due to both dataframes dont have the same timestamp for same ID, i need to use a function called 'map_RSRP' to combine both using conditions. However below codes consume very very long time (few days time) to load the 10 Millions rows of dataframe using my personal notebook.

Is there any method/codes can be used to speed up the processing for such big dataframe?

Dataframe RTT columns: ['TIMESTAMP', 'ID', 'RTT'] Dataframe RSRP columns: ['TIMESTAMP', 'ID', 'RSRP'] Result dataframe columns: ['TIMESTAMP', 'ID', 'RTT', 'RSRP']

Below are my codes: `` def map_RSRP(timestamp, id): rsrp = np.nan time_diff = np.nan

past_5mins = timestamp - timedelta(minutes = 5)
future_5mins = timestamp + timedelta(minutes = 5)

#Get the RSRP data for same ID with timestamp at 5 minutes difference from RTT timestamp
df_rsrp = df_RSRP.loc[(df_RSRP['ID'] == id) & 
                   (df_RSRP['TIMESTAMP'] >= past_5mins) & 
                   (df_RSRP['TIMESTAMP'] <= future_5mins)]

df_rsrp['Time Difference (minutes)'] = ((df_rsrp['TIMESTAMP'] - timestamp).dt.total_seconds()/60).astype(int)
if len(df_rsrp) > 0:
    time_nearest = abs(df_rsrp['Time Difference (minutes)']).min()
    df_rsrp = df_rsrp[abs(df_rsrp['Time Difference (minutes)']) == time_nearest]
    rsrp = df_rsrp['RSRP'].mean()
return rsrp

Result dataframe columns: ['TIMESTAMP', 'ID', 'RTT', 'RSRP']

df_rtt['RSRP'] = df_rtt.apply(lambda x : pd.Series(map_RSRP(x['TIMESTAMP'],x['ID'])), axis ``