pnnl / deimos

BSD 3-Clause "New" or "Revised" License
31 stars 6 forks source link

partitions.zipmap() function: Makes the script restart #25

Open jessieolough opened 1 month ago

jessieolough commented 1 month ago

I am trying to automate DEIMoS' Reference-based Alignment steps, where the script loops through the .d5 files in the directory and aligns the files to a specified file. This works well with the procedure outlined in the Reference-based Alignment Documentation right up until the partitions.zipmap() function. When this function is run, the script will automatically restart/loop back to the beginning and will not allow the rest of the steps (SVR modelling, Applying the realignment) to take place.

I've tried adding if __name__ == "__main__": into the script (following guidance from here and here), but this makes no difference.

Do you know what could be causing this? Is there anything else I can add into the code to possibly help resolve this?

This function and the rest of the steps work fine when run directly in the terminal step-by-step, though this process will need to be automated in a script for it to be useful for us.

Any help would be much appreciated.

Many thanks, Jess

smcolby commented 1 month ago

The code ran fine in a Jupyter notebook as well, but after copy/pasting to a script I'm seeing this behavior as well. I have never seen something like this before (with deimos or Python in general), as it's running the entire script again for each of the multiple processes.

In the short term, setting processes=1 for this step seems to avoid the issue.

smcolby commented 1 month ago

Update: I also added the if __name__ == "__main__": to the script, which then executed successfully with processes=4.

jessieolough commented 1 month ago

Hi Sean,

Thanks for getting back about this. I've tried adding in the processes=1 or processes=4, though the issue still remains. This may be on my part, as I am not 100% sure where I should put the if __name__ == "__main__": and processes= in the script. Would you be able to advise on this, please?

Should the if __name__ == "__main__": go directly before the problematic function (as shown in the example code below) or should the it encase the entire loop (place just before for x in files:)? Or, should it go somewhere completely different? Same question for the processes=.

files = #list of files in the directory
reference_file = #reference data file

for x in files:

    ##Exclude reference file from the alignment steps
    if x == reference_file:
        pass

    #Perform Alignment on the rest of the files
    else: 
        data = {}
        #File to be aligned
        data['A'] = deimos.load(x, key='ms1')
        #Reference file
        data['B'] = deimos.load(reference_file, key='ms1')

        # Collapse
        a_rt = deimos.collapse(data['A'], keep='retention_time').sort_values(by='retention_time')
        b_rt = deimos.collapse(data['B'], keep='retention_time').sort_values(by='retention_time')
        # Visualize
        fig, ax = plt.subplots(1, dpi=150, facecolor='w')
        ax.fill_between(a_rt['retention_time'], a_rt['intensity'], color='C0', alpha=0.5, label='A')
        ax.fill_between(b_rt['retention_time'], b_rt['intensity'], color='C3', alpha=0.5, label='B')
        ax.set_xlabel('Retention Time', fontweight='bold')
        ax.set_ylabel('Intensity', fontweight='bold')
        ax.set_xlim(0, None)
        ax.set_ylim(0, None)
        plt.legend()
        plt.tight_layout()
        #Save memory space
        plt.close()

        del [a_rt, b_rt, fig, ax]

        # Perform peak detection
        peaks = {}
        peaks['A'] = deimos.peakpick.persistent_homology(deimos.threshold(data['A'], threshold=128),
                                                 dims=['mz', 'drift_time', 'retention_time'])
        peaks['B'] = deimos.peakpick.persistent_homology(deimos.threshold(data['B'], threshold=128),
                                                 dims=['mz', 'drift_time', 'retention_time'])
        # Downselect by persistence
        peaks['A']['persistence_ratio'] = peaks['A']['persistence'] / peaks['A']['intensity']
        peaks['A'] = deimos.threshold(peaks['A'], by='persistence_ratio', threshold=0.75)

        peaks['B']['persistence_ratio'] = peaks['B']['persistence'] / peaks['B']['intensity']
        peaks['B'] = deimos.threshold(peaks['B'], by='persistence_ratio', threshold=0.75)

        # Partition
        partitions = deimos.partition(deimos.threshold(peaks['A'], threshold=1E3),
                                      split_on='mz',
                                      size=1000,
                                      overlap=0.25)

        # Match
        if __name__ == "__main__": #<---------- Should it go here?
            a_matched, b_matched = partitions.zipmap(deimos.alignment.match, deimos.threshold(peaks['B'], threshold=1E3),
                                                     dims=['mz', 'drift_time', 'retention_time'],
                                                     tol=[20E-6, 0.03, 2], relative=[True, True, False],
                                                     processes=4)

        # Visualize
        fig, ax = plt.subplots(1, dpi=150, facecolor='w')
        ax.scatter(a_matched['retention_time'], b_matched['retention_time'], s=2)
        ax.plot([0, 25], [0, 25], linewidth=1, linestyle='--', color='k')
        ax.set_xlabel('Retention Time (A)', fontweight='bold')
        ax.set_ylabel('Retention Time (B)', fontweight='bold')
        ax.set_xlim(0, 25)
        ax.set_ylim(0, 25)
        plt.tight_layout()
        #Save memory space
        plt.close()

        # SVR spline
        spl = deimos.alignment.fit_spline(a_matched, b_matched, align='retention_time', kernel='rbf', C=1000)
        newx = np.linspace(0, a_matched['retention_time'].max(), 1000)

        # Visualize
        fig, ax = plt.subplots(1, dpi=150, facecolor='w')
        ax.plot(newx, spl(newx), c='black', linewidth=1, linestyle='--')
        ax.scatter(a_matched['retention_time'], b_matched['retention_time'], s=2)
        ax.set_xlabel('Retention Time (A)', fontweight='bold')
        ax.set_ylabel('Retention Time (B)', fontweight='bold')
        ax.set_xlim(0, 25)
        ax.set_ylim(0, 25)
        plt.tight_layout()
        plt.close()

        #Apply alignment
        data['A_aligned'] = data['A'].copy()
        data['A_aligned']['retention_time'] = spl(data['A_aligned']['retention_time'])
        # Collapse
        a_rt_aligned = deimos.collapse(data['A_aligned'], keep='retention_time').sort_values(by='retention_time')
        # Visualize
        fig, ax = plt.subplots(1, dpi=150, facecolor='w')
        ax.fill_between(a_rt_aligned['retention_time'], a_rt_aligned['intensity'], color='C0', alpha=0.5, label='spl(A)')
        ax.fill_between(b_rt['retention_time'], b_rt['intensity'], color='C3', alpha=0.5, label='B')
        ax.set_xlabel('Retention Time', fontweight='bold')
        ax.set_ylabel('Intensity', fontweight='bold')
        ax.set_xlim(0, None)
        ax.set_ylim(0, None)
        plt.legend()
        plt.tight_layout()
        plt.close()

Many thanks, Jess

smcolby commented 1 month ago

It goes at the top, after your imports:

import deimos

if __name__ == "__main__":
    print('all code here!')

A good explanation here.

jessieolough commented 2 weeks ago

The function works and the script is running smoothly now, thanks so much!