illinois-or-research-analytics / cm_pipeline

Pipeline that uses an improved version of CM for generating well-connected clusters
GNU General Public License v3.0

workflow.py uses the original input network instead of the cleaned network #30

Closed MinhyukPark closed 1 year ago

MinhyukPark commented 1 year ago

Summary

When commands are generated from the pipeline.json files, the cleaned-up network is not used consistently across all stages.

For example, the _cleanup.tsv network suffix is used to designate the output of the cleanup stage, a stage which is mandatory due to the assert statement. This network is correctly used in the first clustering stage. However, it is not used in the connectivity_modifier command.

This is probably because the stages are initialized first and the cleaned_file is fetched only afterwards. The cleaned_file is never used to re-initialize the stages; in fact, it seems to be overridden right away by self.input_file, which would make the relevant for-loop a no-op.
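
The suspected pattern can be sketched roughly as follows. This is a guess at the shape of the problem, not the actual workflow.py code: the `Stage` class, `build_stages_suspected`, and the file names are all hypothetical, and the sketch ignores the fact that the first clustering stage already picks up the cleaned file.

```python
class Stage:
    """Hypothetical stand-in for a pipeline stage (not the real cm_pipeline class)."""

    def __init__(self, name, network):
        self.name = name
        self.network = network

    def set_network(self, network):
        self.network = network


def build_stages_suspected(stage_names, input_file, cleaned_file):
    # Stages are created first, each pointing at the raw input file.
    stages = [Stage(name, input_file) for name in stage_names]
    # Suspected bug: the value meant to carry the cleaned file is replaced
    # by the raw input file before the loop runs.
    network = cleaned_file
    network = input_file
    for stage in stages:
        # Every stage already holds input_file, so this changes nothing.
        stage.set_network(network)
    return stages


stages = build_stages_suspected(
    ["cleanup", "clustering", "connectivity_modifier"],
    "network.tsv",
    "network_cleanup.tsv",
)
for stage in stages:
    print(stage.name, stage.network)
```

Under these assumptions every stage, including connectivity_modifier, still points at the raw input file after the loop.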

How to replicate the issue

Run the pipeline with this json file. You'll notice that the input file is directly used in the CM stage, and not the cleaned up file.

{
    "title": "oc-ikc-k10",
    "name": "oc",
    "input_file": "/shared/open_citations_final_v2/oc_integer_cleaned_el.tsv",
    "output_dir": "output/",
    "algorithm": "ikc",
    "k": 10,
    "iterations": 1,
    "stages": [
        {
            "name": "cleanup"
        },
        {
            "name": "clustering",
            "parallel_limit": 2
        },
        {
            "name": "stats",
            "noktruss": true,
            "parallel_limit": 2
        },
        {
            "name": "filtering",
            "scripts": [
                "./scripts/subset_graph_nonetworkit_treestar.R",
                "./scripts/make_cm_ready.R"
            ]
        },
        {
            "name": "connectivity_modifier",
            "memprof": true,
            "threshold": "1log10",
            "nprocs": 32,
            "quiet": true
        },
        {
            "name": "filtering",
            "scripts": [
                "./scripts/post_cm_filter.R"
            ]
        },
        {
            "name": "stats",
            "noktruss": true,
            "parallel_limit": 2
        }
    ]
}

vikramr2 commented 1 year ago

This is a good note. Marking this as higher priority.

vikramr2 commented 1 year ago

Added the following snippet to workflow.py. Tested as well. Should fix the issue:

# Set network files post cleanup to the cleaned file
post_cleaned = False
for stage in self.stages:
    if post_cleaned:
        stage.set_network(cleaned_file)
    if stage.name == 'cleanup':
        post_cleaned = True
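
To see the effect of this loop in isolation, here is a minimal, self-contained sketch. The `Stage` class, the `apply_cleaned_network` wrapper, and the file names are assumptions made for illustration; only the loop body mirrors the snippet above.

```python
class Stage:
    """Hypothetical minimal stage; the real class in workflow.py holds more state."""

    def __init__(self, name, network):
        self.name = name
        self.network = network

    def set_network(self, network):
        self.network = network


def apply_cleaned_network(stages, cleaned_file):
    """Point every stage that comes after 'cleanup' at the cleaned file."""
    post_cleaned = False
    for stage in stages:
        if post_cleaned:
            stage.set_network(cleaned_file)
        if stage.name == 'cleanup':
            # The cleanup stage itself keeps reading the raw input.
            post_cleaned = True
    return stages


names = ['cleanup', 'clustering', 'stats', 'filtering', 'connectivity_modifier']
stages = apply_cleaned_network(
    [Stage(name, 'network.tsv') for name in names],
    'network_cleanup.tsv',
)
for stage in stages:
    print(stage.name, stage.network)
```

Because the flag is set only after the cleanup stage has been visited, cleanup keeps the original input while every later stage, including connectivity_modifier, receives the cleaned file.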