liupei101 / Pipeline-Processing-TCGA-Slides-for-MIL

This repo provides an exhaustive pipeline for processing TCGA whole-slide images for downstream multiple instance learning.

Inquiry on Dataset Label File Creation & Clarification on Feature File Count Discrepancy #2

Closed blz822 closed 1 month ago

blz822 commented 1 month ago

Thank you very much for your outstanding work, particularly the comprehensive tutorial on processing The Cancer Genome Atlas (TCGA) Whole-Slide Images (WSIs), which is invaluable for downstream computational tasks like slide classification and survival analysis, especially employing Multiple-Instance Learning (MIL) as the learning paradigm.

Building upon the project introduction, I aspire to take it a step further by leveraging the CLAM framework for renal cell carcinoma (RCC) classification tasks. As outlined in CLAM, each dataset ought to reside as a subfolder (e.g., DATASET_1_DATA_DIR) beneath the DATA_ROOT_DIR, with features extracted for every slide saved as a .pt file within the pt_files folder of said subfolder. Additionally, datasets should be formatted in CSV, mandating at least three columns: case_id, slide_id, and one or more columns for slide-level labels. Here, case_id serves as a unique patient identifier, whereas slide_id corresponds to the name of the extracted feature .pt file, acting as a slide identifier.

Nonetheless, I find myself unclear about the precise methodology for generating dataset label files akin to "tumor_subtyping_dummy_clean.csv". Could you kindly furnish either a sample file or an exhaustive guide outlining the steps and considerations necessary when crafting such CSV files?

Furthermore, following the execution of "Step S04: Feature Extraction from Patches," I encountered the following output: This step produced 937 feature files in the directory /ExpData/tcga_rcc/tiles-20x-s256/feats-CTransPath/pt_files. Meanwhile, Step S03 generated 937 patch files in the directory /ExpData/tcga_rcc/tiles-20x-s256/patches. All slides in the patch directory have undergone processing in this step.

Given that the expected count is 940 feature files, I am concerned that the discrepancy in my generated count of 937 might indicate an anomaly in the process. Could you advise if this deviation is indicative of an issue?

liupei101 commented 1 month ago

Thanks for your attention.

For your first question, on how to generate a dataset label file that can be used in CLAM: it does not take much work to obtain that file, since, as you mentioned, it should be a CSV containing three key columns. I am sorry, but there is no plan to release a guideline on label file generation. The content and format of the label file depend on the downstream task of interest, which is very flexible, so a specific example would not be of much help to general users.
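For readers who want a starting point, here is a minimal sketch of building such a CSV. It is an illustration, not part of this repo or of CLAM. It assumes that each feature file is named after its slide, that slide names begin with the 12-character TCGA patient barcode (e.g. TCGA-XX-XXXX), and that you already have a patient-to-label mapping from your own clinical or project metadata. The function names and the `case_to_label` argument are hypothetical.

```python
import csv
from pathlib import Path


def list_slide_ids(pt_dir):
    """Return the names of all .pt feature files in pt_dir, without the extension."""
    return sorted(p.stem for p in Path(pt_dir).glob("*.pt"))


def build_label_rows(slide_ids, case_to_label):
    """Build (case_id, slide_id, label) rows for slides whose patient has a label.

    Assumption: the first 12 characters of a TCGA slide name are the
    patient barcode, which is used here as case_id.
    """
    rows = []
    for slide_id in slide_ids:
        case_id = slide_id[:12]
        label = case_to_label.get(case_id)
        if label is None:
            continue  # no label known for this patient; leave the slide out
        rows.append({"case_id": case_id, "slide_id": slide_id, "label": label})
    return rows


def write_label_csv(rows, out_csv):
    """Write the rows to a CSV with the three columns described above."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["case_id", "slide_id", "label"])
        writer.writeheader()
        writer.writerows(rows)
```

A typical call would pass `list_slide_ids(...)` on the pt_files folder into `build_label_rows`, then hand the result to `write_label_csv`. The label column name and label values should be adapted to the task at hand.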

For your second question, it seems that your process ran as expected. The remaining three slides are not processed by CLAM because they contain only a single image level, i.e., the highest-resolution level 0, as mentioned in our S03 (just below the block In [2]:). For more details, you can check your running logs or tiles-20x-s256/process_list_autogen.csv.
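To see exactly which slides were skipped, one option is to compare file names between the slide folder and an output folder. The sketch below is only an illustration under a few assumptions: the raw slides are .svs files, the S03 patch files are .h5 files, and each output file shares its name with the slide it came from. The helper names are hypothetical.

```python
from pathlib import Path


def file_stems(directory, suffix):
    """Return the set of file names in directory ending with suffix, extension removed."""
    return {p.stem for p in Path(directory).glob(f"*{suffix}")}


def missing_slides(expected_ids, produced_ids):
    """Return the slide names that were expected but have no output file."""
    return sorted(set(expected_ids) - set(produced_ids))


# Example usage (the slide folder is a placeholder for your own path):
# slides = file_stems("<your slide folder>", ".svs")
# patches = file_stems("/ExpData/tcga_rcc/tiles-20x-s256/patches", ".h5")
# print(missing_slides(slides, patches))
```

If the list printed for the patch folder has three entries and the same comparison against pt_files gives the same three names, the counts of 937 and 940 are consistent with the explanation above.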

blz822 commented 1 month ago

Many thanks for your swift and thorough response!

Regarding the creation of dataset label files for CLAM, I now comprehend that its customization is inherently tied to the specific downstream tasks, hence the absence of a universal guideline makes sense. With your advice in mind, I will tailor the CSV structure to fit my research objectives, ensuring it encompasses essential fields such as case_id, slide_id, and pertinent label columns. Should I encounter any further questions, I'll endeavor to find solutions independently or reach out to you again.

As for the discrepancy in the number of feature files, your clarification has allayed my concerns. Knowing that the 3 unprocessed files are a result of containing only the highest level 0 images aligns with the intended operation of the process. I intend to follow your suggestion by examining the log files and reviewing the tiles-20x-s256/process_list_autogen.csv to gain additional insights and verify that everything is proceeding as planned.

Once more, I am deeply appreciative of your valuable assistance and guidance. Your work forms a robust foundation for my current research endeavors and significantly bolsters my learning trajectory in this domain. I look forward to potential future exchanges and collaborations!

Warmest regards,