reginabarzilaygroup / Sybil

Deep Learning for Lung Cancer Risk Prediction using LDCT
MIT License

Process of training sybil #52

Open Malaikah-Javed opened 3 weeks ago

Malaikah-Javed commented 3 weeks ago

Hey! I am going to train Sybil on a subset of the NLST dataset. I have the metadata and annotation JSON files.

What is the step by step process of Training Sybil?

I tried to run the train.py file but it gave no output.

And where will the newly trained model be saved after training? How do I evaluate it?

pgmikhael commented 3 weeks ago

Hi,

Please refer to the full train doc here to see the arguments that you need to pass to train.py.

The --save_dir argument determines where the model is saved. The full list of arguments (some of which may be irrelevant for training) is in parsing.py.

You can evaluate by running train.py but with the --test flag (and remove --train). You will need to set the path to your saved model. Alternatively, you can run the command-line class of Sybil, using your trained model, and the evaluate function.
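Putting those pieces together, a training and evaluation invocation might look like the following. Only --train, --test, and --save_dir are confirmed by this thread; the dataset flag name and all paths are placeholders, so check parsing.py for the exact argument names your version expects.

```shell
# Train: --save_dir controls where checkpoints land (paths are placeholders).
python train.py \
    --train \
    --dataset_file_path /path/to/nlst_dataset.json \
    --save_dir /path/to/checkpoints

# Evaluate: same script, swap --train for --test and point it at the saved model.
python train.py \
    --test \
    --dataset_file_path /path/to/nlst_dataset.json \
    --save_dir /path/to/checkpoints
```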

Malaikah-Javed commented 3 weeks ago

Thank you!

Malaikah-Javed commented 2 weeks ago

Which arguments are irrelevant for training on an NLST subset? Which ones can I remove?

The NLST dataset has patient folders, and inside those folders are more folders named T0, T1, T2 (study year). How do I give the path to the images?

Malaikah-Javed commented 2 weeks ago
  1. "a JSON file with metadata for each DICOM file" — which metadata is this? Is it necessary for training?

  2. "Run [create_nlst_metadata_json.py] with the appropriate file paths as obtained from NLST.

     - source_json_path: the file obtained from step 1 above
     - output_json_path: the output JSON to be used as the dataset_file_path argument when training
     - png_path_replace_pattern: the pattern to replace in the DICOM file paths with the file paths for the PNG files"

This is from here. What exactly are all these files required for? I'm going to use DICOM images for my training.

Malaikah-Javed commented 2 weeks ago

Could you also tell me which files these are, exactly? I have these files under different names, so I might be giving the wrong input, especially nlst_metadata_csv and nlst_imagedata_csv.

parser.add_argument('--nlst_abnormalities_csv', type=str, default='/Mounts/rbg-storage1/datasets/NLST/package-nlst-564.2020-01-30/NLST_564/nlst_564.delivery.010220/nlst_564_ct_ab_20191001.csv')
parser.add_argument('--nlst_metadata_csv', type=str, default='/Mounts/rbg-storage1/datasets/NLST/package-nlst-564.2020-01-30/NLST_564/nlst_564.delivery.010220/nlst_564_prsn_20191001.csv')
parser.add_argument('--nlst_imagedata_csv', type=str, default='/Mounts/rbg-storage1/datasets/NLST/package-nlst-564.2020-01-30/NLST_564/nlst_564.delivery.010220/nlst_564_ct_image_info_20191001.csv')
parser.add_argument('--test_google_splits', type=str, default='/Mounts/rbg-storage1/datasets/NLST/Shetty_et_al(Google)/TEST_41591_2019_447_MOESM5_ESM.xlsx')

pgmikhael commented 2 weeks ago

Hi,

The JSON is created by parsing every dicom and collecting their paths in a hierarchical structure (participant -> exams -> series). We parse each individual dicom to preprocess the metadata (e.g., extract the pixel spacing, slice thickness, etc.). We also converted the DICOMs to PNG. If you do not follow these steps (your questions 1 & 2), then there are sections of the code that you need to modify / comment out.
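The hierarchical structure described above (participant -> exams -> series, each series holding a list of DICOM paths) can be sketched as follows. Field names here are illustrative, not Sybil's exact schema, and in practice the per-file metadata (pixel spacing, slice thickness, etc.) would be extracted from each DICOM header, e.g. with pydicom.

```python
import json
from collections import defaultdict

def build_hierarchy(records):
    """records: iterable of (pid, exam_id, series_id, dicom_path) tuples."""
    # Group paths by participant, then exam, then series.
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for pid, exam, series, path in records:
        tree[pid][exam][series].append(path)
    # Flatten into the nested list-of-dicts shape used for a dataset JSON.
    dataset = []
    for pid, exams in tree.items():
        dataset.append({
            "pid": pid,
            "exams": [
                {
                    "exam_id": exam,
                    "series": [
                        {"series_id": sid, "paths": sorted(paths)}
                        for sid, paths in series_map.items()
                    ],
                }
                for exam, series_map in exams.items()
            ],
        })
    return dataset

records = [
    ("100001", "T0", "s1", "/data/100001/T0/s1/0001.dcm"),
    ("100001", "T0", "s1", "/data/100001/T0/s1/0002.dcm"),
    ("100001", "T1", "s2", "/data/100001/T1/s2/0001.dcm"),
]
dataset = build_hierarchy(records)
json.dumps(dataset)  # serializable as the dataset JSON
```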

The NLST CSV files are obtained from their release. Our data was obtained in 2019 so they might have updated the names / restructured some of these files.

Malaikah-Javed commented 2 weeks ago

Thank you! Do I need to comment it out in nlst.py and the loader files as well, or only in the create_nlst_dataset_json.py file? Are PNGs involved throughout the code?

Malaikah-Javed commented 2 weeks ago

And what are annotation masks? The mask array/list kept having "None" values, which causes errors/warnings in downstream functions (e.g. imsqueeze(), reshape_images()).

pgmikhael commented 1 week ago

Hi,

If you're not using PNGs, then you may need to modify the dataset file. For the loaders, there is a dicom-specific loader (with pydicom) you can use instead of the default (cv2-based) loader.

The annotation masks are binary masks corresponding to manually constructed bounding boxes around some of the cancers. Their collection is described in the paper, and the annotation file is available in the same google drive as other metadata.
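A minimal sketch of what such a mask looks like: a bounding box rasterized to a binary grid, with a guard for slices that have no annotation (the likely source of the None values mentioned above). The function name and mask shape are assumptions for illustration, not Sybil's actual code.

```python
def bbox_to_mask(bbox, height, width):
    """bbox: (x1, y1, x2, y2) in pixel coords, or None for an unannotated slice."""
    mask = [[0] * width for _ in range(height)]
    if bbox is None:
        # No annotation on this slice: return an all-zero mask instead of None,
        # so downstream reshaping code always sees an array of the same shape.
        return mask
    x1, y1, x2, y2 = bbox
    # Fill the (clipped) box region with ones.
    for y in range(max(0, y1), min(height, y2)):
        for x in range(max(0, x1), min(width, x2)):
            mask[y][x] = 1
    return mask

annotated = bbox_to_mask((2, 1, 5, 3), height=4, width=6)
unannotated = bbox_to_mask(None, height=4, width=6)
```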

Malaikah-Javed commented 1 week ago

Alright thank you

Malaikah-Javed commented 5 days ago

In the NLST dataset, Study Year of Diagnosis is marked as T0, T1, T2 etc. There are different study years for patients. For example, T2 was the year in which one patient was diagnosed with cancer. Does that mean that the patient did not have lung cancer in the previous years T0 and T1?

If the patient didn't have cancer then, how did you label the training data's ground truth? And how would you give the corresponding paths? Because in the CSV files, ground truth is mentioned using PID, not study years.

josephcn932342 commented 4 days ago

Hi, Thank you for your efforts in this project!

I recently downloaded the package-nlst-780.2021-05-28.zip file from the NLST website. However, I noticed that it only contains a limited number of files and does not include files like "nlst_564_ct_image_info_20191001.csv" mentioned in this discussion.

Could you please let me know where I can access the missing files, or if there’s an alternative source? Any guidance would be greatly appreciated!

pgmikhael commented 4 days ago

There are different study years for patients. For example, T2 was the year in which one patient was diagnosed with cancer. Does that mean that the patient did not have lung cancer in the previous years T0 and T1? If the patient didn't have cancer then, then how did you manage to label the training data's ground truth?

They technically could have had cancer earlier, but for our purposes, we use the biopsy-confirmed cancer date to compute time to cancer. Otherwise, we use the time to the negative follow-up scans to define the negative cases.
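That labeling rule can be sketched as a small function. Names and the year-granularity are illustrative assumptions, not Sybil's exact fields: positives get their horizon from the biopsy-confirmed diagnosis year, negatives are censored at the last negative follow-up.

```python
def label_exam(screen_year, cancer_year=None, last_followup_year=None):
    """Return (is_positive, time_horizon_in_years) for one screening exam."""
    if cancer_year is not None and cancer_year >= screen_year:
        # Positive: time to biopsy-confirmed cancer, even if the scan itself
        # predates the diagnosis (e.g. a T0 scan for a T2 diagnosis).
        return True, cancer_year - screen_year
    # Negative: censored at the last negative follow-up scan.
    return False, (last_followup_year or screen_year) - screen_year

# Patient diagnosed at study year T2: their T0 scan is a positive with a
# 2-year horizon, even though no cancer was confirmed at T0.
label_exam(screen_year=0, cancer_year=2)          # -> (True, 2)
label_exam(screen_year=0, last_followup_year=6)   # -> (False, 6)
```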

And how would you give the corresponding paths? Because in the CSV files, ground truth is mentioned using PID, not study years.

If you follow the nlst.py script, you can see that each series is associated with a list of paths to the CT.

Could you please let me know where I can access the missing files, or if there’s an alternative source?

This release is recent and is different from what was released to us in 2019. Most likely, you should write an alternative to create_nlst_metadata_json.py if you want to obtain a similar dataset file from these CSVs. At the end of the day, it's just about putting all the relevant data into one JSON.
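A minimal alternative along those lines: join the per-person and per-image NLST CSVs on participant ID and dump one JSON. The column names (pid, study_yr, seriesuid) are guesses based on this thread; adjust them to the headers in your NLST release.

```python
import csv
import io
import json
from collections import defaultdict

def merge_nlst_csvs(person_csv_text, image_csv_text):
    """Join person-level and image-level CSV rows on pid into one dataset list."""
    people = {row["pid"]: row for row in csv.DictReader(io.StringIO(person_csv_text))}
    images = defaultdict(list)
    for row in csv.DictReader(io.StringIO(image_csv_text)):
        images[row["pid"]].append(row)
    return [
        {"pid": pid, "person": person, "images": images.get(pid, [])}
        for pid, person in people.items()
    ]

# Tiny synthetic example standing in for the real NLST release files.
person_csv = "pid,age\n100001,61\n"
image_csv = "pid,study_yr,seriesuid\n100001,T0,1.2.3\n100001,T1,4.5.6\n"
dataset = merge_nlst_csvs(person_csv, image_csv)
json.dumps(dataset)  # one JSON with all the relevant data
```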