
ncoudray / DeepPATH

Classification of Lung cancer slide images using deep-learning


data preprocessing #48

Closed zhenzhenyang-psu closed 5 years ago

zhenzhenyang-psu commented 5 years ago

Hello Nicolas, I have two questions regarding the data preprocessing step. First, why is the patient ID set to 12? An image downloaded for lung cancer is named, for instance, "TCGA-44-6147-01B-05-BS5.B838E2DC-8869-4C72-9F1D-A066FF307579.svs".

The second question is how to get the ".json" file. The file I obtained for the lung cancer samples is only 1.8 MB, much smaller than the file posted under the "example" folder in the software. The manifest file I obtained from the TCGA website is the same as the one listed in the software.

Thanks for your time. Zhenzhen

ncoudray commented 5 years ago

Hi Zhenzhen,

For the barcode, please see https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/

The json file was obtained from the legacy portal (you must first put all the svs images in your cart, then download the metadata).

Best, Nicolas
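As an illustration of the barcode point above: in a TCGA barcode, the first 12 characters (project, tissue source site, participant) identify the patient, so a patient ID length of 12 keeps exactly that prefix of the slide file name. The snippet below is a minimal sketch, not taken from the DeepPATH code; the helper name `tcga_patient_id` and its `id_length` parameter are hypothetical.

```python
import os


def tcga_patient_id(filename: str, id_length: int = 12) -> str:
    """Return the first `id_length` characters of a slide's base file name.

    Hypothetical helper for illustration; with TCGA barcodes, the first
    12 characters cover project, tissue source site, and participant.
    """
    base = os.path.basename(filename)
    return base[:id_length]


# File name quoted in the question above.
name = "TCGA-44-6147-01B-05-BS5.B838E2DC-8869-4C72-9F1D-A066FF307579.svs"
print(tcga_patient_id(name))  # TCGA-44-6147
```

Two slides from the same participant share this 12-character prefix even when the rest of the barcode (sample, vial, portion, slide) differs.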

zhenzhenyang-psu commented 5 years ago

Hi Nicolas, Thanks for your reply. Regarding the json file: the json file provided at DeepPATH-master/DeepPATH_code/example/metadata.cart.2017-03-02T00_36_30.276824.json has a file size of 17M. However, when I follow the instructions you listed, the file I get is only 1.8M. The attachment below shows how I downloaded the metadata. I worry that this discrepancy could cause an error in the program, because I am going to follow the same approach to download the json file for another type of cancer.

Could you check it for me? Thanks a lot! Zhenzhen

Attachment: json_file_explain.pdf

zhenzhenyang-psu commented 5 years ago

Hi Nicolas, I looked at the code to try to understand what the "Patient ID" argument means when executing the script "DeepPATH-master/DeepPATH_code/00_preprocessing/0d_SortTiles.py". However, I still don't understand how it relates to the barcode or the name of the svs image file.

The way I understand it, it seems if I just want to split into train, test, and validation sets, specifying the patientID is irrelevant to my purpose. What do you think?

The attached screenshot shows where PatientID occurs in the code.

Sorry to bother you, but I would really appreciate your reply. Thanks a lot. Best, Zhenzhen

Attachment: Screen Shot 2019-09-10 at 10 44 01 AM

ncoudray commented 5 years ago

As mentioned earlier and in the README file, for the json, you must use the legacy portal, not the new one (the screenshot you sent shows the new portal, I think): https://portal.gdc.cancer.gov/legacy-archive/search/f

For the patient ID, it is up to you whether to split the dataset per patient or not; it matters when you have multiple entries for a given patient.

HTH, Best, Nicolas
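To make the per-patient split concrete, here is a minimal sketch of the idea, assuming the patient is identified by the first 12 characters of the slide name. It is not the logic of 0d_SortTiles.py; the function `split_by_patient`, the 70/15/15 fractions, and the slide names in the usage example are all hypothetical. Slides are grouped by patient prefix first, and whole patients are then assigned to one set, so no patient contributes to both training and evaluation.

```python
import random
from collections import defaultdict


def split_by_patient(slide_names, fractions=(0.7, 0.15, 0.15),
                     id_length=12, seed=0):
    """Assign slides to train/valid/test so each patient is in only one set.

    Hypothetical sketch: the patient ID is assumed to be the first
    `id_length` characters of the slide name.
    """
    # Group slide names by their patient prefix.
    by_patient = defaultdict(list)
    for name in slide_names:
        by_patient[name[:id_length]].append(name)

    # Shuffle patients (not slides) reproducibly.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Cut the shuffled patient list into three contiguous groups.
    n = len(patients)
    n_train = int(round(fractions[0] * n))
    n_valid = int(round(fractions[1] * n))
    groups = {
        "train": patients[:n_train],
        "valid": patients[n_train:n_train + n_valid],
        "test": patients[n_train + n_valid:],
    }
    # Expand each patient back into its slides.
    return {k: [s for p in v for s in by_patient[p]]
            for k, v in groups.items()}


# Made-up slide names: 10 patients with 2 slides each.
slides = [f"TCGA-00-{i:04d}-01{s}-01-BS1.svs"
          for i in range(10) for s in "AB"]
split = split_by_patient(slides)
for subset, names in split.items():
    print(subset, len(names))
```

If every patient has a single slide, this reduces to an ordinary random split, which is why the patient ID only matters when a patient has multiple entries.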