seoulsky-field / CXRAIL-dev

MIT License

Features & Discussion: BRAX data train/valid/test split #103

Closed seoulsky-field closed 1 year ago

seoulsky-field commented 1 year ago

What

Provide a guideline for splitting the BRAX dataset into train, validation, and test sets.

Why

As in any deep learning workflow, the dataset has to be split into training data and test data. However, the BRAX dataset does not provide a default split to users.

How

When I split the data by the number of NaN values in each row, the class distribution was not balanced across the splits, especially for Edema. So, I want to suggest a new method.
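One quick way to spot this kind of imbalance is to compare the positive rate of a label across the splits. The following is a minimal sketch on made-up rows: the `split` key, the helper name `positive_rate_per_split`, and the toy values are assumptions for illustration and do not reflect the actual BRAX metadata format or the split used in the notebook.

```python
from collections import Counter


def positive_rate_per_split(rows, label):
    """Return {split_name: fraction of rows in that split where label == 1}."""
    totals = Counter()
    positives = Counter()
    for row in rows:
        totals[row["split"]] += 1
        if row[label] == 1:
            positives[row["split"]] += 1
    return {name: positives[name] / totals[name] for name in totals}


# Hypothetical toy rows: 8 train images (2 Edema-positive), 4 test images (0 positive).
rows = (
    [{"split": "train", "Edema": 1}] * 2
    + [{"split": "train", "Edema": 0}] * 6
    + [{"split": "test", "Edema": 0}] * 4
)

rates = positive_rate_per_split(rows, "Edema")
print(rates)  # {'train': 0.25, 'test': 0.0}
```

A rare class such as Edema can easily end up with no positives in one split, which is what a check like this makes visible.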

I'm currently implementing step 6 of the process. If you have any opinions or questions, feel free to ask.

kdg1993 commented 1 year ago

Thanks for taking on a job that is necessary to run BRAX!

If I may, I have a few questions:

  1. Could you please explain the meaning of 'used just once'? I think this is the key factor in this process, but it is a bit hard for me to follow.
  2. Is there any reason why you didn't consider diseases other than Edema?
  3. Is there any reason why the validation set is smaller than the test set?
  4. I tried to find the sklearn-multirun library but couldn't. Could you give me a link to its documentation? Also, I think that if you consider only the patient ID and the Edema class, sklearn's StratifiedGroupKFold will work.
  5. The target train : validation : test ratio is 0.64 : 0.16 : 0.2, and it holds when the lateral images are removed, right? If so, does that mean the ratio including lateral images is not the same as 0.64 : 0.16 : 0.2?

seoulsky-field commented 1 year ago

@kdg1993

  1. "Used just once" refers to a patient ID that has only one frontal image.
  2. Only 25 images have a positive value in the Edema column. So, I thought it was necessary to account for it in a stratified split.
  3. I just used a common split ratio: train+valid : test = 0.8 : 0.2, and then train : valid = 0.8 : 0.2 again. That is why the validation set is smaller than the test set. However, this may be changed so that the validation and test sets have a similar size.
  4. http://scikit.ml/stratification.html is the skmultilearn library, and I'm now using the iterative-stratification library (https://github.com/trent-b/iterative-stratification). I agree with your opinion; however, I am not considering only the Edema class.
  5. Yes.
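The ratio arithmetic in answer 3 can be sketched as a nested split: holding out 0.2 as test and then 0.2 of the remainder as validation gives 0.8 × 0.8 = 0.64 train, 0.8 × 0.2 = 0.16 validation, and 0.2 test. The sketch below assumes the split is made at the patient level, so that all images of one patient land in the same set. The function name `nested_patient_split` and the toy IDs are hypothetical, and the sketch does not stratify by label, which is what the iterative-stratification library is used for in this thread.

```python
import random


def nested_patient_split(patient_ids, test_frac=0.2, valid_frac=0.2, seed=0):
    """Split unique patient IDs into (train, valid, test) lists.

    First hold out `test_frac` of the patients as test, then hold out
    `valid_frac` of the remaining patients as validation.
    """
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)

    n_test = round(len(unique_ids) * test_frac)
    test, rest = unique_ids[:n_test], unique_ids[n_test:]

    n_valid = round(len(rest) * valid_frac)
    valid, train = rest[:n_valid], rest[n_valid:]
    return train, valid, test


# Hypothetical toy IDs standing in for patient identifiers.
ids = [f"patient_{i}" for i in range(100)]
train, valid, test = nested_patient_split(ids)
print(len(train), len(valid), len(test))  # 64 16 20
```

Because the split is over patient IDs rather than images, the image-level ratio can drift from 0.64 : 0.16 : 0.2 when patients have different numbers of images.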

seoulsky-field commented 1 year ago

I uploaded a notebook file that walks through the BRAX dataset split step by step. If you have any questions or opinions after checking it, please feel free to ask.