Dataset issue - Githubissues

Kannadasa commented 4 years ago

Hi,

I am testing your model, but I am not getting the desired output. I think I am not distributing the data properly between the train and valid folders.

Please let me know how you create the folder structure and load the images for the train and valid datasets. This is for binary classification.

LelisThanos commented 4 years ago

Hello, same issues here, having trouble reproducing your code with loading and distributing images for train and validation datasets.

muhammedtalo commented 4 years ago

You may use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html for splitting the datasets. I have also provided our results for three classes. Please see the main COVID-19 repository.
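A minimal sketch of what splitting with scikit-learn's KFold could look like. The file names below are hypothetical placeholders, not files from the repository, and the shuffle and random_state settings are assumptions rather than the authors' configuration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical file list standing in for the image paths (assumption).
image_paths = np.array([f"img_{i:03d}.png" for i in range(10)])

# 5 folds: each image lands in the validation part exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

folds = []
for fold, (train_idx, valid_idx) in enumerate(kf.split(image_paths)):
    folds.append((train_idx, valid_idx))
    print(f"fold {fold}: {len(train_idx)} train, {len(valid_idx)} valid")
```

Each pair of index arrays can then be used to pick the training and validation images for one run.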

Kannadasa commented 4 years ago

Do we know the actual results of the X-ray images? Or can I assume that all 125 X-ray images inside the Covid-19 folder are COVID-19 positive?

Thanks Kannadasan

muhammedtalo commented 4 years ago

Yes, the X-ray images inside the Covid-19 folder are covid-19 positive. The folder names are given in terms of diagnosis results.

Kannadasa commented 4 years ago

Hi Muhammed,

Do you have the code to implement K-Fold on the datasets?

Thanks Kannadasan

Kannadasa commented 4 years ago

Hi Muhammed, with 125 images in the Covid-19 folder and 500 images in the No_findings folder, aren't we dealing with an imbalanced dataset?

The reason I am asking is that I trained your model using KFold datasets but I am getting only 58% accuracy. Below is the output of one of the iterations. I think something is wrong somewhere in my code.

epoch | train_loss | valid_loss | accuracy | time

0 | 0.003996 | 0.006837 | 1.000000 | 02:26
1 | 0.002480 | 0.004546 | 1.000000 | 02:10
2 | 0.001856 | 0.002552 | 1.000000 | 02:06
3 | 0.001408 | 0.001160 | 1.000000 | 02:04
4 | 0.001097 | 0.000621 | 1.000000 | 02:06
5 | 0.000911 | 0.000315 | 1.000000 | 02:11
6 | 0.000743 | 0.000152 | 1.000000 | 02:10
7 | 0.000620 | 0.000084 | 1.000000 | 02:12
8 | 0.000522 | 0.000066 | 1.000000 | 02:10
9 | 0.000442 | 0.000044 | 1.000000 | 02:09
10 | 0.000372 | 0.000033 | 1.000000 | 02:10
11 | 0.000316 | 0.000022 | 1.000000 | 02:09
12 | 0.000272 | 0.000018 | 1.000000 | 02:10
13 | 0.000233 | 0.000018 | 1.000000 | 02:10
14 | 0.000201 | 0.000017 | 1.000000 | 02:08
15 | 0.000173 | 0.000017 | 1.000000 | 02:10
16 | 0.000149 | 0.000015 | 1.000000 | 02:08
17 | 0.000129 | 0.000014 | 1.000000 | 02:10
18 | 0.000112 | 0.000014 | 1.000000 | 02:06
19 | 0.000097 | 0.000014 | 1.000000 | 02:07
20 | 0.000084 | 0.000015 | 1.000000 | 02:05
21 | 0.000074 | 0.000014 | 1.000000 | 02:07
22 | 0.000064 | 0.000014 | 1.000000 | 02:07
23 | 0.000056 | 0.000011 | 1.000000 | 02:07
24 | 0.000049 | 0.000010 | 1.000000 | 02:07
25 | 0.000043 | 0.000009 | 1.000000 | 02:07
26 | 0.000038 | 0.000009 | 1.000000 | 02:09
27 | 0.000034 | 0.000008 | 1.000000 | 02:10
28 | 0.000030 | 0.000007 | 1.000000 | 02:10
29 | 0.000026 | 0.000007 | 1.000000 | 02:10
30 | 0.000023 | 0.000007 | 1.000000 | 02:10
31 | 0.000020 | 0.000007 | 1.000000 | 02:11
32 | 0.000018 | 0.000007 | 1.000000 | 02:06
33 | 0.000016 | 0.000006 | 1.000000 | 02:07
34 | 0.000014 | 0.000006 | 1.000000 | 02:06
35 | 0.000012 | 0.000006 | 1.000000 | 02:08
36 | 0.000011 | 0.000005 | 1.000000 | 02:07
37 | 0.000010 | 0.000005 | 1.000000 | 02:07
38 | 0.000009 | 0.000005 | 1.000000 | 02:07
39 | 0.000008 | 0.000005 | 1.000000 | 02:10
40 | 0.000007 | 0.000005 | 1.000000 | 02:09
41 | 0.000006 | 0.000005 | 1.000000 | 02:10
42 | 0.000006 | 0.000005 | 1.000000 | 02:08
43 | 0.000005 | 0.000005 | 1.000000 | 02:10
44 | 0.000005 | 0.000004 | 1.000000 | 02:11
45 | 0.000004 | 0.000004 | 1.000000 | 02:12
46 | 0.000004 | 0.000004 | 1.000000 | 02:10
47 | 0.000003 | 0.000004 | 1.000000 | 02:13
48 | 0.000003 | 0.000004 | 1.000000 | 02:06
49 | 0.000003 | 0.000004 | 1.000000 | 02:07
50 | 0.000003 | 0.000004 | 1.000000 | 02:12
51 | 0.000002 | 0.000004 | 1.000000 | 02:10
52 | 0.000002 | 0.000004 | 1.000000 | 02:12
53 | 0.000002 | 0.000004 | 1.000000 | 02:10
54 | 0.000002 | 0.000004 | 1.000000 | 02:10
55 | 0.000002 | 0.000004 | 1.000000 | 02:07
56 | 0.000002 | 0.000003 | 1.000000 | 02:09
57 | 0.000001 | 0.000003 | 1.000000 | 02:10
58 | 0.000001 | 0.000004 | 1.000000 | 02:08
59 | 0.000001 | 0.000004 | 1.000000 | 02:09
60 | 0.000001 | 0.000004 | 1.000000 | 02:10
61 | 0.000001 | 0.000004 | 1.000000 | 02:12
62 | 0.000001 | 0.000004 | 1.000000 | 02:09
63 | 0.000001 | 0.000004 | 1.000000 | 02:09
64 | 0.000001 | 0.000003 | 1.000000 | 02:08
65 | 0.000001 | 0.000003 | 1.000000 | 02:09
66 | 0.000001 | 0.000003 | 1.000000 | 02:10
67 | 0.000001 | 0.000004 | 1.000000 | 02:09
68 | 0.000001 | 0.000004 | 1.000000 | 02:11
69 | 0.000001 | 0.000003 | 1.000000 | 02:08
70 | 0.000001 | 0.000003 | 1.000000 | 02:09
71 | 0.000001 | 0.000003 | 1.000000 | 02:08
72 | 0.000001 | 0.000003 | 1.000000 | 02:08
73 | 0.000001 | 0.000003 | 1.000000 | 02:07
74 | 0.000001 | 0.000003 | 1.000000 | 02:06
75 | 0.000001 | 0.000003 | 1.000000 | 02:08
76 | 0.000001 | 0.000003 | 1.000000 | 02:07
77 | 0.000001 | 0.000003 | 1.000000 | 02:07
78 | 0.000001 | 0.000003 | 1.000000 | 02:07
79 | 0.000001 | 0.000003 | 1.000000 | 02:07
80 | 0.000001 | 0.000003 | 1.000000 | 02:06
81 | 0.000000 | 0.000004 | 1.000000 | 02:08
82 | 0.000000 | 0.000003 | 1.000000 | 02:11
83 | 0.000000 | 0.000003 | 1.000000 | 02:08
84 | 0.000000 | 0.000003 | 1.000000 | 02:10
85 | 0.000000 | 0.000003 | 1.000000 | 02:09
86 | 0.000000 | 0.000003 | 1.000000 | 02:08
87 | 0.000000 | 0.000003 | 1.000000 | 02:08
88 | 0.000000 | 0.000003 | 1.000000 | 02:09
89 | 0.000000 | 0.000003 | 1.000000 | 02:09
90 | 0.000000 | 0.000003 | 1.000000 | 02:08
91 | 0.000000 | 0.000003 | 1.000000 | 02:08
92 | 0.000000 | 0.000003 | 1.000000 | 02:09
93 | 0.000000 | 0.000003 | 1.000000 | 02:09
94 | 0.000000 | 0.000003 | 1.000000 | 02:09
95 | 0.000000 | 0.000003 | 1.000000 | 02:09
96 | 0.000000 | 0.000003 | 1.000000 | 02:11
97 | 0.000000 | 0.000003 | 1.000000 | 02:11
98 | 0.000000 | 0.000003 | 1.000000 | 02:09
99 | 0.000000 | 0.000003 | 1.000000 | 02:08
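On the imbalance question in this comment, a small sketch of the arithmetic may help. It uses only the class counts quoted in the thread (125 and 500). The weighting formula is the one scikit-learn documents for class_weight="balanced", and applying it here is an assumption for illustration, not something the repository is known to do.

```python
from collections import Counter

# Class counts quoted in the thread: 125 Covid-19 and 500 No_findings images.
counts = Counter({"Covid-19": 125, "No_findings": 500})
total = sum(counts.values())

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# "Balanced" weights: total / (n_classes * class_count).
class_weights = {name: total / (len(counts) * n) for name, n in counts.items()}

print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
print(class_weights)
```

With a 1:4 ratio, always answering No_findings already scores 80%, so plain accuracy should be read against that baseline.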

muhammedtalo commented 4 years ago

It seems you are using the test set during the training.

Kannadasa commented 4 years ago

I did not create any test set during training.

All I did was split the data using StratifiedKFold with n_splits=5, so each fold holds out 20% of the data. Then I ran 5 training iterations with 100 epochs each.

In each iteration, 20% of my entire dataset acts as the test set.

I used StratifiedKFold to split the data so that a portion of test data is available during each training iteration.

For example, this is how the data is split during training (train indices, then test indices, for each of the 5 folds):

[ 25 26 27 28 ... 621 622 623 624] [ 0 1 2 3 ... 221 222 223 224]
[ 0 1 2 3 ... 621 622 623 624] [ 25 26 27 28 ... 321 322 323 324]
[ 0 1 2 3 ... 621 622 623 624] [ 50 51 52 53 ... 421 422 423 424]
[ 0 1 2 3 ... 621 622 623 624] [ 75 76 77 78 ... 521 522 523 524]
[ 0 1 2 3 ... 521 522 523 524] [100 101 102 103 ... 621 622 623 624]
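One way to check whether the held-out indices leak into training is to test each fold for overlap. The sketch below assumes the images are ordered as 125 Covid-19 files followed by 500 No_findings files, an ordering inferred from the counts in the thread rather than confirmed by it.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed layout: 125 Covid-19 images followed by 500 No_findings images.
labels = np.array([0] * 125 + [1] * 500)
indices = np.arange(len(labels))

skf = StratifiedKFold(n_splits=5)
overlaps = []
fold_sizes = []
for train_idx, valid_idx in skf.split(indices, labels):
    # Number of indices that appear in both parts; should be 0 for every fold.
    overlaps.append(len(np.intersect1d(train_idx, valid_idx)))
    fold_sizes.append((len(train_idx), len(valid_idx)))
    # Per-class counts in the held-out part of this fold.
    print(len(train_idx), len(valid_idx), np.bincount(labels[valid_idx]))
```

If every overlap is zero, the split itself is not the source of a leak, and the cause would have to be elsewhere in the training code.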

Kannadasa commented 4 years ago

Also, the dataset I am using is split only into a training set and a validation set. My test set consists of completely unseen X-ray images, and the accuracy I am getting on it is 67%.

AmiZya commented 4 years ago

@Kannadasa can you please provide the code you used for KFolds?

Kannadasa commented 4 years ago

Please find below my code for KFolds.

from fastai.vision import *  # assuming fastai v1 (ImageList, imagenet_stats)
from sklearn.model_selection import KFold, StratifiedKFold

kf = KFold(n_splits=5)             # plain K-fold, not used below
skf = StratifiedKFold(n_splits=5)  # keeps the class ratio in every fold

# Load every image without a split, only to read the labels.
data = (ImageList.from_folder(path)
        .split_none()
        .label_from_folder()
        .transform(size=(256, 256))
        .databunch()).normalize(imagenet_stats)

df = data.to_df()

for train_index, test_index in skf.split(df.index, df['y']):
    print(len(train_index), len(test_index))
    print(train_index, test_index)

    # Rebuild the DataBunch for this fold from the index arrays.
    d = (ImageList.from_folder(path)
         .split_by_idxs(train_index, test_index)
         .label_from_folder()
         .transform(size=(256, 256))
         .databunch(num_workers=0)).normalize(imagenet_stats)

AmiZya commented 4 years ago

Thanks, much appreciated.

On a side note, did you manage to get higher accuracy? I'm running the model now and it sits around 78% for the three-class model.

Kannadasa commented 4 years ago

Hi,

I did not test with 3 classes; I only tested 2 classes. My KFold code is also for 2 classes.

I am not getting good accuracy on unseen data. With the training and validation sets the model works fine, but accuracy is poor on new data that the model has not seen before.

Thanks Kannadasan

Kannadasa commented 4 years ago

Are you using KFold to split the data for 3-class prediction?

Is my KFold split code working for you with 3 classes?

Thanks Kannadasan

Shambhujii commented 4 years ago

I am also facing the same issue. I hope you have fixed this problem by now. Please let me know how you are creating the folder structure and loading the images for the train and valid datasets.

Kannadasa commented 4 years ago

Hi,

First of all, you need directories called train and valid, because fastai looks for these names when running the code. I am using KFold cross-validation to split the data into training and validation sets.

Please find below my code for KFolds.

from fastai.vision import *  # assuming fastai v1 (ImageList, imagenet_stats)
from sklearn.model_selection import KFold, StratifiedKFold

kf = KFold(n_splits=5)             # plain K-fold, not used below
skf = StratifiedKFold(n_splits=5)  # keeps the class ratio in every fold

# Load every image without a split, only to read the labels.
data = (ImageList.from_folder(path)
        .split_none()
        .label_from_folder()
        .transform(size=(256, 256))
        .databunch()).normalize(imagenet_stats)

df = data.to_df()

for train_index, test_index in skf.split(df.index, df['y']):
    print(len(train_index), len(test_index))
    print(train_index, test_index)

    # Rebuild the DataBunch for this fold from the index arrays.
    d = (ImageList.from_folder(path)
         .split_by_idxs(train_index, test_index)
         .label_from_folder()
         .transform(size=(256, 256))
         .databunch(num_workers=0)).normalize(imagenet_stats)
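If on-disk train and valid directories are wanted, for example for a loader that splits by folder name, one fold could be materialized roughly as follows. The helper name write_fold and the train/valid/<label> layout are assumptions made for illustration. They are not taken from the repository.

```python
import shutil
from pathlib import Path


def write_fold(paths, labels, train_idx, valid_idx, out_dir):
    """Copy images into out_dir/{train,valid}/<label>/ for one fold.

    Hypothetical helper: paths and labels are parallel sequences, and
    train_idx / valid_idx are the index arrays produced by a K-fold split.
    """
    out_dir = Path(out_dir)
    for split, idxs in (("train", train_idx), ("valid", valid_idx)):
        for i in idxs:
            dest = out_dir / split / labels[i]
            dest.mkdir(parents=True, exist_ok=True)
            # Keep the original file name inside the class folder.
            shutil.copy2(paths[i], dest / Path(paths[i]).name)
```

Calling it once per fold with a different out_dir keeps the folds separate on disk, at the cost of copying the images several times.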

Shambhujii commented 4 years ago

Thank you so much, my friend, for this valuable comment. I will try to split the train and validation sets as per your guidance. Thank you again. Let's collaborate to fight this pandemic.

Kannadasa commented 4 years ago

It works fine for me on the training set and validation set. But if I show some unseen X-ray images to my model, it does not predict well. I don't know how to fix this problem. If you find a solution, please let me know.

Thanks

rahuls321 commented 4 years ago

Hey @Kannadasa, I successfully ran the code with a normal split of 80% for training, 10% for validation and 10% for testing. But I'm still facing issues with KFold cross-validation. After creating the train and valid directories, this code didn't produce anything. Could you please briefly explain this code?

Thanks in advance
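For reference, an 80/10/10 split like the one described above can be sketched with scikit-learn's train_test_split. The label layout (125 positives, then 500 negatives) and the random_state value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed labels: 125 Covid-19 (0) followed by 500 No_findings (1).
labels = np.array([0] * 125 + [1] * 500)
indices = np.arange(len(labels))

# Step 1: hold out 20% of the data, stratified by class.
train_idx, rest_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=0)

# Step 2: halve the held-out part into validation and test sets.
valid_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5, stratify=labels[rest_idx], random_state=0)

print(len(train_idx), len(valid_idx), len(test_idx))
```

The test indices are set aside before any training, so they can serve as the unseen data while only the training and validation parts are used during fitting.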

aliranic commented 3 years ago

Hello! Why did you use the validation dataset as the test dataset? Or did you do something different that I don't understand?

muhammedtalo / COVID-19

Dataset issue #2