jfilter / split-folders

🗂 Split folders with files (i.e. images) into training, validation and test (dataset) folders
MIT License

Split files with same prefix together #14

Closed Tato14 closed 2 years ago

Tato14 commented 4 years ago

I have some experiments where I crop each image into small tiles. All tiles from one image share the same prefix. Is it possible to keep the tiles from the same source image together, either all in train or all in valid?

Thanks!

jfilter commented 4 years ago

Added the optional argument group_prefix in 0.4.0.

group_prefix takes the number of files in each group, so set group_prefix=2 if you have an image and a text file for each item.

Tato14 commented 3 years ago

group_prefix seems to take the whole filename into account and to distinguish files only by their extension. I was thinking of something that could use a prefix within the filename. I will give an example for the sake of clarity. Imagine I have a set of images like:

485092_Soft_DXm.1.2.840.113619.2.401.101117117236165.6548190722132000.31.jpeg
485092_Soft_DXm.1.2.840.113619.3.401.10111711723611.101117117236165.jpeg
1037264_Normal_DXm.1.2.840.113619.2.401.101117117236165.29256180320170934.3.jpeg
1522377_Normal_DXm.1.2.840.113619.2.401.101117117236165.13665191212160814.3.jpeg
1338551_Normal_DXm.1.2.840.113619.2.401.101117117236165.14135180423173036.7.jpeg
1094100_Hard_DXm.1.2.840.113619.2.401.101117117236165.14398190521104701.11.jpeg
1094100_Normal_DXm.1.2.840.113619.2.401.1011171172361636165.141351804231.jpeg

I would like to use the digits before the first _ to group images and keep each group in the same folder. In this case, the first two files (485092*) would form one group and the last two (1094100*) another.
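For illustration, the grouping described here can be sketched in plain Python, independent of the package. The function name `split_by_prefix` and its parameters (`val_ratio`, `seed`, `sep`) are hypothetical and not part of split-folders; the sketch only assumes that the group key is the text before the first separator.

```python
import random
from collections import defaultdict


def split_by_prefix(filenames, val_ratio=0.2, seed=1337, sep="_"):
    """Split filenames into train/val so files sharing a prefix stay together.

    Hypothetical helper for illustration; not part of split-folders.
    """
    groups = defaultdict(list)
    for name in filenames:
        # The group key is everything before the first separator.
        groups[name.split(sep, 1)[0]].append(name)

    # Sort the keys first so the seeded shuffle is reproducible.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Whole groups, not single files, are assigned to validation.
    n_val = round(len(keys) * val_ratio)
    val = [name for key in keys[:n_val] for name in groups[key]]
    train = [name for key in keys[n_val:] for name in groups[key]]
    return {"train": train, "val": val}


if __name__ == "__main__":
    # Shortened stand-ins for the filenames listed above.
    files = [
        "485092_Soft_a.jpeg",
        "485092_Soft_b.jpeg",
        "1037264_Normal_a.jpeg",
        "1522377_Normal_a.jpeg",
        "1338551_Normal_a.jpeg",
        "1094100_Hard_a.jpeg",
        "1094100_Normal_a.jpeg",
    ]
    for subset, names in split_by_prefix(files).items():
        print(subset, names)
```

Because the ratio is applied to groups rather than to files, the resulting file counts only approximate the requested split when group sizes differ.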

jfilter commented 3 years ago

No, the prefix is derived dynamically from the number of files that belong to one group. It only works if the number of files in each group is the same. In your example, the number of files per group differs. Is your example a real-world scenario?
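To illustrate why a fixed group length breaks down with uneven groups, here is a minimal sketch. It assumes grouping works by sorting the filenames and cutting them into consecutive chunks of equal size; that is an assumption made for illustration, not the package's actual code, and `chunk_sorted` is a hypothetical name.

```python
def chunk_sorted(filenames, group_size):
    """Cut name-sorted files into consecutive groups of `group_size`.

    Hypothetical illustration; an assumption, not split-folders' code.
    """
    ordered = sorted(filenames)
    if len(ordered) % group_size != 0:
        raise ValueError("number of files is not a multiple of group_size")
    return [
        ordered[i:i + group_size]
        for i in range(0, len(ordered), group_size)
    ]


if __name__ == "__main__":
    # Even groups: each item has exactly one image and one text file.
    print(chunk_sorted(["a.jpg", "a.txt", "b.jpg", "b.txt"], 2))

    # Uneven groups: one patient has two images, the others have one,
    # so the first chunk mixes two different patients.
    print(chunk_sorted(
        ["485092_a.jpeg", "485092_b.jpeg", "1037264_a.jpeg", "1094100_a.jpeg"],
        2,
    ))
```

With equal-sized groups the chunks line up with the items, while with uneven groups a chunk can span files from different sources.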

Tato14 commented 3 years ago

Yes, this is a real world scenario.

Samples with the same ID belong to the same patient, but the images were taken differently. I could make the prefix the same number of characters in every filename if that would help.

Another example would be the PANDA Kaggle challenge. In this dataset, you have huge images with sparse information. One strategy would be to tile each image into small parts. You would then generate tiles that all bear the same sample name but differ in the coordinates from which they were extracted.

jfilter commented 2 years ago

I understand that this is a real world scenario. But for me, this is outside the scope of this package. If anybody wants to implement it, please open a new issue.