hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0

[Feature Request] Recursive Thru Folder #118

Open RorutopThe2nd opened 3 months ago

RorutopThe2nd commented 3 months ago

Trainer notebooks like Linaqruf's XL Trainer and Kohya SS support this, so I would love to have it here without having to move all of the files into one folder. This should apply to the taggers too.

hollowstrawberry commented 2 months ago

Could you elaborate on what you mean?

Currently you can create subfolders in the tagger and tag them one at a time. You can also choose multiple folders to train with in the Extras section of the trainer; you must specify the number of repeats for each folder.
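For reference, here is a minimal sketch of the multi-folder structure the trainer expects, parsed the same way as in the validate_dataset() snippet further down this thread. The folder paths are hypothetical placeholders.

import toml

# Hypothetical paths; the [[datasets.subsets]] structure matches what
# the trainer reads via datconf["datasets"][0]["subsets"].
custom_dataset = """
[[datasets]]

[[datasets.subsets]]
image_dir = "/content/drive/MyDrive/Loras/project/dataset/folder_a"
num_repeats = 3

[[datasets.subsets]]
image_dir = "/content/drive/MyDrive/Loras/project/dataset/folder_b"
num_repeats = 1
"""

subsets = toml.loads(custom_dataset)["datasets"][0]["subsets"]
print(subsets)  # [{'image_dir': '...folder_a', 'num_repeats': 3}, ...]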

RorutopThe2nd commented 2 months ago
  1. Yeah, I can tag them one at a time, but what about when I have many folders? Do I have to use os.walk around the tagger command myself? Otherwise a recursive option should be added by default (see the sketch after this list).
  2. Sheesh, just as I was about to bring up the multi-dataset problem in the trainer, I remembered the multiple-folder option in the datasets cell. Maybe any subfolders not assigned in datasets.subsets should default to the trainer's repeat images setting.
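Here is a rough sketch of what that recursion could look like, assuming os.walk over the dataset folder; the path is a hypothetical placeholder, and the print marks where the tagger would be pointed at each subfolder.

import os

supported_types = (".png", ".jpg", ".jpeg", ".webp", ".bmp")

def image_subfolders(top_folder):
  # Yield every folder under top_folder (including top_folder itself)
  # that directly contains at least one supported image file.
  for root, dirs, files in os.walk(top_folder):
    if any(f.lower().endswith(supported_types) for f in files):
      yield root

for folder in image_subfolders("/content/drive/MyDrive/Loras/project/dataset"):
  print(folder)  # run the tagger on each of these folders in turn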
hollowstrawberry commented 2 months ago

That is a good idea, but it would require more effort than I'm willing to spend right now. It's a bit harder than it seems.

RorutopThe2nd commented 2 months ago

Got bored, try this. I think it should work, but it might be a bit messy because custom_dataset is None at the start.

import os
import toml

def validate_dataset():
  # project_name, custom_dataset, images_folder and num_repeats are
  # globals defined in earlier notebook cells.
  global lr_warmup_steps, lr_warmup_ratio, caption_extension, keep_tokens
  supported_types = (".png", ".jpg", ".jpeg", ".webp", ".bmp")

  print("\nšŸ’æ Checking dataset...")
  if not project_name.strip() or any(c in project_name for c in " .()\"'\\/"):
    print("šŸ’„ Error: Please choose a valid project name.")
    return

  # Start from the subsets declared in the custom dataset TOML, if any.
  # custom_dataset can be None, so test truthiness rather than len().
  datasets = []
  print(custom_dataset)  # debug
  if custom_dataset:
    try:
      datconf = toml.loads(custom_dataset)
      datasets.extend(datconf["datasets"][0]["subsets"])
    except Exception:
      print("šŸ’„ Error: Your custom dataset is invalid or contains an error! Please check the original template.")
      return

  # Recursively collect every subfolder of images_folder. os.walk only
  # yields directories, and "." is the top-level folder itself, which is skipped.
  leftover_folders = [root for root, dirs, files in os.walk(images_folder)
                      if os.path.relpath(root, images_folder) != "."]
  print(leftover_folders)  # debug

  # Subfolders already claimed by the custom dataset keep their settings;
  # the rest fall back to the trainer's global num_repeats.
  assigned_folders = {d.get("image_dir") for d in datasets}
  for f in leftover_folders:
    if f not in assigned_folders:
      datasets.append({
          "image_dir": f,
          "num_repeats": num_repeats,
      })

  print(datasets)  # debug

  reg = [d.get("image_dir") for d in datasets if d.get("is_reg", False)]
  datasets_dict = {d["image_dir"]: d["num_repeats"] for d in datasets}
  folders = datasets_dict.keys()

  # Check that every folder exists before trying to list its contents.
  for folder in folders:
    if not os.path.exists(folder):
      print(f"šŸ’„ Error: The folder {folder.replace('/content/drive/', '')} doesn't exist.")
      return

  files = [f for folder in folders for f in os.listdir(folder)]
  images_repeats = {folder: (len([f for f in os.listdir(folder) if f.lower().endswith(supported_types)]),
                             datasets_dict[folder]) for folder in folders}
  print(images_repeats)  # debug

  for folder, (img, rep) in images_repeats.items():
    if not img:
      print(f"šŸ’„ Error: Your {folder.replace('/content/drive/', '')} folder is empty.")
      return
  for f in files:
    if not f.lower().endswith((".txt", ".npz")) and not f.lower().endswith(supported_types):
      print(f"šŸ’„ Error: Invalid file in dataset: \"{f}\". Aborting.")
      return
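To make the fallback behavior easier to test on its own, here is the folder-merging step pulled out into a standalone helper, with hypothetical paths; it is a sketch of the patch above, not part of the notebook.

import os

def merge_leftover_folders(datasets, images_folder, default_repeats):
  # Append every subfolder of images_folder that the custom dataset
  # didn't already claim, using the trainer's default repeat count.
  assigned = {d.get("image_dir") for d in datasets}
  for root, dirs, files in os.walk(images_folder):
    if os.path.relpath(root, images_folder) != "." and root not in assigned:
      datasets.append({"image_dir": root, "num_repeats": default_repeats})
  return datasets

# Example: one subfolder pinned to 5 repeats; any others found on disk
# pick up the default of 2.
existing = [{"image_dir": "/content/dataset/main", "num_repeats": 5}]
print(merge_leftover_folders(existing, "/content/dataset", 2))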