I'm working on a project to take images of furnished rooms and remove all the furniture. I've got a large dataset of image pairs. I'm not using any preprocessing on the images so as to allow the model to preserve details of the original image (wall color, floor material, etc.).
After training on a 4090 for about 5 days, I'm no longer seeing any improvement (see examples below).
I'm looking to get tips about where to go from here.
Does it just need to be trained longer?
Do I need to adjust the learning rate?
Should I spend more time cleaning the dataset? A small percentage of the pairs is probably bad; as you can see in one of the examples below, the target image is dark.
Should I preprocess the images to simplify the task (e.g., MLSD line detection)? It would lose the details of the original, but it might at least produce better output for the final image.
Perhaps ControlNet isn't the right architecture for this, and I should use pix2pix instead?
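On the dataset-cleaning question, one low-effort first pass is to flag pairs whose target image is unusually dark with a mean-luminance check. This is just a sketch under assumptions: pixels are RGB tuples with channel values in 0-255, and the cutoff of 40 is an arbitrary placeholder you'd tune on your data (with Pillow you'd get the pixel list via `list(Image.open(path).convert("RGB").getdata())`).

```python
def mean_luminance(pixels):
    """Average perceptual luminance (Rec. 601 weights) of a flat
    list of (r, g, b) tuples with channel values in 0-255."""
    total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
    return total / len(pixels)

def is_too_dark(pixels, threshold=40):
    """Flag an image as a likely bad target if its mean luminance
    falls below `threshold` (placeholder value; tune on your data)."""
    return mean_luminance(pixels) < threshold
```

Running this over the target halves of the pairs and eyeballing the flagged ones should tell you quickly whether bad pairs are common enough to be worth a full cleaning pass.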
Thanks for the help!
Example 1
Source:
Target:
Model Result:
Example 2
Source:
Target:
Model Result:
First Training Run
Second Training Run