stashapp / stash

An organizer for your porn, written in Go. Documentation: https://docs.stashapp.cc
https://stashapp.cc/
GNU Affero General Public License v3.0

[Bug Report] Images incorrectly merged on scan #4639

Open bdbenim opened 7 months ago

bdbenim commented 7 months ago

Describe the bug

This is really two issues in one, but I'm submitting them together because they are tightly coupled and fixing either one would make the other much less important.

Sometimes when scanning a directory containing multiple galleries (each in its own subdirectory), two image files will incorrectly be identified as the same image. This is not a huge deal on its own; however, there seems to be no way to manually split them apart afterwards, as there is with scene files.

This seems to happen when the images have identical file sizes. The images I've seen it with so far have also had identical dimensions, but that could just be a coincidence because galleries tend to be filled with images of identical dimensions anyway. I've noticed though that each time this happens, the "File Info" tab of the image will show dimensions "0 x 0" for both files unless the image is rescanned.

To Reproduce

Steps to reproduce the behavior:

  1. Scan a directory containing multiple subdirectories of galleries, where some images have identical file sizes
  2. Same-sized image files will be shown as the same image in the library (maybe?)

Expected behavior

Ideally, stash would not merge images that are not the same. More importantly, the user should be able to fix such mistakes manually when they do occur by splitting the files into separate images, just like with scenes. Refer to the screenshots below.

Screenshots

Here are the options shown for files other than the primary file for an image: [screenshot]

Here are the options for additional files in a scene: [screenshot]

Stash Version: (from Settings -> About): v0.24.3

Additional context

In my case, this scan was performed shortly after copying several GB of images from another machine. It's possible that, in addition to the files having identical sizes, some of them were still empty because the copy had not completely finished. I think the transfer was done, but wouldn't bet my life on it. If some files were empty, they would conceivably have the same hash as well, which might be why they were merged when scanning.
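To illustrate the hypothesis about empty files, here is a minimal Go sketch. It assumes a content hash such as MD5; this thread does not confirm which hash stash uses for images, and `contentHash` is a hypothetical helper, not a stash function. It only shows that any two zero-length inputs produce the same digest, so two files that were still empty at scan time would look identical to a hash-based duplicate check.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// contentHash returns the hex-encoded MD5 digest of data.
// MD5 is an assumption for illustration only; the hash stash
// actually uses for images is not stated in this thread.
func contentHash(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Stand-ins for two different image files that were both
	// still empty (0 bytes) when the scan read them.
	a := contentHash([]byte{})
	b := contentHash(nil)

	fmt.Println("same hash:", a == b)
	fmt.Println("digest:", a)
}
```

Once the files contain their real data, the digests would differ, which would be consistent with a later rescan importing them as separate images.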

I also have the setting enabled to create galleries from folders of images. I doubt that is affecting this behaviour, but you never know.

bdbenim commented 7 months ago

OK, so I just tried deleting some of the merged images from my library and re-scanning the same directory. This time they were imported as separate images, which I think supports my guess that those files hadn't finished copying when they were originally scanned. While I'd still call the resulting behaviour a bug, it's definitely an edge case and I don't think it's unreasonable to say "make sure your files are fully written before telling stash to scan them."

That said, a simple fix might be to ignore these files when scanning, which seems to be what is already done with scenes from what I can tell. That way, a later scan could pick up the files that weren't imported originally, without the user having to delete or split the merged images.

WithoutPants commented 7 months ago

As far as I'm aware, there's no simple way to determine if a file is being written to while scanning. I'm not aware of any intentional behaviour that detects video files being written to and ignores them.

bdbenim commented 7 months ago

Yes, sorry, I meant files that are empty or otherwise "invalid" as images should be ignored, not necessarily files that are currently being written to. I'm assuming that if it interpreted the dimensions as 0x0 then there must have been something unusual about the image that could be detected by stash.

With so many different image formats out there I admit that I don't know how easy a task like "validating" an image truly is. And of course you could run into a file that doesn't correctly follow the specification for a given format but is still decodable, so you probably don't want to throw it out as "invalid". But I feel like it's at least safe to ignore an image that's empty altogether.

Edit: regarding video files, I haven't looked at the code, so all I know is that if I have a video that can't be played even by VLC trying its best, then stash generally seems to ignore it too. I'm not sure what logic is actually happening under the hood: maybe an error is thrown when trying to encode the preview, or maybe there's an actual validation step. I see this with torrents periodically because I might skip downloading some files, but they'll still sometimes get created and have a small amount of data written if a torrent chunk contains data from more than one file. Depending on how much gets written, sometimes these scenes get added to stash as partially playable/partially corrupt videos, and other times they are skipped entirely.