hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

One mapping db per remote tag service #1276

Open roachcord3 opened 1 year ago

roachcord3 commented 1 year ago

User story: I have a split db, and I take nightly backups of it using borg, but the PTR makes it so that client.mappings.db changes quite a bit almost every night, resulting in borg backups, after deduplication, still being around 20GB every night. This is not a storage concern, since I do periodic pruning, but it does make the backups and integrity checks take a long time to complete: when I was using lzma compression, it would make backups take ~5 hours and integrity checks would take 13+ hours. (I eventually switched to zstd, which is 10x faster and increases archive sizes by only around 50%, which is an acceptable tradeoff for me, but maybe not for someone who's space-constrained.) Yes I do somewhat invite this upon myself by having a split db, by using the PTR, by using compression, yada yada, but I'm sure many users are in the same boat as me.

If the PTR's tag service was its own db, then I could back it up weekly or monthly instead of nightly, without having to manually pause PTR processing. Of course, treating the PTR as a special case is never the goal of my FRs like this. I am interested in general solutions. It's easier to write code that treats the problem generically, and it makes the feature more flexible.


NOTE: originally, I worded this as a matter of making it purely generic, not differentiating between local and remote tagging services. I got pushback on that, and after cooling down, I realized that there is no need to have a separate db file per local tagging service. Ultimately, the problem I'm facing is with the PTR and would apply to any remote tagging service, and remote services are under far less of the user's control than local ones are. I think a suitable solution would be to separate the remote tag service mapping dbs into their own files. For most users, this would mean just one remote tag service, the PTR, but for some, it might mean several db files, which could run up against the default limit of 10 attached dbs at a time. IMO, there could be a lot of benefit for users of deduplicating archivers like borg if each remote service got its own mapping db, but I understand that it adds complication, so an acceptable compromise would be "one separate mapping db for all remote tag services, together."


Click this for the original text of the rest of the issue, which is still relevant depending on whether you want to go with "one mapping db per remote tag service" or not.

As discussed in discord, [the default hard limit for the maximum number of attached databases is 10](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.setlimit), but this can be increased by [recompiling sqlite with the macro `SQLITE_MAX_ATTACHED`](https://www.sqlite.org/c3ref/limit.html) set to [as high as _125_](https://www.sqlite.org/limits.html#max_attached). I found a handy blog post that, while focused on enabling other aspects of sqlite rather than increasing its attach limit, does show [how to compile sqlite for use with python](https://charlesleifer.com/blog/compiling-sqlite-for-use-with-python-applications/), so I figured it might be useful to share. Whether this means you would distribute a recompiled sqlite alongside hydrus or you would tell power users to do the work of recompiling and installing sqlite themselves, I don't know.

Once the limit is sufficiently increased, you could split the mappings db into one db per tag service, plus an extra miscellaneous mapping db for everything that isn't a tag service, I guess. You know your db layout better than I do. You could also add a warning to users who try to add more tag services than would work with the current connection limit (accessed with [`getlimit`](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.getlimit)), minus 2 or 3 for some future wiggle room in case you want to add more standard dbs that would work within the default limit of 10.

Other user stories for this change:

* ~~New user wants to sync with PTR quickly. This change could make it easier to pass around a synced PTR db that wouldn't be absolutely huge, saving each user hundreds of hours of computation. Multiply this by thousands more users over time (plus, the PTR gets only bigger and longer to sync over time) and you've got savings in terms of millions of hours of computation.~~ Apparently, there's another issue with hash ID uniqueness that would also have to be addressed first before this benefit could be realized, so consider this moot.
* (Probably niche) Would let a user, through symlinking, put mapping dbs that get more I/O on separate disks to increase total I/O bandwidth (probably applies mostly to spinning disks), or to put the wear on disks they're more willing to wear through faster.
thatfuckingbird commented 1 year ago

I don't like the idea of rearchitecting how the db works just because it's not ideal for your specific backup settings. You should just change your backups. If you disabled compression for that backup target, it would be much faster.

roachcord3 commented 1 year ago
open if you want to read a pointless internet argument

@thatfuckingbird that's unnecessarily wasteful, unless you know a way to magically make my backup provider give me more bandwidth and storage for free. Perhaps you would be willing to pay to upgrade my plan? If so, very generous of you. If not, then maybe just consider that not every user story has to apply to you personally for it to be a valid one. Not to mention, having huge sqlite db files is never ideal. Splitting the dbs by tagging service could hypothetically improve app performance. It would also make it easier for people to distribute copies of their synced PTR to others, to speed up initial syncing, which would save potentially millions of hours of computation in the long run.
thatfuckingbird commented 1 year ago

It's not my problem you chose a backup provider that's too expensive or limited. The PTR file isn't even huge by sqlite standards, sqlite was designed to be efficient with far larger files than this. It wouldn't make it easier to distribute copies of the PTR, because the tag mappings, that take up all that space, use hash IDs that are unique to your database (this is why processing is needed at all instead of just copying in the PTR - if it was only a matter of having a table imported that could be done without a separate database). The only thing you can do is distribute an entire, otherwise empty, database with PTR preprocessed, which doesn't need this (and was done in the past already but abandoned). Additionally, sqlite in WAL mode cannot do atomic transactions over multiple attached databases, which might be a problem depending on how the databases are split up. Additional logic would also need to be implemented for the cases where there are more services than the attach limit.

roachcord3 commented 1 year ago
open if you want to read a pointless internet argument

> It's not my problem you chose a backup provider that's too expensive or limited.

I picked a popular, cheap one (borgbase). As my original issue says, while this is partially self-inflicted because of my choice of compression algorithm, my choice is motivated by least-waste, which is a good enough motivation to justify doing it. Also, there are certainly others who are (or are going to be) in the same boat as me, so this isn't me asking for a major refactor for purely selfish reasons. Also, not every github issue has to be "your problem."

> It wouldn't make it easier to distribute copies of the PTR, because the tag mappings, that take up all that space, use hash IDs that are unique to your database (this is why processing is needed at all instead of just copying in the PTR - if it was only a matter of having a table imported that could be done without a separate database)

That seems to be its own problem, then, isn't it? If it's not an intractable one, wouldn't it make sense to address it so that we could distribute the PTR easily like I was talking about?

> Additionally, sqlite in WAL mode cannot do atomic transactions over multiple attached databases

That's a trivial problem to work around; even a basic semaphore will do the trick. There are already multiple databases. I'm sure it's already being accounted for by the software.

> Additional logic would also need to be implemented for the cases where there are more services than the attach limit.

Re-read the original issue. That limit is effectively 125, and the "additional logic" would be to check `getlimit`, subtract 2 or 3, and reject any operation that would result in more databases than that.
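For what it's worth, the "basic semaphore" idea above can be sketched as follows, assuming a single shared connection: a lock serializes the paired writes between threads while `with conn:` groups them into one transaction. The file stand-ins and table names are invented for illustration, not hydrus's real schema:

```python
import sqlite3
import threading

# In-memory stand-ins for client.db and a hypothetical
# client.remote.mappings.db. With real ATTACHed files in WAL mode, a
# crash between commits can still leave the files out of step, which is
# the residual risk under discussion.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("ATTACH ':memory:' AS remote_mappings")
conn.execute("CREATE TABLE main.files (hash_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE remote_mappings.mappings (hash_id INTEGER, tag_id INTEGER)")

write_lock = threading.Lock()  # the "basic semaphore" from the comment

def add_file_with_mapping(hash_id: int, tag_id: int) -> None:
    # The lock keeps paired writes serialized across threads; the
    # `with conn:` block commits them as one transaction on this connection.
    with write_lock:
        with conn:
            conn.execute("INSERT INTO main.files VALUES (?)", (hash_id,))
            conn.execute(
                "INSERT INTO remote_mappings.mappings VALUES (?, ?)",
                (hash_id, tag_id),
            )

add_file_with_mapping(1, 42)
```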
thatfuckingbird commented 1 year ago

> Also, not every github issue has to be "your problem."

  1. You begged me to pay for your shitty backup solution, hence making it my problem too.
  2. It will also be my problem if this braindead idea gets through, hence here I am pointing out why it is a very bad idea.

> That seems to be its own problem, then, isn't it? If it's not an intractable one, wouldn't it make sense to address that so that we could distribute the PTR easily like I was talking about?

It is a very hard problem, hence why it hasn't been solved yet despite the huge upside of doing away with processing. My point was that this change would not make it easier in any way, so it is completely unrelated to this issue.

> That's a trivial problem to work around; even a basic semaphore will do the trick.

"its ez bro just do X". Yeah maybe it would, but maybe it wouldn't. It would certainly need to be taken into account in all future service-related db changes. Data integrity is already an exceedingly hard problem where sqlite helps us a lot. Let's not make it harder to be helped than necessary.

> Re-read the original issue. That limit is effectively 125, and the "additional logic" would be to check getlimit, subtract 2 or 3, and reject any operation that would result in more databases than that.

It is not effectively 125; it would be 125 if hydrus shipped its own sqlite for all platforms. Hydrus doesn't ship any custom compiled components for a very good reason: making that work for all the supported platforms and architectures would be exceedingly difficult. The bundled mpv libs come closest (and even those aren't custom compiled by hydev), and they are already a huge source of pain and user issues. For optional features it wouldn't be a huge deal to tell people to go compile their own sqlite if they want it, but here, if you can't use the custom one, your hydrus would be crippled in a major way by the limit. Not to mention putting arbitrary limits like this on the number of services is just bad design. You can't know where this will come back in the future to bite your ass (and you just know there already is some madman out there with >125 tag services. I would bet on it.)

Overall, while this could be implemented, I see a lot of downsides but no upsides besides making your backups faster, which would be better solved by you finding a better solution to do those backups.

Also, I just noticed that you not only want these services in separate files, you also want hydrus to gracefully handle cases where suddenly an older version of the service db gets swapped in (hence the weekly vs. daily backups you want for some databases). This is another huge complication potentially. What if there are references elsewhere to the missing content?

roachcord3 commented 1 year ago
open if you want to read a pointless internet argument

@thatfuckingbird You're being overly confrontational and making stuff up, so I'm not going to respond to anything that isn't about the technical details. I'm happy to discuss technical issues with just about anyone, even a belligerent person, but I'm not going to engage with non-constructive language any more than is strictly necessary to keep us on-topic.

> My point was that this change would not make it easier in any way,

I ostensibly agree, because I view solving that problem as a prerequisite to making any meaningful progress on this feature request.

> so it is completely unrelated to this issue.

Technical blockers and prerequisites are never unrelated, which is why you brought it up initially, no? I'm confused why you say it's unrelated; I assume you meant to say something different.

> Data integrity is already an exceedingly hard problem where sqlite helps us a lot.

It's the transactional/relational database model that's helping, not sqlite. Standalone, multi-user RDBMSes like PostgreSQL and MySQL would be even better, since they have actual DR stories beyond simply copying a huge db file around. I understand why sqlite was chosen over them, but let's not give it undeserved credit. Its shortcomings are precisely the reason the built-in backup routine is so simplistic and locks the whole program up, and why the recommendation to users with split db scenarios is to just figure it out themselves.

> Let's not make it harder to be helped than necessary.

Of course; that's why I wouldn't request something like this if there weren't a way to increase the attach limit.

> Hydrus doesn't ship any custom compiled components for a very good reason. Making that work for all the supported platforms and architectures would be exceedingly difficult.

There's already a sqlite binary in the db folder in the distributed releases, which means someone out there in the world compiled it at some point and got it to work in an easily-distributed way, and the build scripts bundle it in without users having compatibility issues, as far as I can tell, since I don't see people in Discord telling users to compile sqlite themselves. So "exceedingly difficult" seems like a stretch. It sounds more like you're trying to say "I wouldn't want to do it if I were hydev," which is valid, and I respect your feelings on the matter. I assume hydev will have his own thoughts and feelings on the matter, which is why I said it could be compiled as part of the build scripts and bundled with the releases, or it could be something the user has to do themselves if they want to go past the default attach limit. It's just something to consider, not this huge, intimidating problem that makes the feature request "braindead," per your own words. For that matter, neither is wrapping two atomic database transactions in series (into one function, through a semaphore or other locking mechanism of your choice) a particularly intimidating problem. It's pretty much concurrent programming 101, and I assume hydrus is already doing it in plenty of places, since it already has a handful of separate db files.

> Not to mention putting arbitrary limits like this on number of services is just bad design. You can't know where this will come back in the future to bite your ass (and you just know there already is some madman out there with >125 tag services. I would bet on it.)

This is the only valid criticism you've made, since it's the only thing that actually affects the design and future capabilities of hydrus. (Everything else is implementation detail, which is merely part of the cost-benefit analysis when deciding what to prioritize.) I anticipated that there would need to be implementation wiggle room for some extra dbs, which is why I said the client would have to check the attach limit, subtract 2 or 3, and treat that as the limit. But I can't say for sure whether there's a valid use case for having over ~115 tag services; if there is, then my request would cause an unnecessary restriction. With that said, if there's someone out there with >125 tag services right now who's doing it for some insane, invalid use case, then he'd just have to cope, wouldn't he? What's more reasonable, compressing your backups or having >125 tag services?

As far as the professional world is concerned, if a feature has a salient user story, then even if it's hard to implement and wouldn't be worth looking at for the foreseeable future, especially because of external blockers, it's not "braindead." At worst, it simply doesn't make it onto the roadmap. If hydev doesn't want to work on this (for whatever reason, not just "it's not worth the effort" or "it's got more cons than pros"), then let him close it; however, I only opened this because he and I actually discussed the particulars in discord and he didn't seem opposed to it, and there was technical detail to capture and write down in one place so we don't lose track of it.

> you also want hydrus to gracefully handle cases where suddenly an older version of the service db gets swapped in (hence the weekly vs. daily backups you want for some databases). This is another huge complication potentially. What if there are references elsewhere to the missing content?

That totally depends. What dangling references are you anticipating? Mappings currently don't care whether a file is present in your db, so I can't imagine that behavior wouldn't continue to be preserved. And that works because we're relying on universal hashes to identify files, correct? We already talked about how we would first need to solve the problem of tag hash IDs being universal before this would be a viable feature request, so, assuming that's solved, would this behavior not also extend to tags? Also, doesn't the client already tolerate some degree of data loss in the database files, which could be caused by dangling references? I've had to restore a partially corrupted database before and managed to get through it eventually; I also see plenty of people in discord get through worse data-loss issues than I've ever experienced. On top of all this, I know hydrus works fine, albeit with fewer mappings than others, if your PTR sync is suspended. So what's the issue with rolling back the PTR mappings a few weeks without rolling back the other dbs?

Given my limited knowledge, I'll have to rely on you or another knowledgeable person to explain it to me the same way you explained the tag hash ID problem. I know you know more about the inner workings of hydrus than I do; it says "collaborator" on your posts, after all, and you were the one to point out a blocker I didn't know about. If you didn't have anything particular in mind, though, I get it; you don't need to know the schemas to understand implicitly that, in the abstract, this is the type of programming problem where dangling references can be an issue. However, if it's just a vague feeling of "well, maybe there could be an issue of that sort," then I'd ask you not to make a mountain out of a molehill by calling it "another huge complication possibly." That's just FUD disguised as foresight. I mean no disrespect by saying this; I would say the same (albeit in a lot fewer words) to someone who's being nice but still spreading FUD. It's your attitude that forces me to go out of my way to be clearly respectful, lest you overreact again, like you did with my joke about paying for my borgbase subscription. You, me, and every poor sap reading this thread in the future could have been saved a bunch of time if both of us were following the community guidelines, instead of just me.

Anyway, that's it for my responses to the technical points mired under the layers of insults, lies, and FUD over something as inconsequential as a feature request. I'm happy to discuss these or any other topics that occur to you, like in your last paragraph, but let's focus on being constructive, ok? If we can be cool, let's be cool. We _should_ have a common goal here of making hydrus better, so let's act like it.
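As an aside on the "handful of separate db files" point: any sqlite connection can enumerate its currently attached files with `PRAGMA database_list`, which is what a "too many services" check would have to consult. A minimal sketch; the aliases below are invented stand-ins, not hydrus's real filenames:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-ins for the extra files a split hydrus db would ATTACH.
conn.execute("ATTACH ':memory:' AS client_caches")
conn.execute("ATTACH ':memory:' AS client_mappings")

# Each row of database_list is (seq, name, file); 'main' is always present.
attached = [name for _, name, _ in conn.execute("PRAGMA database_list")]
```

Comparing `len(attached)` against the attach limit (with a couple of slots held back) is the whole "additional logic" being argued over.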
thatfuckingbird commented 1 year ago

> @thatfuckingbird You're being overly confrontational and making stuff up, so I'm not going to respond to anything that isn't about the technical details.

I'm not making up anything. Btw, my first comment was 2 lines, explicitly stating a personal opinion. Read your own reply to that. I wasn't the confrontational one.

> Technical blockers and prerequisites are never unrelated, which is why you brought it up initially, no? I'm confused why you say it's unrelated; I assume you meant to say something different.

No. You could have this feature with or without making the PTR portable. It is not a blocker. It is unrelated. It was you who brought it up, in the OP. I just pointed out that what you wrote there is wrong.

> It's the transactional/relational database model that's helping, not sqlite.

Wrong. That is one part of it. There is a lot to take care of between an application calling BEGIN/COMMIT and the data actually being recorded to disk, and sqlite does a lot to make that part work as safely as possible. I do not want any change that could possibly affect this.

> There's already a sqlite binary in the db folder in the distributed releases, which means someone out there in the world compiled it at some point and got it to work in an easily-distributed way, and the build scripts bundle it in without users having compatibility issues, as far as I can tell, since I don't see people in Discord telling users to compile sqlite themselves already. So, "exceedingly difficult" seems like a stretch.

Yes and now all the effort of "compiling it and get it working in an easily distributed way" will fall to hydrus. And what about everyone else not running the releases? That is at least everyone on Win7 or older and soon everyone on Linux too (if all goes to plan). It is a major pain and not worth it at all to deal with requiring custom binaries. Hydev might or might not want to compile his own sqlite for 2 or 3 platforms, but I sure as hell don't want to, even for 1 (I did it before, for a different compile-time limit that was affecting hydrus).

> This is the only valid criticism you've made, since it's the only thing where it actually affects the design and future capabilities of hydrus.

No. Stating that from a technical POV this is a bad idea and not worth the cost at all is completely valid criticism. Who said we are only allowed to discuss abstract design and not the actual implementation?

> With that said, if there's someone out there with >125 tag services right now who's doing it for some insane, invalid use case, then he'd just have to cope, wouldn't he? What's more reasonable, compressing your backups or having >125 tag services?

Could say the same to you. Why do you think that use case is invalid? Complicated backup solutions are out of scope for Hydrus, that's why it disables the built-in backup & recommends external tools for anything but the most basic db layout. The guy with >125 services is using hydrus as intended.

> However, if it's just a vague feeling of "well, maybe there could be an issue of that sort," then I'd like to ask you to not make a mountain out of a molehill by calling it "another huge complication possibly."

That's why I wrote "possibly". Seems clear enough to me. It's not FUD, it is a valid concern that should be addressed. Since you wanted examples, sync status and caching immediately come to mind.

> We should have a common goal here of making hydrus better, so let's act like it.

Pointing out why this change would be worse for everyone not using borg + a bandwidth-limited provider is part of this. That's all I'm here to do.

roachcord3 commented 1 year ago
open if you want to read a pointless internet argument

> I'm not making up anything

(screenshots of earlier comments in this issue thread)

We have nothing more to discuss.
thatfuckingbird commented 1 year ago

If you don't want to be called out on your passive-aggressive bullshit, then next time don't start it, retard.

roachcord3 commented 1 year ago

I'm reopening this but after calming down a bit, I realize this guy was (mostly) right. It doesn't need to be one mapping db per tag service. The local tag services should all stay in one db. The remote ones are the ones that should be separated. I am going to reword this entire issue along those terms.

hydrusnetwork commented 1 year ago

I don't have time to read all this, but after talking about this on discord a bit in the past couple weeks and a bit more today, I think I will go with this local/remote split. client.local.mappings.db and client.remote.mappings.db. We can add one more database file ATTACH without stretching things, and the local/remote split will help all sorts of backup and recovery situations. Having one file per service is more complicated than we can deal with right now.

I don't have a timeframe for this. I'll have to plan the migration carefully. Most likely the new year.
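A sketch of what the local/remote split migration might look like, assuming the filenames from the comment above; the table name and the copy-then-drop sequence are guesses for illustration, not hydev's actual plan:

```python
import os
import sqlite3
import tempfile

# Work in a scratch directory; client.mappings.db stands in for the
# current combined file, client.remote.mappings.db for the new one.
db_dir = tempfile.mkdtemp()
old_db = os.path.join(db_dir, "client.mappings.db")
new_db = os.path.join(db_dir, "client.remote.mappings.db")

conn = sqlite3.connect(old_db)
# Hypothetical remote-service mappings table with one PTR-style row.
conn.execute("CREATE TABLE current_mappings_ptr (tag_id INTEGER, hash_id INTEGER)")
conn.execute("INSERT INTO current_mappings_ptr VALUES (1, 1)")
conn.commit()

# Attach the new file and move the remote-service table across.
conn.execute(f"ATTACH '{new_db}' AS remote")
conn.execute(
    "CREATE TABLE remote.current_mappings_ptr AS "
    "SELECT * FROM main.current_mappings_ptr"
)
conn.execute("DROP TABLE main.current_mappings_ptr")
conn.commit()
conn.close()
```

After this runs, the remote mappings live in their own file, which is exactly what lets borg-style backups treat the two on different schedules.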

roachcord3 commented 1 year ago

Sounds reasonable and I'm sorry for the pointlessly hostile and wordy thread that preceded it. Thanks for your hard work as always.