roachcord3 opened 1 year ago
I don't like the idea of rearchitecting how the db works just because it's not ideal for your specific backup settings. Should just change your backups. If you disabled compression for that backup target it would be much faster.
It's not my problem you chose a backup provider that's too expensive or limited. The PTR file isn't even huge by sqlite standards; sqlite was designed to be efficient with far larger files than this.

It wouldn't make it easier to distribute copies of the PTR, because the tag mappings, which take up all that space, use hash IDs that are unique to your database (this is why processing is needed at all instead of just copying in the PTR - if it were only a matter of importing a table, that could be done without a separate database). The only thing you can do is distribute an entire, otherwise empty, database with the PTR preprocessed, which doesn't need this change (and was done in the past already but abandoned).

Additionally, sqlite in WAL mode cannot do atomic transactions over multiple attached databases, which might be a problem depending on how the databases are split up. Additional logic would also need to be implemented for the cases where there are more services than the attach limit.
Also, not every github issue has to be "your problem."
That seems to be its own problem, then, doesn't it? If it's not an intractable one, wouldn't it make sense to address that so that we could distribute the PTR easily, like I was talking about?
It is a very hard problem, hence why it hasn't been solved yet despite the huge upside of doing away with processing. My point was that this change would not make it easier in any way, so it is completely unrelated to this issue.
That's a trivial problem to work around; even a basic semaphore will do the trick.
"its ez bro just do X". Yeah maybe it would, but maybe it wouldn't. It would certainly need to be taken into account in all future service-related db changes. Data integrity is already an exceedingly hard problem where sqlite helps us a lot. Let's not make it harder to be helped than necessary.
Re-read the original issue. That limit is effectively 125, and the "additional logic" would be to check getlimit, subtract 2 or 3, and reject any operation that would result in more databases than that.
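To make that concrete, a rough sketch of the kind of check I mean (the function names are hypothetical, and `Connection.getlimit` needs Python 3.11+):

```python
import sqlite3

def attached_count(conn: sqlite3.Connection) -> int:
    # PRAGMA database_list returns one row per open database on this
    # connection; 'main' (and 'temp', if present) don't count against the
    # ATTACH limit, so filter them out
    rows = conn.execute('PRAGMA database_list').fetchall()
    return sum(1 for _, name, _ in rows if name not in ('main', 'temp'))

def can_attach_one_more(conn: sqlite3.Connection, headroom: int = 3) -> bool:
    # the attach limit of whatever sqlite build is actually loaded:
    # 10 by default, up to 125 if compiled with a larger SQLITE_MAX_ATTACHED
    limit = conn.getlimit(sqlite3.SQLITE_LIMIT_ATTACHED)
    return attached_count(conn) + 1 <= limit - headroom
```

Any operation that would create another per-service db file would just be refused while that returns False.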
It is not effectively 125; it would be 125 if hydrus shipped its own sqlite for all platforms. Hydrus doesn't ship any custom compiled components for a very good reason: making that work for all the supported platforms and architectures would be exceedingly difficult. The bundled mpv libs come closest (which still aren't custom compiled by hydev), and those are already a huge source of pain and user issues. For optional features it wouldn't be a huge deal to tell people to go compile their own sqlite if they want it, but here, if you can't use the custom one, your hydrus would be crippled in a major way by the limit. Not to mention that putting arbitrary limits like this on the number of services is just bad design. You can't know where this will come back in the future to bite your ass (and you just know there already is some madman out there with >125 tag services. I would bet on it.)
Overall, while this could be implemented, I see a lot of downsides but no upsides besides making your backups faster, which would be better solved by you finding a better way to do those backups.
Also, I just noticed that you not only want these services in separate files, you also want hydrus to gracefully handle cases where an older version of a service db suddenly gets swapped in (hence the weekly vs. daily backups you want for some databases). This is another huge complication, possibly. What if there are references elsewhere to the missing content?
@thatfuckingbird You're being overly confrontational and making stuff up, so I'm not going to respond to anything that isn't about the technical details.
I'm not making up anything. Btw. my first comment was 2 lines, explicitly stating a personal opinion. Read your own reply to that. I wasn't the confrontational one.
Technical blockers and prerequisites are never unrelated, which is why you brought it up initially, no? I'm confused why you say it's unrelated; I assume you meant to say something different.
No. You could have this feature with or without making the PTR portable. It is not a blocker. It is unrelated. It was you who brought it up, in the OP. I just pointed out that what you wrote there is wrong.
It's the transactional/relational database model that's helping, not sqlite.
Wrong. That is one part of it. There is a lot to take care of between an application calling BEGIN/COMMIT and the data actually being recorded to disk, and sqlite does a lot to make that part work as safely as possible. I do not want any change that could possibly affect this.
There's already a sqlite binary in the db folder in the distributed releases, which means someone out there compiled it at some point and got it to work in an easily-distributed way. The build scripts bundle it in without users having compatibility issues, as far as I can tell, since I don't see people in Discord telling users to compile sqlite themselves. So "exceedingly difficult" seems like a stretch.
Yes, and now all the effort of "compiling it and getting it working in an easily distributed way" will fall to hydrus. And what about everyone else not running the releases? That is at least everyone on Win7 or older, and soon everyone on Linux too (if all goes to plan). It is a major pain and not worth it at all to deal with requiring custom binaries. Hydev might or might not want to compile his own sqlite for 2 or 3 platforms, but I sure as hell don't want to, even for 1 (I did it before, for a different compile-time limit that was affecting hydrus).
This is the only valid criticism you've made, since it's the only thing where it actually affects the design and future capabilities of hydrus.
No. Stating that from a technical POV this is a bad idea and not worth the cost at all is completely valid criticism. Who said we are only allowed to discuss abstract design and not the actual implementation?
With that said, if there's someone out there with >125 tag services right now who's doing it for some insane, invalid use case, then he'd just have to cope, wouldn't he? What's more reasonable, compressing your backups or having >125 tag services?
Could say the same to you. Why do you think that use case is invalid? Complicated backup solutions are out of scope for Hydrus; that's why it disables the built-in backup and recommends external tools for anything but the most basic db layout. The guy with >125 services is using hydrus as intended.
However, if it's just a vague feeling of "well, maybe there could be an issue of that sort," then I'd like to ask you to not make a mountain out of a molehill by calling it "another huge complication possibly."
That's why I wrote "possibly". Seems clear enough to me. It's not FUD; it is a valid concern that should be addressed. Since you wanted examples, sync status and caching immediately come to mind.
We should have a common goal here of making hydrus better, so let's act like it.
Pointing out why this change would be worse for everyone not using borg + a bandwidth-limited provider is part of this. That's all I'm here to do.
If you don't want to be called out on your passive-aggressive bullshit, then next time don't start it, retard.
I'm reopening this but after calming down a bit, I realize this guy was (mostly) right. It doesn't need to be one mapping db per tag service. The local tag services should all stay in one db. The remote ones are the ones that should be separated. I am going to reword this entire issue along those terms.
I don't have time to read all this, but after talking about this on discord a bit in the past couple weeks and a bit more today, I think I will go with this local/remote split. client.local.mappings.db and client.remote.mappings.db. We can add one more database file ATTACH without stretching things, and the local/remote split will help all sorts of backup and recovery situations. Having one file per service is more complicated than we can deal with right now.
I don't have a timeframe for this. I'll have to plan the migration carefully. Most likely the new year.
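As a minimal sketch of what that split could look like at the connection level (the two mapping file names are the ones mentioned above; the connection setup and schema aliases here are just placeholders, not hydrus's actual layout):

```python
import sqlite3

# the main client db stays where it is; the mappings split into one local
# and one remote file, each taking one ATTACH slot on the same connection
conn = sqlite3.connect('client.db')
conn.execute("ATTACH DATABASE 'client.local.mappings.db' AS local_mappings")
conn.execute("ATTACH DATABASE 'client.remote.mappings.db' AS remote_mappings")

# a backup tool can then put the large, PTR-driven remote file on a slower
# schedule than the local one
print(conn.execute('PRAGMA database_list').fetchall())
```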
Sounds reasonable and I'm sorry for the pointlessly hostile and wordy thread that preceded it. Thanks for your hard work as always.
User story: I have a split db, and I take nightly backups of it using borg, but the PTR makes it so that `client.mappings.db` changes quite a bit almost every night, resulting in borg backups, after deduplication, still being around 20GB every night. This is not a storage concern, since I do periodic pruning, but it does make the backups and integrity checks take a long time to complete: when I was using lzma compression, it would make backups take ~5 hours and integrity checks would take 13+ hours. (I eventually switched to zstd, which is 10x faster and increases archive sizes by only around 50%, which is an acceptable tradeoff for me, but maybe not for someone who's space-constrained.) Yes I do somewhat invite this upon myself by having a split db, by using the PTR, by using compression, yada yada, but I'm sure many users are in the same boat as me.

If the PTR's tag service was its own db, then I could back it up weekly or monthly instead of nightly, without having to manually pause PTR processing. Of course, treating the PTR as a special case is never the goal of my FRs like this. I am interested in general solutions. It's easier to write code that treats the problem generically, and it makes the feature more flexible.
NOTE: originally, I worded this as a matter of making it purely generic, not differentiating between local and remote tagging services. I got pushback on that, and after cooling down, I realized that there is no need to have a separate db file per local tagging service. Ultimately, the problem I'm facing is with the PTR and would apply to any remote tagging service, which are under far less of the user's control than local tagging services are. I think a suitable solution for this would be to separate the remote tag service mapping dbs into their own files. For most users, this would mean just one remote tag service, the PTR, but for some, it might mean several db files, which would exceed the default limit of 10 attached dbs at a time. IMO, there could be a lot of benefit for users of deduplicating archivers like borg if each remote service got its own mapping db, but I understand that it adds complication, so an acceptable compromise would be "one separate mapping db for all remote tag services, together."
Click this for the original text of the rest of the issue, which is still relevant depending on whether you want to go with "one mapping db per remote tag service" or not.
As discussed in discord, [the default hard limit for the maximum number of attached databases is 10](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.setlimit), but this can be increased by [recompiling sqlite with the macro `SQLITE_MAX_ATTACHED`](https://www.sqlite.org/c3ref/limit.html) set to [as high as _125_](https://www.sqlite.org/limits.html#max_attached). I found a handy blog post that, while focused on enabling other aspects of sqlite rather than increasing its attach limit, does show [how to compile sqlite for use with python](https://charlesleifer.com/blog/compiling-sqlite-for-use-with-python-applications/), so I figured it might be useful to share. Whether this means you would distribute a recompiled sqlite alongside hydrus or you would tell power users to do the work of recompiling and installing sqlite themselves, I don't know.

Once the limit is sufficiently increased, you could split the mappings db into one db per tag service, plus an extra miscellaneous mapping db for everything that isn't a tag service, I guess. You know your db layout better than I do. You could also add a warning to users who try to add more tag services than would work with the current connection limit (accessed with [`getlimit`](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.getlimit)), minus 2 or 3 for some future wiggle room in case you want to add more standard dbs that would still work within the default limit of 10.

Other user stories for this change:

* ~~New user wants to sync with the PTR quickly. This change could make it easier to pass around a synced PTR db that wouldn't be absolutely huge, saving each user hundreds of hours of computation. Multiply this by thousands more users over time (plus, the PTR only gets bigger and longer to sync over time) and you've got savings in the millions of hours of computation.~~ Apparently, there's another issue with hash ID uniqueness that would also have to be addressed before this benefit could be realized, so consider this moot.
* (Probably niche) Would let a user, through symlinking, put mapping dbs that get more I/O on separate disks to increase total I/O bandwidth (probably applies mostly to spinning disks), or put the wear on disks they're more willing to wear through faster.
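For completeness, a small probe using only the stock `sqlite3` module (nothing hydrus-specific; the alias names are made up) that reports which attach limit the loaded sqlite build was actually compiled with - handy for verifying that a recompiled sqlite is really the one being picked up:

```python
import sqlite3

# attach in-memory databases until sqlite refuses, to see the compile-time
# limit of whatever build this Python process actually loaded
conn = sqlite3.connect(':memory:')
attached = 0
try:
    while True:
        conn.execute(f"ATTACH DATABASE ':memory:' AS probe_{attached}")
        attached += 1
except sqlite3.OperationalError:
    # raised as "too many attached databases - max N"
    pass

print(f'sqlite {sqlite3.sqlite_version}: {attached} attached databases allowed')
```

On a default build this should report 10; a build compiled with a larger `SQLITE_MAX_ATTACHED` should report more.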