hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

Feature request: use booru for tags only #9

Closed ghost closed 8 years ago

ghost commented 9 years ago

Okay, so, I apologize if I missed this like the other stuff, but I've been playing around with the booru downloader and can't find it:

Is it at all possible (from a technical standpoint) to use a booru (specifically, e621) purely as a tag source for files I already have locally, without downloading any additional images? IIRC, they use some kind of hashing to identify unique files, but I don't know how easy or hard it would be to match that against a local library's hashes.

What I'm looking for is basically a feature that allows me to make a local selection and say "download all tags from X booru for these images, according to my default tag options". Or to install a booru as a tag service that I can sync with like a tag repo.

hydrusnetwork commented 9 years ago

This might be somewhat possible, and I have something brewing that might fulfill what you want, but it might not be simple.

First off: In hydrus, I primarily use sha256 for my hash, while most websites use (the slightly aging) md5. I generate md5 and sha1 for every file in your db just in case they are needed (the 4chan API offers the md5 of each file, for instance, which lets me skip downloading anything already in your db), but websites, even if they do offer an API that responds to hashes, do not usually speak sha256, which makes large operations and intra-hydrus syncing a little tricky.

The problem of syncing with a booru is two-fold:

1: Having a system that goes [md5 hash]->[page/tags on e621]. (As far as I can tell, e621 creates page urls using an internal file_id, so unless they have an API that answers to md5 that I can access, we are out of luck there.)

2: Even if we can fetch tags on demand based on file hash, we don't want to DDoS e621 on the way. (Let's say a user checks for a mere 10,000 files, of which only 2,000 are actually on e621. That means 8,000 404s on top of the 2,000 page views, just for ~40,000 tags. If the user set it to check for new tags every day, we have a problem.)

I'm all about local caches rather than spamming sites for new data, so I'd rather have a local archive of all the tags on e621 that the client can refer to as needed. You can spam such an archive as many times as you like, and really damn fast, and you are only costing yourself CPU time. Of course the archive would not be absolutely up to date, but I don't see a way of individually syncing without a lot of impossibly large and rude http traffic (unless e621 were to offer a sha256 syncing-optimised tag repository API, but that's on them, not me! I suspect their business model is serving ads to browsers, not pleasing hydrus users' insatiable lust for metadata.).

However, in this vein, a user recently sent me a couple of snapshot-archives they created of md5->tags for danbooru and gelbooru. I forget exactly, but I think there are several tens of millions of mappings in each archive(!), and they are about 80MB zipped up and 600MB unpacked. I have wanted to write a simple parser so the hydrus client can query these archives. If I build that, then anyone can make a similar md5->tag dump of any site and hydrus users can share them (via torrent, fileshare, whatever). This will make things run faster and save e621 lots of bandwidth.

I can't remember how the guy got all those tags; maybe he scraped every page of danbooru? Did danbooru ever have an API? Does it now? I don't know.

Anyway, something like this is on my list of things to do, but I am not totally sure on how to move ahead. I am not sure on the workflow either. Should it be a button on the manage tags dialog that says 'check danbooru archive for tags', or an automatic check on all files, or selected files, or what?

I've sort of dumped my thoughts here. What do you think?

ghost commented 9 years ago

Okay, this is useful stuff, and I'm going to dump my own thoughts back to you in the same format.

Checking E621 reveals that they base URLs on an internal ID of some kind, but that their actual filenames for images are MD5 hashes. (Which has annoyed me for some time, because it means when you save stuff you just get a hash - no metadata in the filename - whereas on FA you get the artist in the filename in a way that is regexable or file-searchable).

Poking around I found a couple interesting things.

-E621 does have an API (https://e621.net/help/api). It does not seem to allow fetching posts by anything other than internal ID, which seems a bit useless, but maybe I don't know enough about coding to figure out how.

-You can, however, show a post's page via HTTP by doing this with its MD5: https://e621.net/post/show?md5=b67800df3cbdb4cc92e0bae25b7a60c0

-You can also ostensibly search by MD5, but it doesn't seem to work right.

-However, once you've got a post's page, you can get its ID, which should be in the same place in the page source every time, or at least regexable out of the page source.

-Presumably you can then use the ID to get the tags via the API, in the same way as I am guessing you do on the downloader page.

So, if nothing else, this would seem to provide a way to implement this functionality - but it's not elegant and it's certainly not ideal. Web scraping is gross, doesn't work cross-site, has a tendency to break, etc., and it's a bandwidth hog (as you pointed out). And if we throttle it to, say, 2 requests per second or whatever the "internet standard" is for script requests, syncing takes 9ish hours for a 60k image library (which isn't actually unacceptable, but you wouldn't want to do it every day... then again you wouldn't really have to, you'd just have to run it on new stuff).
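To make that route concrete, here is a minimal per-file lookup sketch in Python using the requests library. It assumes the show?md5 and show.xml?id urls linked above, that the post XML carries its tags as a space-separated attribute on the root element, and that the post id can be regexed out of the page source - the pattern below is a placeholder and would need adjusting to the real markup:

import re
import xml.etree.ElementTree as ET
import requests

def fetch_e621_tags_for_md5(md5):
    # 1) pull up the post page for this md5 (url format as linked above)
    page = requests.get('https://e621.net/post/show', params={'md5': md5})
    if page.status_code != 200:
        return None  # file is not on e621
    # 2) regex the internal post id out of the page source (placeholder pattern)
    match = re.search(r'/post/show/(\d+)', page.text)
    if match is None:
        return None
    # 3) ask the api for that post and take its space-separated tag list
    post_xml = requests.get('https://e621.net/post/show.xml', params={'id': match.group(1)})
    tags = ET.fromstring(post_xml.content).get('tags')
    return tags.split(' ') if tags else []

A caller looping over many local files would still want a short sleep between requests to stay at the polite rate discussed above.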

You're right that having a local copy of the tag database would be a lot better - provided you had a way to update that database periodically - but I don't actually see a good way to get one. I've been digging around today: it looks like there was a danbooru db-dump posted by the admins a few years back, but nothing more recent than that and nothing from e621. And while I've found references in forum threads to tag database downloaders, all I can turn up on Google are actual image downloaders, which Hydrus already does better and which don't solve this problem.

Grabber (http://www.bionus.fr.cr/imgbrd-grabber/) claims to be able to "add entries to a database for each image or tag while downloading", but I couldn't figure out what exactly they mean by that or how to activate it. It is open source, though - might be something useful in the code (I, unfortunately, am not a coder, so I wouldn't know). When I ran it and ran an empty query on E621, it showed me the top 20 tags with image counts, which is tantalizing, but I couldn't go anywhere from there other than "download all of the images".

I kind of do suspect that your dan/gelbooru tagdb user just scraped every page for each site. I wish I knew how he did that, because I would not have the first idea how to write a script to do it, but I'd be happy to leave one running on e621 from my machine for as long as it took.

...thinking about it further, the e621 API would let you do this without scraping. Set a loop to query all IDs from 000001 to whatever (currently somewhere north of 550000), do /post/show.xml on each one, toss everything from the XML response except the id, md5, and tags attributes, and log those for each. I still don't know how to write that, and it would take days, but it's definitely possible (and, again, if you gave me a script to run I'd run it). You could even do updating (for new files, not for new tags on old files - that'd probably require a full re-run) by specifying the start ID, or by checking the output DB or directory-of-text-files or whatever for the highest ID and starting there.
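For what it's worth, a rough sketch of that loop might look something like the following (Python with the requests library; the output directory name, the ceiling ID, and the half-second delay are all just assumptions for illustration):

import os
import time
import requests

OUT_DIR = 'xml_dir'      # one file per valid post id, e.g. xml_dir/496713.xml
HIGHEST_ID = 550000      # "currently somewhere north of 550000"

def rip_all_posts():
    if not os.path.isdir(OUT_DIR):
        os.makedirs(OUT_DIR)
    # resume from the highest id already on disk, so crashes only cost a little time
    done = [int(name.split('.')[0]) for name in os.listdir(OUT_DIR) if name.endswith('.xml')]
    start = max(done) + 1 if done else 1
    for post_id in range(start, HIGHEST_ID + 1):
        response = requests.get('https://e621.net/post/show.xml', params={'id': post_id})
        if response.status_code == 200:
            with open(os.path.join(OUT_DIR, '%d.xml' % post_id), 'wb') as f:
                f.write(response.content)
        # deleted or missing ids just 404; either way, keep the request rate polite
        time.sleep(0.5)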

(I wouldn't be surprised if this approach worked for most imageboards. Danbooru and Gelbooru do have APIs, which look pretty similar, but their documentation doesn't look as comprehensive as e621's).

The question then is how to translate that into something Hydrus understands and allow Hydrus to actually query it. Maybe translate it into a tag repository running on localhost? Not sure how easy or hard that'd be either. (See below for more thoughts on this - MD5 might make it difficult.)

Regarding workflow, it sort of depends how you implement the base functionality.

I was thinking initially of something in the rightclick menu to download tags from boorus, but that only works if we go the direct "query MD5, scrape ID, request tags for ID" route, which as you pointed out is really bad for a large number of files - and you don't have a way to prevent users from doing that without arbitrarily restricting the max selection and throwing errors, which IMO is bad UI design.

I guess you could also write such an option to query a local database dump instead of the remote booru, so the same UI considerations would hold there. You're right that having it in Manage Tags might make more sense, except that Manage Tags blocks interaction while it's open, and this is going to take a long time for any large matching operation - so unless it drops you into a new tab with a progress bar when you select the match-tags option, it might work better as a rightclick (I'm thinking it goes to a progress bar like the tag repo one, floating in the notification area).

If you go the local database route it might make more sense to parse it like an additional tag repo (added through Manage Services), so that everything automatically sees tags from the local DB in the same way it does for repos. There's obviously no way to go from an MD5 to an SHA hash without downloading and hashing the full image, so the client would have to query this ersatz repo by MD5, and I don't know if your repo code is readily adaptable to do that, so whether or not that involves tons of work might be the deciding factor for whether or not emulating repos makes sense.

With either option, you could have a checkbox in the Import -> Advanced area for "match tags with local booru db" (or "with X booru" as the case may be).

Anyway. Sorry for the wall of text. This is an intriguing problem. Let me know if there's anything a non-coder can do to help with it.

ghost commented 9 years ago

Update: I have convinced my housemate, who is something of a coder, to write a Python script to scrape all metadata for all images off e621.

Here's what the raw XML documents he's getting back look like: https://e621.net/post/show.xml?id=496713

For now I'm having him just dump individual XML files into a directory; I figure we can parse them into whatever Hydrus needs later (once you're certain of how you want to approach this). Let me know if we should alter the format at all.

ghost commented 9 years ago

(as an aside, being able to process rating="s" into rating:safe, as a namespace tag, both for downloading and tagmatching, would be a really useful feature. The ratings on e621 are e, s, q, for Explicit, Safe, and Questionable; Danbooru and Gelbooru at least have similar rating systems, which makes sense since all three boards seem to use derivatives of the same software.)
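The mapping itself would be tiny - a hypothetical Python sketch, assuming the rating letter has already been pulled out of a post's XML or API response:

RATING_NAMESPACE_TAGS = {'s': 'rating:safe', 'q': 'rating:questionable', 'e': 'rating:explicit'}

def rating_to_tag(rating_letter):
    # turns e.g. rating="s" into the namespaced tag 'rating:safe'; returns None for anything unexpected
    return RATING_NAMESPACE_TAGS.get(rating_letter)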

ghost commented 9 years ago

I'm also, at an exponentially lesser level of priority, interested in scraping tags from a local Derpibooru database. I realize you do not support Derpibooru in Hydrus, but I'm guessing the local data can be re-formatted by a script on my side to look like something Hydrus can eat - basically, Hydrus could then parse it as though it were a DB from some booru that you do support. I'd probably have to scrub through and namespace stuff manually, but that's not a big deal.

I modified the script (poorly, but it serves) to pull their API data, which is only available in JSON and looks like so: https://derpibooru.org/images/9598.json

Interesting findings:

-Derpibooru has 2 IDs - "id" and "id_number". "id" appears to be a hash, but does not seem to match MD5 or SHA1. "id_number" is the usual sequential identifier.

-There is no identified MD5 hash.

-There is a sha512 hash.

-They have both "tags" and "tag_ids". I cannot tell why some tag_ids are just the tag and others are some kind of hash. Hydrus probably wants "tags".

-Ratings are done as a normal tag.

So actually, with the above in mind, since Hydrus doesn't use SHA512, there's probably no way to parse this data.

ghost commented 9 years ago

Okay, here's the full rip of E621 up to today:

https://mega.co.nz/#!YsEyiTzD!wM6VHh_SOq2O9lzg7fKigGA-HcSAcmxxMkDI4kUxM0Y

This is still in the form of individual XML files for every (valid) ID on the site. This means that it's about 550,000 0.5-2kb files. It expands to about 542mb, but if you're on a default NTFS file system, the minimum allocation unit size is 4kb, so it will actually take up about 2.1gb of space.

Obviously this will need to be reformatted into some kind of database, but I thought I'd upload what I've got so we can figure out what needs to be done on it.

This file also includes the ripper script I used. It has a tendency to crash randomly, but it saves its place well - so I worked around the problem by creating a batch script that kills python.exe and relaunches the script, and told Task Scheduler to run that every 30 minutes.

hydrusnetwork commented 9 years ago

I'm going to attempt a prototype of this over the coming week. I'll also try to write a mini-library to make it easy for any programmer fluent in python to create a database that hydrus will understand.

hydrusnetwork commented 9 years ago

Ah, just saw the e621 thing. Thank you for doing it. I will convert it to my new format and include it in next week's release post.

ghost commented 9 years ago

Oh, awesome. Thanks!

Yeah, sorry, I know I've been clogging this post up with thoughts and data this week. I absolutely am not intending to nag you or anything, just trying to provide as much useful info as I can.

hydrusnetwork commented 9 years ago

Please do post as much as you like. It is never a hassle. I put everything in my to-do, so nothing gets missed.

FYI: I made a basic db schema and the mini-library today, and munged ~32 million danbooru tags into a 1GB sqlite database that 7zipped down to 256MB. I'll do my gelbooru archive and your e621 tomorrow, and then I'll plan how to add some gui stuff for it. It all went much more easily than I expected.
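The schema itself isn't posted in this thread, but judging from the tables referred to further down (tags, mappings, namespaces, and hashed files), a minimal md5->tags archive along these lines would be a reasonable guess - treat it as an illustration, not the actual mini-library format:

import sqlite3

db = sqlite3.connect('example_archive.db')
db.executescript('''
    CREATE TABLE IF NOT EXISTS hashes ( hash_id INTEGER PRIMARY KEY, hash BLOB UNIQUE );
    CREATE TABLE IF NOT EXISTS tags ( tag_id INTEGER PRIMARY KEY, tag TEXT UNIQUE );
    CREATE TABLE IF NOT EXISTS mappings ( hash_id INTEGER, tag_id INTEGER, PRIMARY KEY ( hash_id, tag_id ) );
    CREATE TABLE IF NOT EXISTS namespaces ( namespace TEXT UNIQUE );
''')
db.commit()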

hydrusnetwork commented 9 years ago

I've added a Derpibooru gallery parser/downloader to my longer-term to-do as well. After a brief look at their HTML formatting, the bits I need to parse look just like a normal booru, so it shouldn't be too difficult.

ghost commented 9 years ago

Brilliant, on both counts. Should I start running the JSON scraper against all of Derpibooru, or does their use of SHA512 preclude their longer-term inclusion in the local DB parser?

ghost commented 9 years ago

Also, I had been wondering - Hydrus gets namespaces for tags on boorus. How does it accomplish that, technically? I'd noticed in the various API dumps I'd looked at that, for example, "character:kipfox" is just "kipfox" - namespaces are added by the booru server and don't appear to be in the raw tags.

Does Hydrus scrape the "all tags" library for each booru and associate namespaces that way, or is there some other way to do that I'm missing? And will this method work when scraping a local DB, which doesn't include the namespaces?

ghost commented 9 years ago

(Side note, spending my day sitting in a comfy chair with an Xbox controller hooked up to my computer, mapped to keys through Antimicro, and the keys mapped to a custom filter in Hydrus, tagging my 10,000 wallpapers. You've revolutionized my handling of images and it's glorious.)

hydrusnetwork commented 9 years ago

Sorry for slow replies here; I've had issues this week that stopped me working on hydrus as much as I'd like.

I'm going to calculate sha512 hashes for all files in the v136 update, so feel free to scrape Derpibooru or anything else with sha512.

I get my namespaces by parsing the html of the image page. I don't think I use any APIs in my gallery downloaders; it is all ugly html parsing that mostly works. On the image page, the html classes are different for the different namespaces (I think they probably colour the text through css that way). You can see the specific settings for all this stuff in services->manage boorus, under 'tags', although that dialog is hard to understand, and mostly just for my debugging benefit. Usually it is something like 'tag-type-artist' for the 'creator' namespace, and so on.
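As a rough illustration of that kind of parsing (not hydrus's actual downloader code), something like the following would work, assuming BeautifulSoup and that each tag sits in an element whose css class encodes its type - only the tag-type-artist -> creator pairing is confirmed above; the other entries in the map are guesses:

import re
import requests
from bs4 import BeautifulSoup

# hypothetical css-class -> hydrus-namespace map; adjust per booru
CLASS_TO_NAMESPACE = {
    'tag-type-artist': 'creator',
    'tag-type-character': 'character',
    'tag-type-copyright': 'series',
    'tag-type-species': 'species',
}

def parse_namespaced_tags(post_page_url):
    soup = BeautifulSoup(requests.get(post_page_url).text, 'html.parser')
    tags = []
    for element in soup.find_all(class_=re.compile(r'^tag-type-')):
        for css_class in element.get('class', []):
            namespace = CLASS_TO_NAMESPACE.get(css_class)
            if namespace is None:
                continue
            # where the tag text actually lives depends on the site's markup;
            # on many boorus it is the text of a link inside this element
            link = element.find('a')
            if link is not None:
                tags.append(namespace + ':' + link.get_text().strip())
    return tags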

I hadn't thought of mapping filter actions through an xbox controller. I like it! I had the exact same problem as you -- too many files, and no software (as far as I could see) that would actually process it how I wanted. I touched inbox-zero for about three days a year ago, and now I've fallen behind -- I think I have ~250,000 archived and ~35,000 and growing in my inbox.

I've just uploaded the processed archives for danbooru, gelbooru, and e621 to mediafire. My post (which will go up in an hour or so) links to them if you want to check them out, and I talk about and include the mini-library to interface with them using python. They are easy to update with new data.

Danbooru is 1GB (I think ~32 million mappings)
Gelbooru is 1.25GB (~42 million)
E621 is 350MB (~12 million)

Which I thought was interesting. They 7zip to about 25%.

ghost commented 9 years ago

I have found Xbox controllers to be wonderful multipurpose input devices - any time you need a smallish set of predetermined actions at your fingertips in an ergonomic fashion (that is, without dragging your keyboard to your couch or whatever), binding keys to a controller is usually the best way to go.

Saw your post. Great stuff - I don't know enough python or database stuff to really understand the posted script, but it looks like the basis for a very workable and expandable solution. If I understand it correctly, it's a library designed to take preparsed tag data, and so it should be called after a custom parser has already run across an XML or JSON rip and dumped it out to a table of either "hash, tags" or "hash, tag", right?

Namespaces seem to be a tough problem. I didn't realize they wouldn't be included in the e621 bare tags until after I did it, so I didn't include them - my immediate goal now is to rectify that, because the namespaces (in particular, creator/artist) constitute a significant chunk of what I find valuable about the e621 metadata.

So: The e621 API does expose tags, 20 or so at a time, with count of posts, tag ID, and tag type (which is an integer that maps to a namespace). I can probably just rip it page by page, and if I can't do it via the API then I can definitely do it via wget magic and manual HTML scraping with TextPad.

What I'd end up with either way would be a text file for each e621 namespace that would contain Hydrus-fied namespace tags for every e621 tag in that namespace (so, for example, "creator:zen" instead of just "zen" (or "artist:zen")). I could also trivially make that a CSV with the bare mapping (so, "creator:zen,zen").
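If the ripped tag list ends up as (tag, type) pairs, generating that CSV is only a few lines of Python - note that the numeric type -> namespace map below is a guess apart from 0 = general and 4 = character, which come up later in this thread:

import csv

# guessed e621 tag-type -> namespace map; only 0 (general, no namespace) and
# 4 (character) are confirmed further down this thread
TYPE_TO_NAMESPACE = {1: 'creator', 3: 'series', 4: 'character', 5: 'species'}

def write_namespace_csv(tags_with_types, path):
    # tags_with_types: iterable of (tag, type_integer) pairs from the ripped tag list
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for (tag, tag_type) in tags_with_types:
            namespace = TYPE_TO_NAMESPACE.get(tag_type)
            if namespace is not None:
                writer.writerow([namespace + ':' + tag, tag])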

How would we insert that data into the existing rip of e621's tags? We'd have to detect that "zen" should be replaced with "creator:zen", then replace every instance of "zen" with "creator:zen", and then repeat that for probably 1/3 to 1/2 of the tags in the database. I can't tell if that's something your library is designed to do as such, and I don't know if it's trivial or not in SQL. I also can't think of a good way to do that against 550,000 unparsed XML files. Let me know what the best angle of attack is and I'll get on it.

(Side note: how did you parse the XML rip into a format your tag archive construction library will take? Despite not being a "real" coder, I'm really interested in seeing that logic.)

I will also get working on a Derpibooru JSON rip, and maybe get my roomie to write a parser using your library. I'm not sure if they have namespaces as such, so I'll have to look at that. If they don't, mass-adding tag siblings will probably be the only option (maybe at some point down the line you could add a "bulk add from CSV" option to the Tag Siblings and Tag Parents windows, to facilitate this kind of thing?)

ghost commented 9 years ago

@marcules - hey, awesome, thank you! I really appreciate that.

@hydrusnetwork - Okay, now that we have that XML file, what should I do to merge it with the rip? I am thinking that I should start with replacing the type="4" or whatever with type="character" or whatever, but I'm flying by the seat of my pants from there.

After deleting the "type=0" ones, which we obviously don't need to worry about, there are 78073 tags with namespaces (interestingly, about 30,000 of those have a count of 1).

I immediately noticed that these tags use foreign Unicode characters that are in numeric character reference format. I was worried about this, but I figured out how to convert them to UTF-8, and in the process realized that TextPad does not support Unicode, like, at all. Notepad++ it is, I guess, but ugh I was really comfortable in TextPad, why must Helios betray me so. Examining e621.db in SQLiteStudio, it appears the Unicode is fine there, so, presumably nothing to worry about.

Further examining e621.db in SQLiteStudio, it looks like all I would have to do is create my CSV as above (character:zen,zen) then do a find-replace operation on the "tags" table that replaces anything found in CSV column 2 with what's in CSV column 1, and then run a script that simply does "hta.GetNamespaces". I'm kind of assuming that if I had any knowledge of SQL at all I'd know a way to do that through manual query commands, but... yeah. I know it's incredibly wrongheaded and heretical of me to even think this, but my background is such that I am seriously wishing I could yank the tags table out, edit it in Excel, then put it back in.

(Actually Google is suggesting that I may, in fact, be able to do something sort of like that. If you guys don't have a laughably trivial solution for this I'll dig around this weekend.)

Other things I'm going to have my roomie look at: parsing the ratings out of my XML files and using the HTA library to insert rating: as a namespace tag in the e621 rip. That seems pretty straightforward to do with the commands in the HTA library.

ghost commented 9 years ago

...it occurs to me that I don't need to find anything, I just need to replace the whole tags table in the DB with the list of tags (including type 0) after making them namespace-y.

That sounds a lot easier. Will mess around with this more.

ghost commented 9 years ago

Okay, here's the namespace-ified taglist: http://a.pomf.se/tbedve.txt

I'll mess with DB stuff after work, but yeah. Immediately noticed there are about 5000 more lines in this file than there are rows in the tags table of e621.db, so I'll have to reconcile that. If anyone has any idea how that happened I am interested in hearing - I don't know if I buy that 3% of the tags on the site were added in the last 5 days, and e621 handles aliases/tag siblings prior to writing to their DB, so it isn't that.

Side note, I kept thinking I was seeing corruption from my find/replace ops, and then I realized that, yes, things like creator:(_:d)rz=3, (mlp (with no close parens), <3, #@?! and the like are actually tags that people have put on images on e621.

My reaction to this: http://img1.wikia.nocookie.net/__cb20101223062640/facepalm/images/e/ea/TripleFacePalm.jpg

ghost commented 9 years ago

It also occurs to me that it is important to be certain that no tags have colons in the body of the tag, if you're doing namespaces. For example, series:2001:_a_space_odyssey would seem to pose a problem.

Edit: There are 86 of these in the namespaced list. The regex to call them out is [\S]+\:[\S]+(\:[\S]+)+. This does not include the tag, mentioned above, which has a goddamned full-width colon in it, but I assume HTA doesn't notice that since it is technically a distinct character.

@hydrusnetwork - how would Hydrus parse the above if it showed up in a tag archive? Does it only take the first colon, or would we get some kind of weird interaction?

(Also now that I think of it, does Hydrus automatically replace underscores with spaces, or should that be done when the tag archive is being generated?)

ghost commented 9 years ago

Preliminary examination suggests the discrepancy in lines of tags vs. rows in the DB is accounted for by tags that only exist on deleted posts, which are for some reason kept on e621's tag list but of course do not show up in the rip. Now that I know why they're there and don't have to worry about them, getting rid of them shouldn't actually be hard.

Side note, I am amused to discover that your e621 tag database did capture the completely empty tag, tag_id 120505, which is literally nothing (zero characters) and which has a count of 2 on e621 but which I can't find out the mappings for on the site, because, being zero characters, it is of course unclickable.

I am simultaneously very impressed that your parser caught that and utterly mindboggled that it somehow exists in the first place. It's getting to the point where I kind of want to remove every tag with under about 5 mappings from my copy of the DB before I let Hydrus have it.

hydrusnetwork commented 9 years ago

@disposablecat Your text file with all the namespaced tags is great. As you guessed, updating the tags is easy with a bit of SQL. I would do something like:

(this is python, using sqlite3 for the SQLite cursor 'c')

import sqlite3

c = sqlite3.connect( 'e621.db' ).cursor()
for line in open( 'my_text_file_of_namespaced_tags.txt' ):
  namespaced_tag = line.strip()
  ( namespace, subtag ) = namespaced_tag.split( ':', 1 )
  c.execute( 'UPDATE tags SET tag = ? WHERE tag = ?;', ( namespaced_tag, subtag ) )
c.connection.commit()

And then, when it is all done, playing with hta.RebuildNamespaces() so the namespace cache is good.

If you or your friend aren't confident of doing that, I've got your file and can do it my end.

In hydrus, I accept colons in the subtag. series:2001:_a_space_odyssey is fine, and treated as ( series, 2001:_a_space_odyssey ). The namespace is whatever comes before the first colon, and everything after that just becomes part of the subtag. That is what the '.split( ':', 1 )' is doing, rather than split( ':' ), which would split it into n pieces rather than a guaranteed two.

I parsed all that xml by opening a python shell and doing something like:

import os
from lxml import etree

for filename in os.listdir( 'xml_dir' ):
  root = etree.parse( os.path.join( 'xml_dir', filename ) ).getroot()
  hash = root.get( 'md5' ).decode( 'hex' )
  tags = root.get( 'tags' ).split( ' ' )
  hta.SetMappings( hash, tags )

with a try-except block to catch problems and a thing that printed its progress every thousand files so I knew it was still working. It took about ten or fifteen minutes, I think. About a hundred thousand or so of the files had been deleted from e621 - I don't know why so many. Those xml files were missing an md5, so I couldn't add them. I assume that is indeed where the five thousand missing tags are from.

I laughed at you encountering all those weird tags. In working on hydrus, I've run into so many snippets of incomprehensible unicode, so many human spelling errors, and no end of results I have just parsed incorrectly myself, that it was nice to see someone else seeing it too.

When I first started this, I thought I could create a 'perfect' tagging system, and in pursuing that I figured I would be swamped with petitions and be fighting flame wars on the correct application of certain tags, but thankfully the good tags greatly outweigh the bad, and the bad don't really matter. I just allow everything and assume we'll either sort out anything that is incorrect in the future or just forget about it.

I feel the same way about underscores. At the moment, I don't want to force anyone to not use underscores, so if there is a common 'artist_name' that should be 'artist name', I'll let someone add a tag sibling or just manually tag-and-petition the correction to my repo. Many times I've ended up putting too much time into trying to get the 'exact' tag system when a 70% 'good enough' system was fine.

In that vein, I think hydrus has a tag filter that'll ignore those empty tags, so it doesn't matter too much. I'd be interested to know how e621 accepted them, though!

ghost commented 9 years ago

Those code snippets are useful, thanks. Currently I think I have a non-code way to put the updated list back into the HTA DB so that I can run the namespace commands (out of IDLE, probably) - there's an SQlite editor I found that can import/export CSVs from tables. If that fails, I'll try that code.

Yeah, a LOT of duplicates get upped to e621 - people don't bother searching before they just dump the latest posts from Furaffinity or Tumblr. Lots of downscales and such too. And yes, there's no MD5, which makes the data annoyingly useless (if they'd left the MD5 I could at least capture some tags for some of the downscales I inevitably have), so I kind of wonder why it's still accessible from the API at all. That's definitely where the extra tags are from.

At this point the problem I'm having is actually scrubbing out those deleted tags. I thought I got Excel to do it by splitting the namespaces to a separate column and comparing the CSVs, but then it turned out it deleted too many. So, later today I'm going to try to find some sort of dedicated list comparator engine I can use.

(It occurs to me that if I am reading your update code correctly, it would actually just skip the deleted tags. I might try that directly, but for now I'm enjoying the challenge of doing this list comparison correctly - I can never pass up the chance to learn something new about text processing.)

Good to know about colons. Regarding spaces, the problem I see is that tags displayed with spaces on the e621 booru have underscores in the DB, for some damn reason (probably because tags are separated by spaces in the DB). So among other things, if you were to download stuff from e621, your tags wouldn't match what you got out of the HTA db and you'd have dupes (brown hair, brown_hair).

That's easily fixed though - once I have the correctly scrubbed taglist, I'll just find/replace underscores with spaces before I put it back in the DB. It's a lot easier than the fun I'm going to have doing tag siblings in the GUI for "creator:furaffinity artist username" (which is easily regexed out of FA-sourced files when they are imported to Hydrus) to "creator:Artist tag on e621", which is a process that is impossible to not do manually.

I basically figure, yeah, all of these horrible tags have microscopic counts, so first off the odds are I'll never see them because I don't have the images they map to, and second off if they become a problem I can just delete them after import by selecting all, managing tags, sorting by incidence ascending, and clicking through. I definitely agree that "good enough" is really all you need for something like this - the point is to be able to use the big, well-curated tags, not the little overly specific stuff (I swear to god, cum_on_knee is one of the deleted tags). And honestly, at this point, having tags for my local stuff at all is so refreshing that I couldn't care less whether it's comprehensive.

Side note again: You know what'd be really neat, when you do the GUI stuff? A way to export tag archives for a specific set of images. That way someone can share their nicely curated tags for stuff (on a filesharing site or an imageboard or whatever) without necessarily having to run a public repo.

hydrusnetwork commented 9 years ago

@marcules Thanks for the updated gelbooru and danbooru archives. They look great. I'll upload them to my mediafire for this week's release.

@disposablecat lol at cum_on_knee. I love the internet.

Exporting is a great idea. I don't really like the current way to export tags. This would be a better replacement, at least for lots of tags. I'll add it to my to-do list.

ghost commented 9 years ago

While I work on this, things I noticed about Derpibooru:

-their only namespaces are Artist, Spoiler, and OC

-these are encoded in the DB as namespace:tag

Therefore, all that should need to be done when I parse this rip is to run the namespace HTA command. Shame they don't tag characters, but what can you do?

ghost commented 9 years ago

Tags that exist: f̵͕͓͇̹͙̅ͭ̑̽ͭe̛̹̜̯̫͍͕͉̰̰̫̎ͣͤ͛̔͒̒͊ͮ̓͟ma̝͍̼̮͖̞͎̬̳̩ͩ̂ͪ̑ͧ͊͆̏ͧ̔̀̕l͙̠̦͎̞ͨ̎ͦ̃̚͝e

Which looks like about a 40 character line when you're viewing it in textfiles.

WHY. JUST. WHY. (Count of 1 mapping, of course.)

(I think I've almost got this. Stand by.)

ghost commented 9 years ago

@marcules I don't know how other people consume art, but I've got about equal quantities of explicit and non explicit - so, yeah, ratings would be awesome. If you would be able to pull the ratings out of the e621 XML dump I upped and insert them into the next DB I upload (which should be soon, once I'm clear on the uniqueness issue) that would be super appreciated.

(I may also ask you to parse Derpibooru once it's done, since my roomie a) is busy with thesis and b) has a vocal opposition to pony).

ghost commented 9 years ago

(re-adding accidentally deleted comment.)

Hey @hydrusnetwork - quick question, will the tag archive format utterly break if I set "tag" in the "tags" table to not be unique? Or will it behave fine with two tagIDs that map to the same tag?

I ask because as part of cleaning this up, I fixed everything that was in the general format creator:name (artist), replacing with just creator:name. However, because of bad tagging, that has created a ton of duplicate tags.

From the perspective of the user, this is desirable, because if those tags can be considered the same when Hydrus imports them, then the imported tags are cleaner and more unified. But I'd like to make sure this won't break anything horribly before I upload a version with it done.

Here's the CSV I'm using: http://a.pomf.se/ipjrxe.txt It is tags from the e621.db you uploaded, but with namespaces added, underscores replaced with spaces to match the way they appear on the booru, and the above duplication in place.

hydrusnetwork commented 9 years ago

@marcules Sure, I'll have the csv file, thanks. It'll be interesting to look through if nothing else.

I don't mind a ratings namespace. I'm not fond of ratings myself, because no-one can agree on precisely what questionable or safe mean. If you want to include them, feel free, but do make sure they are namespaced so people can easily exclude them from mass-imports.

@disposablecat If you make tag non-unique, I think you'll be semi-fine but not great. Deleting mappings through my code interface will break, and maybe other things will as well. I would suggest collapsing the surplus tag_ids in code, but if you don't feel confident doing that, you can do the non-unique thing if you like and send me the db and I'll do it my end.

If you are picking up any python or other higher level language, I would recommend something like:

for ( original_tag, new_tag ) in rename_generator:
  result = c.execute( 'SELECT tag_id FROM tags WHERE tag = ?;', ( new_tag, ) ).fetchone()
  if result is None:
    c.execute( 'UPDATE tags SET tag = ? WHERE tag = ?;', ( new_tag, original_tag ) )
  else:
    ( new_tag_id, ) = result
    ( old_tag_id, ) = c.execute( 'SELECT tag_id FROM tags WHERE tag = ?;', ( original_tag, ) ).fetchone()
    c.execute( 'UPDATE mappings SET tag_id = ? WHERE tag_id = ?;', ( new_tag_id, old_tag_id ) ) # this line may take a long while to process because tag_id is not indexed in mappings

hydrusnetwork commented 9 years ago

@disposablecat I just thought--if it looks like a lot of work, don't worry about those 'creator:blah (artist)' tags--there are already tons of those in hydrus, so we'll have to fix it my end anyway, probably with siblings.

ghost commented 9 years ago

Okay, things that are problematic: Tags that have colons, but that are not intended to have namespaces. This tends to come up in the context of, like, the legend of zelda:windwaker (which should be a series but isn't), or ärtist (again, WTF, man, WTF.), or edward:elric (is he a Freeman On The Land now?) There appear to be about 82 instances of this in the e621 dataset.

EDIT: AND IN THREE SEPARATE CASES, http: SHOWS UP IN TAGS AARGH PEOPLE ARE WASTES OF OXYGEN BUT I NEED SOCIETY TO LIVE.

I fixed the misspellings. That was obvious.

Regarding the other stuff: I think based on some experimentation that Hydrus' behavior when downloading tags from e621 is to discard anything that looks like a namespace but is not in the Advanced Tag Options. I suspect your behavior with HTA DBs will be the same - if it isn't in the namespaces table, don't import it. In any event, I'd like this database to behave exactly like the downloader as much as possible, where that is in my control.

With that in mind, I'm manually excluding the problematic tags from my RebuildNamespaces command. As usual with edge cases, most of them are 1-count or thereabouts, so it isn't like it matters, but I wanted to fix it anyway.

ghost commented 9 years ago

@hydrusnetwork I already cleared the creator:blah (artist) stuff, as well as the same for series and character, so, it's actually easier to leave it in. Also, I'd prefer as much (uncontroversial) cleanup as possible to be encoded in the DB, for reasons of efficacious sharing.

With that in mind, I will have the namespaced DB up shortly, with tags_tag_index set to non-unique. I definitely don't trust my understanding enough to try to build the code snippet you mentioned, so if you can collapse the duplicates (after @marcules merges in rating:x, assuming still that he's OK doing so) and re-unique the index before uploading it to the "official collection of tagdbs" I would appreciate that.

ghost commented 9 years ago

And here is the namespaced DB: https://mega.co.nz/#!g88UAbbJ!t0Ed6C5nKM49YA7nLbZCk2PmJVkkNQHDf74EE5dhraM

To recap for anyone encountering this down the road: This is the e621.db posted on the Hydrus tumblr, plus series: creator: character: species: namespaces, with underscores replaced with spaces, and with a couple thousand duplicates in tags (resulting from cleanup as above). It is up to date with e621 as of 11-1-14.

ghost commented 9 years ago

It occurs to me this might be useful as well: here is namespaces_to_exclude for the RebuildNamespaces command on this e621 DB, which is necessary because of colons in tags as described above:

set(["gee", "ge", "lethal", "edward", "4", "<", "11", "fallout-equestria", "big", "fallout", "trouble", "sonic adventure 2", "gen", "aet", "specie", "barack the barbarian", "tribes", "[9", "ecco", "the legend of zelda", "lion (fo", "disaster", "d", "american dragon", "making a cat cry", "re", "3", "one piece", "lunar", "7", "berserk", "hero", "fallout equestria", "cop", "oc", "titanic", "mouse (fo", "atr", "aion", "legacy of kain ", "kami (m", "pokemon", "saybin", "ewyn", ">", "commissioner", "cgar", "copy", "wububububububububibubububibu ", "breath of fire", "12", "14", "16", "18", "sonic (series)character", "v", "mad max", "char", "chaw", "dead to rights", "csi", "author", "1", "5", "codename", "http", "medal of honor", "user", "rhapsody", "m", "hetalia", "coppy"])

hydrusnetwork commented 9 years ago

Thanks for the e621 db and the namespaces_to_exclude list. To collapse the dupe tags, I ended up adding a little method, IterateMappings, to my HTA class and did this:

old_hta = HTA.HTA( 'e621.db' )
new_hta = HTA.HTA( 'e621_new.db' )
new_hta.BeginBigJob()
for ( hash, tags ) in old_hta.IterateMappings():
  new_hta.SetMappings( hash, tags )
new_hta.CommitBigJob()
# rebuild namespaces here

Thus forming a new pristine db with clean ids everywhere. Everything seems to be working ok, so I'll upload these three dbs to my mediafire as I put my post together later today.

Thanks to both of you for your work here, it has really helped.

ghost commented 9 years ago

Ooh. That's such a deliciously elegant solution to the problem - me likey.

And yeah, I am obviously pretty highly motivated on this one - now that the possibility exists of something I've wanted for years (a comprehensively tagged local image db), I am willing to throw as much free time at it as necessary to get there, and I'm the kind of guy who never does anything manually that he can automate.

I'm glad I can help even without coder skills, and I appreciate the clarity of design that allows that, and your assistance - certainly you're the most responsive indie dev I've ever encountered. It helps that this is clearly a passion project for you - I don't doubt that having that kind of motivation leads to a better experience for your users.

hydrusnetwork commented 9 years ago

Together we can pierce the tagging heavens!

ghost commented 9 years ago

Reposting this here because this is semipermanent and centralized:

Here is a working Hydrus Tag Archive of Derpibooru up to 1-24 (id 814383):

https://mega.co.nz/#!BsMw0aBZ!2iFV3TAH_QYzg4fg1kg_aCCpwyp4eCrS5LMTi2cDTqM

This has had various cleanup performed on it to make it suitable for Hydrus. Specifically:

-everything that could reasonably be interpreted as "creator:" has been replaced with that

-Ratings have been namespaced

-the most popular species and character tags have been namespaced

-OC: has been replaced with character:

-all files have series:my little pony

-spoiler: is a namespace, if you find those tags as infuriating as I do, uncheck it in Hydrus when you add this.

This matched about 1/5 of my pony folder, probably due to resizes and people from /mlp/ and /pone/ not overlapping with the Derpibooru userbase.

The working scripts used to create it have been updated here:

https://gist.github.com/disposablecat/9fa1ee3d036d2e353dcc

All the cleanup above was done automatically via derpicleaner.py - no manual adjustment was used whatsoever - so if anyone has any objection to any of it they can comment out those lines and re-run the scripts. This will also make updating this relatively easy.

Mooninaut commented 9 years ago

I don't know if this is any help, since it seems like you've already downloaded the entire dataset, but you can search for e621 posts by MD5 using the api. You search for the tag "md5:xxxx". Example: https://e621.net/post/index.xml?tags=md5:50bac7d7e9749656c98485eed00a6e79.

You can also do range requests on post IDs, instead of grabbing them one ID at a time, using the syntax id:min..max. For example, https://e621.net/post/index.xml?tags=id:200..299%20order:id&limit=100. The default limit on results per request seems to be 80, and the maximum they allow at present is 100. The default ordering is by id, descending, so I added the order:id meta-tag to change it to ascending.