lurkbbs / e621dl

The automated download script for e621.net. Originally by @wwyaiykycnf.

Download once and never apply again #26

Closed: Tearlow closed this issue 4 years ago

Tearlow commented 4 years ago

I never got around to trying to set up any e621dl fork past the original, but I never really needed to, as it was always working until recently. Now I cannot seem to wrap my head around this one, to be honest. Perhaps it isn't even possible with the current solution, but I'll still ask.

The original created a cache database of md5 hashes. When it runs, it reads last_run from the config and then queries e621 for entries matching the tags. For each result it checks the cache database: does the hash already exist in there? If not, it downloads the file and adds an entry to the database. If it does exist, the file is ignored entirely.
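For reference, a minimal sketch of that flow; the names, the post fields, and the SQLite layout here are assumptions for illustration, not the original's actual code:

import os
import sqlite3
import urllib.request

def download_new_posts(posts, db_path="cache.db", out_dir="downloads"):
    """Download each post only if its md5 hash is not yet in the cache database."""
    os.makedirs(out_dir, exist_ok=True)
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS seen (md5 TEXT PRIMARY KEY)")
    for post in posts:  # assumed shape: each post is a dict with 'md5', 'file_url', 'file_ext'
        if con.execute("SELECT 1 FROM seen WHERE md5 = ?", (post["md5"],)).fetchone():
            continue  # already downloaded at some point, even if the local copy was deleted since
        data = urllib.request.urlopen(post["file_url"]).read()
        with open(os.path.join(out_dir, f"{post['md5']}.{post['file_ext']}"), "wb") as f:
            f.write(data)
        con.execute("INSERT INTO seen (md5) VALUES (?)", (post["md5"],))
        con.commit()
    con.close()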

All of this means you could go through the folders, save/sort what you want, and eventually delete the folder, and the next run would not re-add the files you had already gone through. That is exactly what I am after: I just want each file to be downloaded once. The next time I run e621dl, it should download anything new since the last run, and if I have deleted the local copy it should not add it again.

Is this at all possible? If not, might there be any alternatives?

lurkbbs commented 4 years ago

Sorry, there is no way to do that for now, and I'm sure there won't be within the next week at least (Edit: disregard that). There is the ability to block a post entirely from all folders, but not from a particular one, and I don't think that's what you need anyway. Also, there may be a bug where the whole blocked-post list just vanishes. I've never really used it myself.

I thought about this but ultimately scrapped the idea as too hard to implement at the time. It should be easier now.

So, correct me if this hypothetical scenario is wrong:

  1. A setting for a folder to not redownload (or copy from another folder) deleted files.
  2. A setting to reset the DB if tags are changed.
  3. Each tracked folder contains a DB with the posts that were downloaded there and the filter sequence for the folder. A sequence, because subfolders are checked one by one. The DB is updated for all folders when e621dl exits, but a power outage should not result in loss of previous data.
  4. The folder name should not matter as long as the filter sequence is the same.

Now, about that filter sequence. Since there are subfolders, each subfolder has the search parameters of all its parent folders applied recursively before its own parameters. It feels like it would be a bit of a pain to track.
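A rough sketch of how that per-folder bookkeeping could look; the folder structure, file name, and function names below are assumptions, not the fork's actual code:

import json
import os

def filter_sequence(folder):
    """Collect search parameters from the root folder down to this one."""
    chain = []
    node = folder
    while node is not None:
        chain.append(node["tags"])     # assumed: each folder dict has 'tags' and an optional 'parent'
        node = node.get("parent")
    return list(reversed(chain))       # parents first, the folder itself last

def load_folder_db(path):
    """Per-folder DB: the posts downloaded there plus the filter sequence used."""
    db_file = os.path.join(path, ".e621dl_folder.json")   # hypothetical file name
    if os.path.exists(db_file):
        with open(db_file) as f:
            return json.load(f)
    return {"filters": None, "downloaded": []}

def needs_reset(db, folder):
    """Point 2 above: drop the old list if the effective tags changed."""
    return db["filters"] is not None and db["filters"] != filter_sequence(folder)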

I will do a test version by Tuesday, and maybe by Friday it will be bug-free. Possibly a lot faster than that.

P.S. I never knew the original had that ability; I didn't see it in Wulfre's fork.

Tearlow commented 4 years ago

I'm not really sure what you're asking. The way I'd go about it would be much simpler.

While looking up a database is a neat feature, it isn't really that useful in this case. I'd simply want it to not re-apply the previous files. The easiest way of doing that is to change the way files are processed after download: have a setting called "bOnlyProcessDownloaded" or something like it, and when enabled it would only process the newly downloaded files.

That way you could disable the setting if you ever want to apply everything back again (by running e621dl again). That said, I have not looked over the source code, so I may very well be utterly wrong about it being fairly trivial, chuckles.

EDIT: I am so sorry! I completely forgot this isn't the original. I've mixed up projects, and the closest fork I see was from about 6 years ago. You can see it here: https://github.com/wwyaiykycnf/e621dl but the gist of it still applies.

lurkbbs commented 4 years ago

utterly wrong about it being fairly trivial, chuckles

Yeah, Wulfre's version should be simple that way. After lots and lots of tiny improvements, like asynchronous API and file downloads, hierarchical filtering, removing files that no longer meet the search criteria (e.g. a tag was added/removed), and the ability to close and continue later, every new option comes with pain.

I'm not really sure what you're asking. The way I'd go about it would be much simpler.

And that's why I'm asking in the first place. I've figured out how to make this "bOnlyProcessDownloaded" option per folder, and how to reset the list if tags are changed (another option). But before all that, I have to make sure whether it should be that generic.

I'd simply want it to not re-apply the previous files

Luckily, one of those tiny improvements is a DB of all downloaded files; it just gets deleted after every complete run (that is, one not interrupted by closing the app mid-download). A simple switch to not delete it, plus an additional filter to not download anything already in the database, should do the trick.
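In other words, something along these lines; the file name and post shape are illustrative assumptions, not the actual implementation:

import json
import os

DB_PATH = "downloaded.json"   # hypothetical name for the persisted download DB

def load_downloaded_ids():
    if os.path.exists(DB_PATH):
        with open(DB_PATH) as f:
            return set(json.load(f))
    return set()

def filter_new_posts(posts, keep_db=True):
    """With the switch on, the DB survives a complete run, so anything already
    recorded in it is skipped even if the local copy has since been deleted."""
    seen = load_downloaded_ids() if keep_db else set()
    return [post for post in posts if post["id"] not in seen]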

I thought I would get everything done yesterday (the per-folder switch version), but wild work-related issues appeared, with a deadline this Friday. So I'm not sure when the option will be done, most likely this Sunday (GMT+3), and as a much simpler per-config-folder version.

Note: this fork iterates over all ini files in the config folder. That can be useful if you want to download new files, then restore posts deleted on e621 from the local cache in offline mode, and then prune all now-irrelevant files. I'll assume that all configs should have "bOnlyProcessDownloaded = true"; otherwise there will be undefined behavior.

lurkbbs commented 4 years ago

OK, this should work: https://github.com/lurkbbs/e621dl/tree/no_redownload

Just add

[Settings]
<...>
no_redownload = true

Tell me if there are bugs somewhere.

lurkbbs commented 4 years ago

OK, change of phrase: tell me if there are bugs, tell me if it all works, tell me if you found someone else's script, anything, or at least close the issue. Seriously, believe it or not, I added this only because you asked. I want to be sure this works, or at least to debug it and then make sure it works.

Tearlow commented 4 years ago

I'm afraid I simply haven't had the time lately to actually sit down and do much of anything. I had hoped to try things out last weekend but ended up working instead.

So, please: I will get back to you on this once I can actually sit down and try it out. But for what it's worth, thank you for taking the time to do this.

Tearlow commented 4 years ago

Slight update:

I've had some spare time to sit down and tinker a bit, and from what I've seen it works just as expected. The first run downloads anything new, and deleting a few of the files does not repopulate the folders on the second execution. Something that confused me though: some files were copied, but from where? This was a fresh folder... symlinks from somewhere else?

Unrelated to this, but just an oddity I noticed: when you set up custom formatting for filenames, it always appends {id}.{file_ext} to the names. For example, having only {md5} results in {md5}.{id}.{file_ext}. In my example I do have some understanding as to why, md5 isn't exactly collision resistant after all. Still, should it not be up to the user to append the file extension and/or id? Anyway, it's just an oddity I noticed :)

lurkbbs commented 4 years ago

Sorry for the delay, especially after my rant.

So, please: I will get back to you on this once I can actually sit down and try it out.

Thanks. Sorry, work and viruses. (Off-topic: my government just made the same mistake Italy did and declared next week a holiday, but without a quarantine. And just today there was a great piece of news: summer resort hotels are booked 30% more for the next week. I have no words, only emotions.) Also, yes, even "I'll try next month" is better than total silence.

Some files were copied, but from where? This was a fresh folder... symlinks from somewhere else?

I suppose the same files appear in different subfolders. Also, those are hardlinks, not symlinks; they don't need admin privileges, on Win10 at least.
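For what it's worth, a hardlink can be created from plain Python without elevation (the paths below are placeholders, not anything e621dl actually writes):

import os

# If the same post lands in two subfolders, the second copy can be a hardlink
# to the first: one file on disk, two directory entries, no admin rights needed.
os.link("folder_a/12345.jpg", "folder_b/12345.jpg")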

it always appends {id}.{file_ext}

This is how the dict {id: path} is made; there is no separate DB for the files. And I always just wanted to add features with as little pain as possible. Wulfre's fork can only save as {id}.{ext} or {id}.{md5}.{ext}, so a dict of all files and ids was easy. Then at some point I wanted to sort by artist without the DB hassle, thus the mandatory {id}.{ext} suffix.
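In other words, the post id can be recovered from the filename itself, so no separate database is needed. A rough illustration of the idea (not the fork's actual code):

import os

def build_id_to_path(root):
    """Map post id -> file path by parsing filenames that end in .{id}.{ext}."""
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, _ext = os.path.splitext(name)   # strip ".{file_ext}"
            post_id = stem.rsplit(".", 1)[-1]     # the "{id}" part right before it
            if post_id.isdigit():
                mapping[int(post_id)] = os.path.join(dirpath, name)
    return mapping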

At that point it was mostly a private fork. Well, it still is, and honestly, almost none of the new features are used by most users, be it Cloudflare support or conditions (looks like nobody understands how to use them). I know at least one person uses asterisk subfolders, as well as the cache of all downloaded files and the DB to fully restore folders offline.

Still, should it not be up to the user to append the file extension and/or id?

It probably should, except for the extension, but that would lead to collisions, which, yes, I would then need to handle with a DB and some naming strategy, and that would lead to bugs aeterna, so no, thank you :-D

Tearlow commented 4 years ago

I know the feel all too well.

After a few more runs (not related specifically to this special release), I noticed that if a downloaded file fits under multiple tags, it gets mirrored in all related tag folders. Is it possible to prevent duplicates?

Other than that I have not seen any errors whatsoever. So as of right now I'd say it is rock solid, albeit with admittedly very limited testing.

lurkbbs commented 4 years ago

After a few more runs (not related specifically to this special release), I noticed that if a downloaded file fits under multiple tags, it gets mirrored in all related tag folders. Is it possible to prevent duplicates?

That's by design, actually. If a post fits into multiple folders, it should be in all of those folders and not just one selected folder. There is no way at the moment to sort posts into only the first folder they fit, though you can sort them hierarchically, e.g.

--most generic
|
|--- less generic
     |
     |-- least generic

or you can make the criteria mutually exclusive, e.g.

[folder1]
tags = tag1 tag2

[folder2]
tags = tag1 -tag2

I won't merge it into master for now, because Cloudflare switched from reCAPTCHA to hCaptcha. And since my whole country's IP range (or maybe just my city, IDK) seems to be blocked, and there is no support for <noscript> in hCaptcha, presently I have no means to use e621dl in any way except full offline mode. Not sure how to bypass that. I could find a good proxy or just get a VPS to make one, but I just don't want to, seeing as the site itself is accessible without it. I'll experiment next week with cookie copying or outright Chrome with mitmproxy.

lurkbbs commented 4 years ago

Yay, new release, closing this issue.