1, host_id_rawfilename - can it be changed to host_id_tags? Because i don't see an option in the config file, and filenames are already limited to 255 chars? ... 3, can it have a filename format like the software below? https://github.com/Nandaka/DanbooruDownloader
You can configure the output filename and directory with the extractor.filename and extractor.directory options. To change the filename format for sankaku to "host_id_tags", you would put something like this in your config file:
{
"extractor": {
"sankaku": {
"filename": "{category}_{id}_{tags}.{extension}"
}
}
}
2, does it have a link history to avoid duplicate downloads, like ripme?
gallery-dl skips downloads for files that already exist, and there is also the archive option (also available with the --download-archive command-line switch).
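For reference, a minimal config sketch with the archive option enabled (the file path here is just an example; any writable location works):
{
    "extractor": {
        "sankaku": {
            "archive": "./gallery-dl/archive-sankaku.sqlite3"
        }
    }
}
The same effect on the command line would be gallery-dl --download-archive ./gallery-dl/archive-sankaku.sqlite3 URL.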
oh thanks, i will try that :)
update: with this in config.json
"sankaku": {
    "username": null,
    "password": null,
    "wait-min": 2.5,
    "wait-max": 5.0,
    "filename": "{category}_{id}_{tags}.{extension}"
},
it throws Errno 22 Invalid argument
The config snippet you posted looks fine and should work.
Could you post the whole output when you run gallery-dl with the --verbose option? It would be helpful to know where exactly this exception occurs.
I:\DOWNLOADS\Command tools>gallery-dl https://chan.sankakucomplex.com/?tags=chan_co --verbose
[gallery-dl][debug] Version 1.4.2
[gallery-dl][debug] Python 3.4.4 - Windows-10-10.0.17134
[gallery-dl][debug] requests 2.19.1 - urllib3 1.23
[gallery-dl][debug] Starting DownloadJob for 'https://chan.sankakucomplex.com/?tags=chan_co'
[sankaku][debug] Using SankakuTagExtractor for 'https://chan.sankakucomplex.com/?tags=chan_co'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): chan.sankakucomplex.com:443
[urllib3.connectionpool][debug] https://chan.sankakucomplex.com:443 "GET /?tags=chan_co&page=1 HTTP/1.1" 200 None
[urllib3.connectionpool][debug] https://chan.sankakucomplex.com:443 "GET /post/show/7024858 HTTP/1.1" 200 None
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): cs.sankakucomplex.com:443
[urllib3.connectionpool][debug] https://cs.sankakucomplex.com:443 "GET /data/64/bf/64bf0aa8829e737468e9a0a229ad0166.jpg?e=1531388877&m=Y6qa7KMsjcFbb6NBDTI6pQ HTTP/1.1" 200 695172
.\gallery-dl\Chan.Sankaku\chan_co\7024858_2018-07-11 03... hair, white bikini, white gloves, white swimsuit, wink.jpg
[sankaku][error] Unable to download data: [Errno 22] Invalid argument: '\\\\?\\I:\\DOWNLOADS\\Command tools\\gallery-dl\\Chan.Sankaku\\chan_co\\7024858_2018-07-11 03_45_fate (series), fate_grand order, bb (fate), chan co, simple background, 1_1 aspect ratio, 1girl, asymmetrical hair, bangs, bikini, black choker, breasts, choker, clavicle, cleavage, ;d, eyebrows visible through hair, female, front-tie bikini, front-tie top, gloves, hair ornament, hair ribbon, hand on hip, hand up, large breasts, long hair, long sleeves, looking at viewer, megane, navel, one eye closed, open mouth, pointer, ponytail, purple eyes, purple hair, red ribbon, ribbon, rimless eyewear, side ponytail, side-tie bikini, smile, solo, star, swimsuit, tied hair, very long hair, white bikini, white gloves, white swimsuit, wink.jpg.part'
My new config
"filename": "{id}_{created_at}_{tags}.{extension}",
"directory":["Chan.Sankaku","{search_tags}"],
"archive": "./gallery-dl/archive-chan.sankaku.sqlite3"
OK, that filename is way too long (670 characters) and there is currently, as also noted in #92, no way to prevent that.
I guess over-long filenames could just be cut short to fit into the 255 character limit, but a more configurable approach (like string slicing for format string replacement fields) would be nice as well. I'll think of something ...
And, by the way: Python, at least on Linux, recognizes long filenames: OSError: [Errno 36] File name too long, so I wasn't quite sure how this error came to be. But on Windows you either get [Errno 2] No such file or directory or [Errno 22] Invalid argument.
that's what i was thinking... the filename is too long because we can't limit how many tags get added to the filename... anyway, thanks :)
and can it support these formats too?
- %provider% = provider Name
- %id% = Image ID
- %tags% = Image Tags
- %rating% = Image Rating
- %md5% = MD5 Hash
- %artist% = Artist Tag
- %copyright% = Copyright Tag
- %character% = Character Tag
- %circle% = Circle Tag, yande.re extension
- %faults% = Faults Tag, yande.re extension
- %originalFilename% = Original Filename
- %searchtag% = Search tag
All of these fields are already available, but under different names:
%provider% -> {category}
%id% -> {id}
%tags% -> {tags} (or {tag_string} on danbooru)
%rating% -> {score}
%originalFilename% -> {name}.{extension}
and so on. The exact names depend on the booru board in question, as gallery-dl is just using the API responses without much modification. Take a look at the output with -K to get a complete list of replacement field names.
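For example, to inspect the fields for the URL from this thread, something like:
gallery-dl -K "https://chan.sankakucomplex.com/?tags=chan_co"
prints every replacement field name together with an example value for the first file.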
To enable {tags_artist}, {tags_character} and so on, you need to set extractor.*.tags to true.
so after you add an option to prevent long filenames, i just need to add tags: true in the sankaku extractor config to enable artist/character tags?
Can gallery-dl use a search_tags value like [tags]+date:<=yyyy.mm.dd? Because after 1000 results are downloaded you can't download any more, so you need to add +date:<=yyyy.mm.dd after the tag to download more than 1000 results. yyyy.mm.dd is created_at, i think.
compared with danbooru downloader and others, i think gallery-dl is better:
1 - low memory usage (i think because it uses a single thread instead of multi-threading)
2 - archive (skips already-downloaded ids) (ripme has it, but danbooru downloader and others don't)
3 - batch download from pastebin (ripme can rip from the clipboard, danbooru downloader doesn't have this)
so after you add an option to prevent long filenames, i just need to add tags: true in the sankaku extractor config to enable artist/character tags?
Yes, but it would be easier to enable this option for all boorus by just setting extractor.tags to true. Otherwise you would have to enable it for each site individually, i.e. extractor.sankaku.tags, extractor.gelbooru.tags, and so on.
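A sketch of the relevant config section in that case:
{
    "extractor": {
        "tags": true
    }
}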
Concerning filename lengths: you can now (since 8fe9056b16cbbb14b7e94fa92a8c8369cee654a9) slice values in format strings. {tags[:200]} would limit it to 200 characters max - everything after that will be cut off.
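Applied to a filename format, that could look like this (the limit of 200 is arbitrary):
"filename": "{id}_{tags[:200]}.{extension}"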
Can gallery-dl use a search_tags value like [tags]+date:<=yyyy.mm.dd? Because after 1000 results are downloaded you can't download any more, so you need to add +date:<=yyyy.mm.dd after the tag to download more than 1000 results. yyyy.mm.dd is created_at, i think.
It can, but that's not necessary if you want to go past 1000 results / page 50. You don't even need to provide username and password if you want to go past page 25. Being logged in only lets you use more than 5 tags at once and allows you to jump to higher page numbers faster (with --range 800-, for example).
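With the URL from earlier in this thread, that would be something like:
gallery-dl --range 800- "https://chan.sankakucomplex.com/?tags=chan_co"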
[danbooru][error] An unexpected error occurred: AttributeError - 'list' object has no attribute 'startswith'.
Edit: this is my first post here. am i doing it right?
You should open a new issue, post the URL in question and, if possible, the complete error output with --verbose.
ok i will test it soon :)
Yes, but it would be easier to enable this option for all boorus by just setting extractor.tags to true. Otherwise you would have to enable it for each site individually, i.e. extractor.sankaku.tags, extractor.gelbooru.tags, and so on.
Concerning filename lengths: you can now (since 8fe9056) slice values in format strings. {tags[:200]} would limit it to 200 characters max - everything after that will be cut off.
"sankaku":
{
"username": null,
"password": null,
"wait-min": 2.5,
"wait-max": 5.0,
"filename": "{tags_artist}_{tags[:200]}_{id}_{created_at}_.{extension}",
"directory":["Chan.Sankaku","{search_tags}"],
"tags": true
},
[sankaku][error] Applying filename format string failed: TypeError: string indices must be integers
Even when i don't set {tags}, gallery-dl still only sets the filename as {id}_{created_at}.{extension} instead of {tags_artist}_{id}_{created_at}.{extension}
It can, but that's not necessary if you want to go past 1000 results / page 50. You don't even need to provide username and password if you want to go past page 25. Being logged in only lets you use more than 5 tags at once and allows you to jump to higher page numbers faster (with --range 800-, for example)
so if i input tags that have more than 1000 results, it will keep downloading until there is nothing left to download?
[sankaku][error] Applying filename format string failed: TypeError: string indices must be integers
Even when i don't set {tags}, gallery-dl still only sets the filename as {id}_{created_at}.{extension} instead of {tags_artist}_{id}_{created_at}.{extension}
You are using version 1.4.2 and not the latest git snapshot. The {tags[:200]} thing and the tags option for sankaku haven't been "officially" released yet. Do a pip install --upgrade https://github.com/mikf/gallery-dl/archive/master.zip and try again.
so if i input tags that have more than 1000 results, it will keep downloading until there is nothing left to download?
Yes, it only stops after downloading all search results, but you can set a custom upper limit with, again, the --range option.
oh nice ty, installed the python version and it worked :)
i just tested hosting a local file and using r:link to batch download, wow it works too :)
hosting a local file and using r:link to batch download
-i, --input-file FILE Download URLs found in FILE
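A hypothetical usage example, assuming urls.txt contains one URL per line:
gallery-dl -i urls.txt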
And to quote myself from the other issue:
You can now use the L format specifier to set a replacement if the format field value is too long. For example {tags:L100/too many tags/} (https://github.com/mikf/gallery-dl/commit/e0dd8dff5f626a42678a916780b31f0193aef7ca).
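Applied to a filename format, that might look like this (the limit of 200 and the replacement text are arbitrary):
"filename": "{id}_{tags:L200/too_many_tags/}.{extension}"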
thanks, so i need to update gallery-dl again?
Only if you want to use the L format specifier feature.
oh, today it stopped working after 3 hours... no error, it just stopped downloading. The command window is still processing, but it hasn't downloaded any new link in 3 hours (checked the website, still no error)
and with the archive option in the sankaku extractor, why does it feel so slow to check already-downloaded links? wait-min/max is 2/5, but sometimes it waits 8-10 or even 20+ seconds just to check files
oh, today it stopped working after 3 hours... no error, it just stopped downloading.
Hmm, there is a slim possibility that an HTTP request "gets stuck" and the client waits forever for a reply from the remote server. Some HTTP requests sent by gallery-dl - for some reason - don't have a timeout, so it probably happened with one of those. Fixing this should be easy. In the meantime: Ctrl+C and try again.
why does it feel so slow to check already-downloaded links
Because it has to get the download URL and metadata before it can check if a file has already been downloaded (same as youtube-dl). It doesn't help that Sankaku is incredibly slow itself, so you have to wait 2-5 seconds before each HTTP request (to avoid 429 Too Many Requests errors) and then you have to wait for the request itself to finish, which might take another 5 seconds.
When downloading sankaku stuff, you should really use the --range command-line option when necessary, as it allows the extractor to quickly jump ahead. gallery-dl --range 250- URL... is going to immediately jump to image nr. 250 and start from there.
yeah... it's easy to fix with the --range you told me about
Being logged in only lets you use more than 5 tags at once and allows you to jump to higher page numbers faster (with --range 800-, for example)
5 tags at once, you mean 5 tags combined: ?tags=dynasty_warriors brown_hair china_dress female shoes, right?
Because it has to get the download URL and metadata before it can check if a file has already been downloaded (same as youtube-dl). It doesn't help that Sankaku is incredibly slow itself, so you have to wait 2-5 seconds before each HTTP request (to avoid 429 Too Many Requests errors) and then you have to wait for the request itself to finish, which might take another 5 seconds.
sometimes it waits 15-20 seconds, is that normal?
When downloading sankaku stuff, you should really use the --range command-line option when necessary, as it allows the extractor to quickly jump ahead. gallery-dl --range 250- URL... is going to immediately jump to image nr. 250 and start from there.
so i need to count the downloaded files and compare with the tag's total result count to know exactly what range i need to put in, right?
It should have a feature to skip a tag once it reaches already-downloaded files (so it only downloads newer pictures and stops once it reaches downloaded files, if the extractor archive option is enabled)
yeah... it's easy to fix with the --range you told me about
That is not what I meant. I wanted to say "It's easy for me to add a timeout to regular HTTP requests, so it doesn't get stuck anymore" -> https://github.com/mikf/gallery-dl/commit/68d6033a5d260dd3ea8823edede5d16d50e45aae
5 tags at once, you mean 5 tags combined: ?tags=dynasty_warriors brown_hair china_dress female shoes, right?
Right.
sometimes it waits 15-20 seconds, is that normal?
Not really, no. It might be the case that the wait-min/-max default values are too low and you get 429 Too Many Requests responses from sankaku. In that case gallery-dl retries the original request after waiting for a bit, but it can take quite a bit of time until sankaku sends a normal response.
You can enable verbose output (-v) to see what goes on behind the scenes. If you encounter anything 429 related, increase wait-min/-max until this doesn't happen anymore.
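For example (these values are only a guess; keep raising them until the 429 responses disappear):
{
    "extractor": {
        "sankaku": {
            "wait-min": 5.0,
            "wait-max": 10.0
        }
    }
}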
so i need to count the downloaded files and compare with the tag's total result count to know exactly what range i need to put in, right?
Your computer can count them for you: ... and you don't need the exact range, the start index is enough. --range 200-300 will download anything from 200 to 300, but you can omit the end index (--range 200-) to download from 200 to the end, or the start index to download up to 300 (--range -300).
It should have a feature to skip a tag once it reaches already-downloaded files (so it only downloads newer pictures and stops once it reaches downloaded files, if the extractor archive option is enabled)
--abort-on-skip Abort extractor run if a file download would
normally be skipped, i.e. if a file with the same
filename already exists
or the extractor.skip option
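A config sketch using the skip option for the same effect (assuming "abort" as the value, mirroring the --abort-on-skip switch):
{
    "extractor": {
        "sankaku": {
            "skip": "abort"
        }
    }
}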
thank you
Not going to happen.
You can download the original and then down-sample it yourself, or ignore it with --filter.
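An illustrative --filter call; the expression is ordinary Python, and the available field names (width, height, ...) vary per site, so check them with -K first:
gallery-dl --filter "width <= 1920 and height <= 1080" URL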
You should also open a new issue if you want to suggest a new feature. This one here is closed for a reason.
ok, thanks 👍