kent-lee / deviantart-scraper

personal project for downloading artworks from DeviantArt

added support for downloading from collections (folders, favourites, and galleries) #5

Open shemetz opened 4 years ago

kent-lee commented 4 years ago

@itamarcu thank you for your contribution; I am very glad that you took the time to improve the functionality of this program :)

I have tested the code and it is working as expected, but there are a few minor problems:

Because of the issues above, I will change some of your code after merging the pull request. Please let me know if this is okay with you. Also, for the last point, I have written up some design choices; please have a look at the design doc for more details. I am most likely going to implement version 2, but I am open to any suggestions.

Thank you.

chrsmlls333 commented 4 years ago

Regarding your design doc, I agree that version 2 is the best approach, but perhaps favourites collections could go in a subfolder separate from the gallery files, under each User! They too have a kind of "All" folder which could interfere with the "All" gallery. This also allows for distinct separation of favs and works.

kent-lee commented 4 years ago

If you like distinction between favs and works, then version 3 is actually better. The main reasons I am leaning towards version 2 are:

  1. it requires minimal changes to the current code, so it would be faster to implement
  2. I assume that people want to download all artworks from a given user, so there is no point in providing options to choose other gallery folders as you are downloading everything

I don't know if this assumption holds for the majority of people, so in versions 3 and 4 I provide the option to select which gallery folders to download. This has the best output file structure in my opinion, but there are two problems:

  1. as pointed out in the doc, it would require a lot more work than version 2
  2. suppose you want to download the artworks in gallery folders A and B from a user: some of the images in folder A may also exist in folder B and vice versa, meaning you would download duplicate files. This may not be desirable, and it is especially bad if one of the gallery folders is All, because then you would duplicate most, if not all, of the files in every other gallery folder
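
One possible way around the duplicate problem in point 2 is to remember which artworks have already been scheduled and skip them when they show up again in another folder. The sketch below assumes every artwork has a stable unique ID; the function name `plan_downloads` and the sample data are hypothetical and not part of the current program.

```python
from typing import Dict, List, Tuple


def plan_downloads(folders: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (folder, artwork_id) pairs, keeping each artwork only once.

    An artwork is assigned to the first folder in which it appears;
    later occurrences in other folders are skipped.
    """
    seen = set()
    plan = []
    for folder, artwork_ids in folders.items():
        for artwork_id in artwork_ids:
            if artwork_id in seen:
                continue  # already scheduled from an earlier folder
            seen.add(artwork_id)
            plan.append((folder, artwork_id))
    return plan


if __name__ == "__main__":
    # Hypothetical data: artwork "b" is filed under both folders.
    print(plan_downloads({"A": ["a", "b"], "B": ["b", "c"]}))
```

With this approach the order of the folders matters: if All is processed first, every other folder ends up empty, so a real implementation would probably want to process All last or treat it separately.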

shemetz commented 4 years ago

I am okay with any changes to my code - this is open source after all :)

The use case that the collection-downloading feature is trying to solve is, well, downloading specific collections. Many users (like me) just want to download a specific collection of images - usually either their favorite images or a list of images that fit into a certain theme.

People who are using this feature will probably not want to get extra artworks that they didn't ask for (would only make the process longer and require more storage space). For example, this user has a collection called "Landscapes" with about 40 pictures in it, but the user has many many other collections, so their "all" folder has nearly 10000 pictures!

Therefore, whatever approach you pick, it is quite important that you allow downloading specific collections without downloading all of the artworks/collections of a particular user.

I slightly prefer versions 1 and 2, but have no strong preference. There is, however, an extra option: you could make it like version 1, except that the collection names are prefixed by the username, for example "souveraines - Landscapes". It's not a huge difference from just having a "Landscapes" folder within a "souveraines" folder, though.

chrsmlls333 commented 4 years ago

The main downside of version 2 is the lack of labeling in the filesystem; in that respect version 3 is way better, you're right. Based on @itamarcu's comment, I don't think you can assume everyone wants to download all, all the time, so perhaps some folder handling and sorting in the filesystem is necessary.

I would recommend you don't overwhelm yourself by parsing all separate galleries by default. Like you say, handling duplicates and sorting becomes complex. If nothing is specified in the config or on the command line, the files could be saved to where they are now, or to User/Gallery-All/file.jpg? This may allow you to keep your current functionality as the default. Otherwise you risk a user dumping multiple galleries into the root folder for that user and everything getting mixed up.

Regarding config, perhaps this is a good structure?

{
    "save_directory": "D:\\Pictures\\deviantart",
    "users": [
        {
            "GUWEIZ": {
                "galleries": [ "Landscapes" ],
                "collections": [ "All" ]
            }
        },
        "wataboku"
    ]
}

Then wataboku downloads all galleries and no collections by default, the behaviour you already have, and the other is self explanatory. You could check to see if each user in the list is a simple string or object, so they can be input simply or with more definition.
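
The string-or-object check could look roughly like the sketch below. It assumes a detailed entry is written as a one-key JSON object (`{"username": {...}}`) inside the list, and that a missing `galleries` or `collections` key means an empty selection. The names `normalize_users` and `DEFAULT_SELECTION` are hypothetical.

```python
import json
from typing import Any, Dict, List

# Assumed default for a bare username: all galleries, no collections.
DEFAULT_SELECTION = {"galleries": ["All"], "collections": []}


def normalize_users(entries: List[Any]) -> Dict[str, Dict[str, List[str]]]:
    """Turn a mixed list of strings and objects into one uniform mapping."""
    users = {}
    for entry in entries:
        if isinstance(entry, str):
            # Simple form: just a username, so fall back to the defaults.
            users[entry] = {k: list(v) for k, v in DEFAULT_SELECTION.items()}
        elif isinstance(entry, dict):
            # Detailed form: {"username": {"galleries": [...], "collections": [...]}}
            for name, selection in entry.items():
                users[name] = {
                    "galleries": list(selection.get("galleries", [])),
                    "collections": list(selection.get("collections", [])),
                }
        else:
            raise TypeError(f"unsupported user entry: {entry!r}")
    return users


if __name__ == "__main__":
    config = json.loads(
        '{"users": [{"GUWEIZ": {"galleries": ["Landscapes"],'
        ' "collections": ["All"]}}, "wataboku"]}'
    )
    print(normalize_users(config["users"]))
```

After normalization the rest of the program only has to deal with one shape, regardless of how the user wrote the entry.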

kent-lee commented 4 years ago

Sorry if I wasn't clear, but when I talked about collection folders and gallery folders, those two mean different things.

Collection folders are the folders under the FAVOURITES tab on the website. I think it makes sense to download user-specified folders and not all collection folders; hence in all versions in the design doc, the collection folders have names like collection A, collection B, etc., indicating that they are specific collection folders provided by the users. For example:

save directory
├── souveraines
│   ├── Landscapes
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   ├── image3.jpg
│   │   ...
│   ├── Characters
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   ├── image3.jpg
│   │   ...
│   ...
...

Gallery folders are the folders under the GALLERY tab on the website. The original program is set to download the gallery folder All by default; hence in versions 1 and 2 of the design doc, there are no folders like gallery A or gallery B in the user A folder, because there is no option to download other specific gallery folders.

save directory
├── souveraines
│   ├── Landscapes
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   ├── image3.jpg
│   │   ...
│   ├── Characters
│   │   ├── image1.jpg
│   │   ├── image2.jpg
│   │   ├── image3.jpg
│   │   ...
│   ├── image1.jpg ──┐
│   ├── image2.jpg ──┼── # these are artworks in gallery all folder
│   ├── image3.jpg ──┘
│   ...
...

So my question was whether I should allow users to download specific gallery folders, as it adds more complexity to the program and has the duplicate file problem I mentioned before.

As for the potential file structure for version 3 suggested by @chrsmlls333, it looks good to me, but I would probably keep all users consistent like so:

{
    "save_directory": "D:\\Pictures\\deviantart",
    "users": {
        "souveraines": {
             "galleries": [ ],
             "collections": [ "Landscapes" ]
        },
        "wataboku": {
             "galleries": [ "All" ],
             "collections": [ ]
        }
    }
}

However, with this approach, I am having some difficulty deciding on the command line input. For example, what should be the input for adding a user? Something like these?

python main.py artwork -a wataboku-All
python main.py collection -a souveraines-Landscapes

What if the folder name contains spaces or dashes? Is the above proposed json file structure too inconvenient to edit manually?

shemetz commented 4 years ago

Can't users wrap arguments with quotation marks to handle spaces?
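
Quoting is handled by the shell before the program sees its arguments, so a folder name with spaces arrives as a single argument. A minimal sketch of that behaviour, using `shlex.split` to imitate POSIX shell word splitting and a hypothetical argparse parser standing in for the real `main.py` (the actual option definitions may differ, and the folder name is made up):

```python
import argparse
import shlex

# Hypothetical stand-in for the real command line interface.
parser = argparse.ArgumentParser(prog="main.py")
parser.add_argument("command", choices=["artwork", "collection"])
parser.add_argument("-a", dest="add", help="user and folder to add")

# shlex.split splits a command line the way a POSIX shell would,
# so the quoted part stays together as one argument.
argv = shlex.split('collection -a "souveraines-Dark Landscapes"')
args = parser.parse_args(argv)

print(argv)      # ['collection', '-a', 'souveraines-Dark Landscapes']
print(args.add)  # souveraines-Dark Landscapes
```

Quoting solves the spaces, but not the dashes: the program still has to decide where the username ends and the folder name begins.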

chrsmlls333 commented 4 years ago

I think the key to convenience is having fallback behavior. So accept python main.py collection -a souveraines:Landscapes and tokenize on colons (or another character not allowed in DeviantArt usernames), or accept python main.py collection -a souveraines, which is equivalent to "All".
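
As a rough sketch of that fallback, assuming the colon is the separator and "All" is the default folder (the function name `parse_target` is hypothetical):

```python
from typing import Tuple


def parse_target(value: str, default_folder: str = "All") -> Tuple[str, str]:
    """Split 'user:folder' into its parts, falling back to a default folder.

    Only the first colon is treated as the separator, so dashes, spaces
    and further colons in the folder name are kept as they are.
    """
    user, _, folder = value.partition(":")
    if not user:
        raise ValueError(f"missing username in {value!r}")
    if not folder:
        folder = default_folder  # bare username, or nothing after the colon
    return user, folder


if __name__ == "__main__":
    print(parse_target("souveraines:Landscapes"))
    print(parse_target("souveraines"))
```

Because only the first colon counts, a value like souveraines:Sci-Fi Landscapes needs quoting in the shell for the space, but the dash no longer causes any ambiguity.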

I understand being wary of too much functionality, but accepting individual collections and not individual galleries seems very counter-intuitive. And if you are worried about file duplication, the solution is straightforward: keep "All" in the user root or its own subfolder, and then make subfolders for each gallery. So use version 2 when galleries are not specified and version 3 when they are. This seems the most similar to other scraper tools like dagr.py that have fallen by the wayside recently.
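
The folder rule could be sketched as below. This takes the "user root" option for "All" and treats an unspecified gallery the same as "All"; the function name `target_dir` and the example paths are hypothetical, and `PurePosixPath` is used only so the output looks the same on every platform.

```python
from pathlib import PurePosixPath
from typing import Optional


def target_dir(save_dir: str, user: str, gallery: Optional[str] = None) -> PurePosixPath:
    """Pick the output folder for a gallery download.

    No gallery (or "All") keeps the files in the user's root folder;
    any other gallery gets its own subfolder under that user.
    """
    base = PurePosixPath(save_dir) / user
    if gallery is None or gallery == "All":
        return base
    return base / gallery


if __name__ == "__main__":
    print(target_dir("downloads", "wataboku"))
    print(target_dir("downloads", "wataboku", "Landscapes"))
```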

The JSON you specify (more consistent) seems very logical and well arranged.