kent-lee / pixiv-scraper

personal project for downloading artworks from Pixiv

Challenge: Download user favorites (and create recommendations) #1

Open DonaldTsang opened 5 years ago

DonaldTsang commented 5 years ago

Same as https://github.com/Kent-Lee/deviantart-scraper/issues/2 but with a few key differences:

kent-lee commented 5 years ago

@DonaldTsang I have completed and uploaded implementations of (1), (2), and (3), please have a look at the newest commits and readme for details.

For (1), you can download all bookmarked artworks with the command python main.py bookmarks. You can also call api.user_bookmarks() in main.py to get a list of JSON objects for the artworks, each of which contains information such as artist_id and tag.

For (2), the original implementation already stores some basic metadata for every downloaded artwork, such as the artwork_id, title, and filename. You can view them by printing the results of save_users(), save_artwork(), save_artworks(), and save_bookmarks(). I don't write them to files for two reasons: (1) there is no need, and (2) I want to avoid I/O-bound tasks as much as possible because they greatly impact performance.

For (3), you can get recommended artists by calling recommend(). This function uses percentage to sort artists, as suggested, though I am having difficulties determining the threshold of the cutoff point. You mentioned that to recommend artist A from user U's bookmarks, U's bookmarks should contain more than X% of artworks from A. So, what should X be?
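To make the question concrete, here is a minimal sketch of the percentage idea. It is not the actual recommend() code; the artist_id key follows the JSON described above, while recommend_by_share, threshold, and the sample data are hypothetical names and values used only for illustration.

```python
from collections import Counter


def recommend_by_share(bookmarks, threshold):
    """Return (artist_id, share) pairs whose share of the bookmarks exceeds threshold.

    bookmarks: list of dicts, each with an "artist_id" key (assumed shape).
    threshold: the X from the discussion, given as a fraction between 0 and 1.
    """
    if not bookmarks:
        return []
    counts = Counter(art["artist_id"] for art in bookmarks)
    total = len(bookmarks)
    shares = [(artist, n / total) for artist, n in counts.items()]
    # Keep only artists above the cutoff, highest share first.
    picked = [(artist, share) for artist, share in shares if share > threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Made-up sample: artist 1 accounts for 3 of the 5 bookmarks.
sample = [
    {"artist_id": 1},
    {"artist_id": 1},
    {"artist_id": 1},
    {"artist_id": 2},
    {"artist_id": 3},
]
result = recommend_by_share(sample, threshold=0.25)
print(result)
```

With the cutoff left as a parameter, different values of X can be tried against the same bookmarks before settling on one.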

As for your questions, I think your suggested recommendation feature is quite useful in cases where the user's bookmarks are to your liking. I have thought of two approaches before for art discovery: (1) based on rankings, and (2) based on related work. (1) has the problem that popular != what I like. (2) has the problem that you can only find artists with a similar art style. So, I think your suggestion is better in terms of consistency and accuracy. However, this holds only if the user's bookmarks are good; if they are not, it may perform worse than the above methods.

And yes, I do have a Discord account; my DiscordTag is Bruce Lee#5354. Feel free to add me :)

DonaldTsang commented 5 years ago

@Kent-Lee thanks!

DonaldTsang commented 5 years ago

@Kent-Lee about (3) in the former, you might want to read the Penrose square root law article (https://en.wikipedia.org/wiki/Penrose_square_root_law). It states that for any given population X within a larger population A+B+C+...+W+X, its voting power, or worth, by percentage is sqrt(X)/(sqrt(A)+sqrt(B)+sqrt(C)+...+sqrt(W)+sqrt(X)).
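As a quick sketch of that formula applied to per-artist bookmark counts (sqrt_weights and the sample counts are made up for illustration, not part of the scraper):

```python
import math


def sqrt_weights(counts):
    """Map each key to sqrt(count) / (sum of sqrt(count) over all keys)."""
    roots = {key: math.sqrt(n) for key, n in counts.items()}
    total = sum(roots.values())
    if total == 0:
        # No bookmarks at all: nothing to weigh.
        return {key: 0.0 for key in counts}
    return {key: root / total for key, root in roots.items()}


# Artist A has 16 bookmarked artworks, artist B has 4.
weights = sqrt_weights({"A": 16, "B": 4})
print(weights)
```

Here A's plain share would be 16/20 = 80%, while the square-root weighting gives 4/(4+2), about 67%, so artists with very large bookmark counts dominate less.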

Regarding (1) and (2) in the latter, (1) is mainly experimental, while (2) is much more accurate in most cases, assuming most good artists are also good collectors. Thus we need to balance discovery and similarity.

kent-lee commented 5 years ago

@DonaldTsang sorry about the late reply; I have been quite busy recently.

For (3), I guess my wording wasn't clear. I was just unsure of the value of the threshold. I don't know the correct method to determine the value of the cutoff point such that the recommendations are accurate and of good quality.

DonaldTsang commented 5 years ago

@Kent-Lee the best thing to do is to separate the two views. For me personally, "the total number of artworks is low and the average number of common artworks from the same artists is high" makes much more sense. The amount of commonly shared artwork (either by percentage or absolute amount) should be the base metric. Of course we can do something more complex, like looking at two or more users sharing the same bookmarks, but right now we can assume all bookmarks from the list of users go into ONE pool.

DonaldTsang commented 5 years ago

Okay, so I discovered PageRank and HITS (and also SALSA). Maybe you can try using these to find relevant items? The network would basically have these three components:

  1. Artist => Favorite page of the artist
  2. Favorites => List of favorited art
  3. Art => Artist that made the art

And if you would allow follows/followers:

  1. Artist => Follow page of the artist
  2. Follows => List of followed artist

Of course, if we want follows to go along with favorites, there will need to be a weighting system between the two types of link "methods" when using PageRank or HITS. See: https://networkx.github.io/documentation/stable/reference/algorithms/link_analysis.html
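A rough sketch of how such a network could be ranked is below. It is a plain power-iteration PageRank written without dependencies so it runs anywhere; the linked networkx module offers ready-made versions (networkx.pagerank accepts a weight edge attribute). The favorites and follows pages are collapsed into direct edges for brevity, and the node names and the two link weights are arbitrary assumptions, not tuned values.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    edges: iterable of (source, target, weight) tuples; weights are assumed positive.
    Returns a dict mapping node -> score; the scores sum to 1.
    """
    nodes = set()
    out = {}
    for src, dst, weight in edges:
        nodes.update((src, dst))
        out.setdefault(src, []).append((dst, weight))
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if node not in out)
        new = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src, targets in out.items():
            total_weight = sum(weight for _, weight in targets)
            for dst, weight in targets:
                new[dst] += damping * rank[src] * weight / total_weight
        rank = new
    return rank


FAVORITE_WEIGHT = 1.0  # assumed weight of a "favorited this art" link
FOLLOW_WEIGHT = 0.5    # assumed weight of a "follows this artist" link

edges = [
    ("artist:U", "art:1", FAVORITE_WEIGHT),  # U favorited art 1
    ("artist:U", "art:2", FAVORITE_WEIGHT),  # U favorited art 2
    ("art:1", "artist:A", 1.0),              # art 1 was made by A
    ("art:2", "artist:A", 1.0),              # art 2 was made by A
    ("artist:U", "artist:B", FOLLOW_WEIGHT), # U follows B
]
scores = pagerank(edges)
for node, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{node}: {score:.3f}")
```

In this toy graph artist A, reached through two favorited artworks, ends up ranked above artist B, who is only followed. Changing the ratio between the two weights shifts how much follows count relative to favorites.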

kent-lee commented 5 years ago

@DonaldTsang thank you for yet another great suggestion! I just implemented ranking functionality to get the top N ranking artworks given certain parameter values. I was thinking of using this to find good recommendations and relevant items, but you just gave me another method that I can try. So, thank you again for the information; I will certainly get to it shortly.

DonaldTsang commented 5 years ago

@Kent-Lee no need to thank me, we are all in the same boat.

DonaldTsang commented 4 years ago

For Pixiv, to get what an artist follows, just go to https://www.pixiv.net/bookmark.php?id=<some_id>&type=user. All followings on that page will be under <a> inside <li> within the page's <ul> list, and everything is paginated, so you can move to the next page with p=2 or some other number. From that we can pull some tricks from "Twitter Following Graphs" (who follows who on Twitter) and rank people based on what they liked (link prediction and community detection).
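A small sketch of that idea using only the standard library is below. It builds the paginated URL and pulls <a> links out of <li> items inside a <ul>. Fetching the page is left out, and the sample HTML (including the /member.php?id= link format) is invented for the demo, so the real markup may well differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def following_url(user_id, page=1):
    """Build the paginated following-list URL described above."""
    query = urlencode({"id": user_id, "type": "user", "p": page})
    return "https://www.pixiv.net/bookmark.php?" + query


class FollowingParser(HTMLParser):
    """Collect href values of <a> tags nested inside <li> within a <ul>."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0
        self.li_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li" and self.ul_depth:
            self.li_depth += 1
        elif tag == "a" and self.li_depth:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth:
            self.ul_depth -= 1
        elif tag == "li" and self.li_depth:
            self.li_depth -= 1


# Hypothetical markup standing in for one page of followings.
sample_html = (
    '<ul>'
    '<li><a href="/member.php?id=111">Artist A</a></li>'
    '<li><a href="/member.php?id=222">Artist B</a></li>'
    '</ul>'
    '<a href="/help">Help</a>'
)
parser = FollowingParser()
parser.feed(sample_html)
links = parser.links
print(following_url(123, page=2))
print(links)
```

Looping page upward until a page yields no links would walk the whole following list, and the resulting (follower, followed) pairs are the edges a following graph needs.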