mvabdi / vsco-scraper

Easily allows for scraping a VSCO
MIT License
133 stars, 25 forks

Scraper only collects 118-byte files #49

Open Pr0j3ct opened 1 month ago

Pr0j3ct commented 1 month ago

Approx 2 weeks ago, the scraper started collecting only 118-byte files.

Does not appear to be IP address related. Has the VSCO API changed?

sideloading commented 1 month ago

Same issue here (#48). I'm using https://github.com/mikf/gallery-dl, which is working fine.

Pr0j3ct commented 1 month ago

One thing I noticed is that the image sub-domain returns a 403: i.vsco.co

but using a URL of the form vsco.co/i

returns the image without a problem.

I'm no programmer, but when I have some free time I may try to refactor at least one of the modules to support that change and see what happens.
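For anyone automating that workaround, it amounts to rewriting the blocked subdomain into a path prefix. A minimal sketch; the `rewrite_image_url` helper is hypothetical and not part of this project:

```python
from urllib.parse import urlparse, urlunparse

def rewrite_image_url(url):
    """Rewrite https://i.vsco.co/<path> to https://vsco.co/i/<path>,
    since the i.vsco.co subdomain currently returns 403."""
    parts = urlparse(url)
    if parts.netloc == "i.vsco.co":
        return urlunparse(parts._replace(netloc="vsco.co", path="/i" + parts.path))
    return url  # leave other URLs untouched

print(rewrite_image_url("https://i.vsco.co/abc/def.jpg"))
# -> https://vsco.co/i/abc/def.jpg
```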

intothevoid33 commented 1 month ago

@Pr0j3ct what do you mean?

I put a print statement into the script to see what it was trying to download. What printed out matched what I got when manually going to the gallery page, selecting an image, and then inspecting it.

parkerr82 commented 1 month ago

The API has definitely changed.

Digging through the gallery-dl project, I can see that they're using a different API call.

It's essentially /api/3.0/, whereas the current version of this project uses /api/2.0/.
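If that's the case, pointing this project at the newer version may just mean swapping the version segment of the request URL. A hedged sketch only; the `api_url` helper and the "medias/profile" endpoint are illustrative, so check gallery-dl's vsco extractor for the real paths:

```python
API_BASE = "https://vsco.co/api"

def api_url(endpoint, version="3.0"):
    # Build e.g. https://vsco.co/api/3.0/<endpoint> instead of the
    # /api/2.0/ paths this project currently uses.
    return f"{API_BASE}/{version}/{endpoint.lstrip('/')}"

# api_url("medias/profile") -> "https://vsco.co/api/3.0/medias/profile"
```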


timbo0o1 commented 1 month ago

Edit: It seems they now block the default request headers used by the script.

You can simply set custom headers on your requests to get the images.

  1. create a new entry in constants.py (this assumes `random` is imported and a `user_agents` list already exists there)

    images = {
        'User-Agent': random.choice(user_agents),
        'Accept': 'image/avif,image/webp,image/png,image/svg+xml,image/*;q=0.8,*/*;q=0.5',
        'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
        'Connection': 'keep-alive',
        'Referer': 'https://vsco.co/',
        'Sec-Fetch-Dest': 'image',
        'Sec-Fetch-Mode': 'no-cors',
        'Sec-Fetch-Site': 'same-site',
        'Priority': 'u=4, i',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
    }
  2. use them in vscoscrape.py

    def download_img_normal(self, lists):
        if lists[2] is False:
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(requests.get(lists[0], headers=constants.images, stream=True).content)
        else:
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in requests.get(lists[0], headers=constants.images, stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        return True

Alternatively you could use cloudscraper instead of the python requests.

pip install cloudscraper

import cloudscraper

class Scraper(object):
    def __init__(self, cache, latestCache):
        self.cache = cache
        self.latestCache = latestCache
        self.scraper = cloudscraper.create_scraper()

    def download_img_journal(self, lists):
        """
        Downloads the journal media in specified ways depending on the type of media

        Since Journal items can be text files, images, or videos, I had to make 3
        different ways of downloading

        :params: lists - No idea why I named it this, but it's a media item
        :return: a boolean on whether the journal media was able to be downloaded
        """
        if lists[1] == "txt":
            with open(f"{str(lists[0])}.txt", "w") as file:
                file.write(lists[0])
        if lists[2] == "img":
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(self.scraper.get(lists[0], stream=True).content)
        elif lists[2] == "vid":
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in self.scraper.get(lists[0], stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        self.progbarj.update()
        return True

    def download_img_normal(self, lists):
        """
        This function makes sense at least

        The if '%s.whatever' sections are to skip downloading the file again if it's already been downloaded

        At the time I wrote this, I only remember seeing that images and videos were the only things allowed

        So I didn't write an if statement checking for text files, so this would just skip it I believe if it ever came up
        and return True

        :params: lists - My naming sense was beat. lists is just a media item.
        :return: a boolean on whether the media item was downloaded successfully
        """
        if lists[2] is False:
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(self.scraper.get(lists[0], stream=True).content)
        else:
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in self.scraper.get(lists[0], stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        return True

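Whichever variant you go with, you can confirm the fix took by checking that no downloads are the 118-byte stubs this issue is about. A small hypothetical sanity check (`find_stub_files` is not part of the project); run it from the download directory:

```python
import os

STUB_SIZE = 118  # size of the error stub reported in this issue

def find_stub_files(directory="."):
    """Return .jpg/.mp4 files that are exactly 118 bytes, i.e. failed downloads."""
    return [
        name
        for name in sorted(os.listdir(directory))
        if name.endswith((".jpg", ".mp4"))
        and os.path.getsize(os.path.join(directory, name)) == STUB_SIZE
    ]
```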
intothevoid33 commented 1 month ago

Edit: Seems like they block the default request header which is used by the script. You could simply set a custom header to your requests to get the images. […]

That works perfectly, thank you!

spilla7 commented 3 weeks ago

Edit: Seems like they block the default request header which is used by the script.

Could someone please explain how to do this? I'd like to get this working again. I've tried gallery-dl but prefer vscoscraper.

timbo0o1 commented 3 weeks ago

Could someone please explain how to do this? […]

I've already explained how to do this. Where exactly do you need help?

spilla7 commented 2 weeks ago

I've already explained how to do this. Where exactly do you need help?

I can see where to replace the text in the constants.py file, but I'm not sure where to add the text in the vscoscrape.py file.

I've tried adding it at the end, but I get an error message when I run the script.

Cheers

AxelConceicao commented 2 weeks ago

I can see where to replace the text in the constants.py file, but I'm not sure where to add the text in the vscoscrape.py file. […]

There is nothing to replace in constants.py; just add the images dict, then add headers=constants.images exactly as he did in the download_img_normal function.

billyklubb commented 2 weeks ago

Edit: Seems like they block the default request header which is used by the script. You could simply set a custom header to your requests to get the images. […]

Hey, so I am not a programmer in the least, the first two files you are referring to constants.py and vscoscrape.py, where are those located? and where are those new entries supposed to be in the files you mention? Of course any help is sincerely appreciated!

Edit: so when I look through the git for vsco-scraper I see the two files you are talking about, I am not sure what I am supposed to do with those files. I installed vsco-scraper with pip, so in this case do I need to edit the source and perform a build/compile or something along those lines? Forgive me, I only know that the vsco-scraper is in the bin folder off of my linux profile, after that I have zero ideas on what to do... =(

timbo0o1 commented 2 weeks ago

Hey, so I am not a programmer in the least; the two files you are referring to, constants.py and vscoscrape.py, where are those located? […]

If you installed vscoscrape with pip, the files are located inside your Python installation. Edit: to locate a pip package you can use the command "pip show vsco-scraper". For example, C:\Python310\Lib\site-packages\vscoscrape. You'll find both files there (constants.py / vscoscrape.py).
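If pip show is awkward, Python itself can report where the package lives. A small sketch, assuming the package imports under the name vscoscrape:

```python
import importlib.util

# Locate the installed vscoscrape package without importing it
spec = importlib.util.find_spec("vscoscrape")
if spec is not None:
    print(spec.origin)  # path to the package; constants.py sits in the same folder
else:
    print("vscoscrape is not installed in this environment")
```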

No need to build from source; just edit the installed pip package as follows. Open constants.py in your text editor and paste this at the end of the file:

images = {
    'User-Agent': random.choice(user_agents),
    'Accept': 'image/avif,image/webp,image/png,image/svg+xml,image/*;q=0.8,*/*;q=0.5',
    'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
    'Connection': 'keep-alive',
    'Referer': 'https://vsco.co/',
    'Sec-Fetch-Dest': 'image',
    'Sec-Fetch-Mode': 'no-cors',
    'Sec-Fetch-Site': 'same-site',
    'Priority': 'u=4, i',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

Now open vscoscrape.py and search for download_img_normal. Select the whole function (down to "return True"), then replace it with my version:

    def download_img_normal(self, lists):
        if lists[2] is False:
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(requests.get(lists[0], headers=constants.images, stream=True).content)
        else:
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in requests.get(lists[0], headers=constants.images, stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        return True

billyklubb commented 2 weeks ago

if you installed vscoscrape with pip the files are located in your python installation. Edit: to locate a pip package you can use the command "pip show vsco-scraper" […]

Thank you very much!! Those changes were easy enough. My first attempt gave me an indentation error; I just needed to move the "def download_img_normal(self, lists):" line over a tab to line up with all the others, and it ran without issue! I really appreciate your time! =)

Edit: I tested it for journals, and it still produces the 118-byte files. I tried to sort it out, but the block for journals is very different...

Edit: I figured it out. I looked for the function that downloads journals and added "headers=constants.images" to the jpg and mp4 lines, and it worked like a charm!

I'm certainly not a Python programmer now... lol, but reading through your code, I see that constants must refer to the constants.py file, and .images must refer to the images entry you had me add! Thanks for helping me see it! =)