theelous3 / asks

Async requests-like httplib for python.
MIT License

Trio + asks + instrumentation as progress bar help needed #187

Open rvencu opened 3 years ago

rvencu commented 3 years ago

Hi, first, I am not sure this is the right place to ask, but it seems the most appropriate.

I am running a classic mass download job with the trio and asks libraries. As expected, I call trio.run from the main thread. In the main function I create a nursery and call .start_soon for every URL, and the actual download happens in a second function.

Now I want to use tqdm to monitor the progress and I am using this trio instrument:

import trio


class TrioProgress(trio.abc.Instrument):

    def __init__(self, total, notebook_mode=False, **kwargs):
        if notebook_mode:
            from tqdm.notebook import tqdm
        else:
            from tqdm import tqdm

        self.tqdm = tqdm(total=total, desc="Downloaded: [ 0 ] / Links ", **kwargs)

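    def task_exited(self, task):
        # trio calls this hook when a task exits; custom_sleep_data is set by the task itself.
        if task.custom_sleep_data == 0:
            self.tqdm.update(7)
        if task.custom_sleep_data == 1:
            self.tqdm.update(7)
            # Increment the counter embedded in the description text ("Downloaded: [ n ] / Links ").
            self.tqdm.desc = self.tqdm.desc.split(":")[0] + ": [ " + str( int(self.tqdm.desc.split(":")[1].split(" ")[2]) + 1 ) + " ] / Links "
            self.tqdm.refresh()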

Let's ignore the details and focus on the main task of the progress bar, i.e. to tick once for every processed URL. I thought the second function was the place to add such lines:

async def request_image(datas, start_sampleid):
    tmp_data = []

    import asks
    asks.init("trio")

    session = asks.Session(connections=64)
    session.headers = {
        "User-Agent": "Googlebot-Image",
        "Accept-Language": "en-US",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "https://www.google.com/",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    async def _request(data, sample_id):
        url, alt_text, license = data
        # Added for the progress bar: tag the current task so the instrument can inspect it on exit.
        task = trio.lowlevel.current_task()
        task.custom_sleep_data = None
        try:
            proces = process_img_content(
                await session.get(url, timeout=5, connection_timeout=40), alt_text, license, sample_id
            )
            if proces is not None:
                tmp_data.append(proces)
                # Added for the progress bar: mark this task as a successful download.
                task.custom_sleep_data = 1
        except Exception:
            return

Except that when I count the ticks, they do not equal the size of my URL list. So the progress bar does not answer the basic question: "how long until it finishes?"

Experimenting with 1 tick at every exit from the second function, the intuitive way, I noticed that there are about 2.5 to 3 times more ticks than expected. Depending on the actual URL list, this can go up to 7, as in the example above.

I would like to understand what is happening and maybe find a way to properly count finished download tasks (successful or unsuccessful). I was able to count the successful ones correctly by confirming the actual download, but all the others are lost in the mist...