obsidianmd / obsidian-importer

Obsidian Importer lets you import notes from other apps and file formats into your Obsidian vault.
https://help.obsidian.md/import

[OneNote] - HTTP 429 rate limiting during import #225

Open kquinsland opened 7 months ago

kquinsland commented 7 months ago

I am trying to import a particular OneNote notebook and am unable to do so w/o hitting rate-limit errors.

In the user interface, the rate-limit errors are manifesting themselves like so:

Failed: "Exported image 20240329083022-5.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-2.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-3.png" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-2.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-3.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "2019-2-21" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "Exported image 20240329083025-0.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "2019-2-22" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-23" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-24" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-25" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-26" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "Exported image 20240329083024-5.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "2019-2-28" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "Exported image 20240329083027-0.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083027-1.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083027-2.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...

But in the dev-tools console, the underlying 429 shows up:

(Not sure why I can't copy stack trace)

[Screenshot 2024-03-29 083749: the 429 in the dev-tools console]


Other details:

kepano commented 7 months ago

@p3rid0t do you have an idea of what might be going on? Potentially related to #120

p3rid0t commented 7 months ago

This is weird... 🤔

The function detected the error (noticeable by the message 'An error has occurred while fetching [...]'), but the if statement right after it, which checks for error code 20166, is not executed. I'll try to follow up soon.

kquinsland commented 7 months ago

The function detected the error (noticeable by the message 'An error has occurred while fetching [...]'), but the if statement right after it, which checks for error code 20166, is not executed. I'll try to follow up soon.

Take a closer look at the object in the console. I hot-patched it:

                    s.error.code === "20166") {
                        let l = +!o.headers.get("Retry-After") * 1e3 || 15e3;
                        if (console.log(`Rate limit exceeded, waiting for: ${l} ms`),
                        r < a1)
                            return await new Promise(c=>setTimeout(c, l)),
                            this.fetchResource(e, i, r + 1);
                        throw new Error("Exceeded maximum retry attempts")
                    }

I'm not a JS dev nor do I have much time to set up an env so I'm just patching the minified file in .obsidian/plugins... and reloading obsidian

After the patch, console looks promising:

(why the hell can't I right-click, copy in the Obsidian dev tools console!? I can do that in the FF/Chrome dev tools console :( )

[screenshot: dev-tools console output after the patch]

kquinsland commented 7 months ago

Hmm. I was a bit too optimistic after the change and managed to get myself put back in the cool-down bucket.

I can't even list notebooks now (this is the same behavior I was getting last night when trying to import everything...)

[screenshot: notebook list failing to load]

And here's the kicker:

[screenshot: the 429 response headers]

Note that there is no Retry-After header sent by Microsoft?

And if I'm reading this right:

let l = +!o.headers.get("Retry-After") * 1e3 || 15e3;

A missing Retry-After header will mean 1000 ms, because you're negating whatever .get() returns when the header isn't present (probably null or undefined)?
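To spell that out (the Fetch API's Headers.get() returns null when the header is absent), both branches of that expression evaluate like this:

// Header missing: Headers.get('Retry-After') returns null
let retryHeader = null;
let wait = +!retryHeader * 1e3 || 15e3;   // !null -> true, +true -> 1, so 1 * 1000 = 1000
console.log(wait);                        // 1000 ms

// Header present: its actual value is never used
retryHeader = '30';
wait = +!retryHeader * 1e3 || 15e3;       // !'30' -> false, +false -> 0, so 0 || 15000 = 15000
console.log(wait);                        // 15000 ms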

The docs suggest that the Retry-After header won't be present in at least some cases:

The resources listed earlier do not return a Retry-After header on 429 Too Many Requests responses.

So the fix here is to bump up the number of max retry attempts and to exponentially wait longer when retry-after is not present?

kquinsland commented 7 months ago

I figured out how to get the un-minified JS code into the plugin dir so I could start hacking on it, and came up with this:

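// For context below: `s` is the parsed error body, `o` the fetch Response, `r` the current retry attempt, and `a1` the bundle's max-retry constant.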
if (s.error.code === "20166") {
    let retryAfter = o.headers.has("Retry-After") ? +o.headers.get("Retry-After") * 1000 : null;

    if (!retryAfter) {
        console.warn("Rate limit exceeded. Retry-After header not present.");
    }

    // Exponential backoff
    let exponentialBackoff = Math.pow(2, r) * 1000; // Exponential backoff in milliseconds

    let waitTime = retryAfter || exponentialBackoff;

    console.log(`Rate limit exceeded, waiting for: ${waitTime} ms`);

    if (r < a1) {
        await new Promise(resolve => setTimeout(resolve, waitTime));
        return await this.fetchResource(e, i, r + 1);
    } else {
        throw new Error("Exceeded maximum retry attempts");
    }
}

but it looks like I need to spend more time cooling down as I can't even finish logging in / enumerating stuff to fetch:

[screenshot: another 429 while logging in / enumerating]

kquinsland commented 7 months ago

After cooling down a bit, I was able to complete the login (side note: is there an open issue to improve that functionality so auth can persist?) and found that I never got a Retry-After header... which makes sense in light of the docs linked above.

Given that the header in question likely won't be present for the majority of requests / cases, I'm at a bit of a loss as to how to "fix" this:

I've not looked at the API too closely, but it seems like the correct way would be to fetch a full section/sub-section at a time instead of making one request per page in a given section/sub-section. Assuming there's a reliable delimiter for each "page" in that single response, it should be easy-ish to split the payload on that delimiter and feed it into the current parse/extract/save code?
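(Another thought, completely unverified: Microsoft Graph also has a JSON batching endpoint that packs up to 20 sub-requests into a single POST to /$batch. I don't know whether the OneNote content endpoints play nicely with it, or whether the sub-requests still count individually against the throttling bucket, but it would at least cut the number of HTTP round trips. Rough sketch, with made-up names:)

// Unverified sketch: request several page bodies in one round trip via Graph JSON batching.
async function fetchPagesBatched(token: string, pageIds: string[]) {
    const requests = pageIds.slice(0, 20).map((id, i) => ({
        id: String(i + 1),
        method: 'GET',
        url: `/me/onenote/pages/${id}/content`,
    }));
    const res = await fetch('https://graph.microsoft.com/v1.0/$batch', {
        method: 'POST',
        headers: { 'Authorization': `Bearer ${token}`, 'Content-Type': 'application/json' },
        body: JSON.stringify({ requests }),
    });
    return (await res.json()).responses;   // one entry per sub-request, matched back up by id
}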

Alternatively, persist auth details so it's (slightly) less of a PITA to launch the import flow and select a sub-section at a time?

Not clear to me if the plugin checks whether a file already exists on disk before making the request for a given page. E.g.: let's say I select sub-sections "08" and "09" from the "2005" section of a particular notebook and start hitting rate limits on 2005/09/20.md. I'll power down the computer for a while to let the request buckets empty before trying again. When I'm finally able to start again, will I end up burning some of the precious requests on files that already exist?
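If it doesn't, something like this would avoid re-spending requests on pages that are already on disk (rough sketch; importPage, fetchPage and targetPath are made-up names, and the HTML-to-Markdown conversion step is omitted):

// Rough sketch: only hit the Graph API when the note isn't already in the vault.
import { Vault } from 'obsidian';

async function importPage(vault: Vault, targetPath: string, fetchPage: () => Promise<string>) {
    // getAbstractFileByPath() returns null when nothing exists at that path
    if (vault.getAbstractFileByPath(targetPath) !== null) {
        console.log(`Skipping ${targetPath}: already exists`);
        return;
    }
    const content = await fetchPage();        // the one Graph request for this page
    await vault.create(targetPath, content);  // write the note
}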

RazgovovEvgen22 commented 4 months ago

Not clear to me if the plugin checks whether a file already exists on disk before making the request for a given page.

I haven’t looked at the code, but when I resynchronize, duplicate files are created with an additional number at the end of the name

s4kutai commented 3 months ago

Hello. I ran into the same situation: I tried importing 600+ pages from OneNote, but the 429 error occurred. I looked for a solution and there are two things I'd like to say as a result.

First, as @kquinsland hot-patched, the importer's rate-limit handling seems broken. The importer expects an error response like:

{
    "code": "20166",
    ...
}

However, the actual response is like:

{
    "error": {
        "code": "20166",
        ...
    }
}

So we can hot-patch this behavior with a patch like this:


diff --git a/src/formats/onenote.ts b/src/formats/onenote.ts
index fefc278..31ecbc6 100644
--- a/src/formats/onenote.ts
+++ b/src/formats/onenote.ts
@@ -697,12 +697,13 @@ export class OneNoteImporter extends FormatImporter {
                                }
                        }
                        else {
-                               const err: PublicError = await response.json();
+                               const err_raw = await response.json();
+                               const err: PublicError | undefined = err_raw.error;

                                console.log('An error has occurred while fetching an resource:', err);

                                // We're rate-limited - let's retry after the suggested amount of time
-                               if (err.code === '20166') {
+                               if (err?.code === '20166') {
                                        let retryTime = (+!response.headers.get('Retry-After') * 1000) || 15000;
                                        console.log(`Rate limit exceeded, waiting for: ${retryTime} ms`);

However, this patch won't fix the entire problem.

Second, due to the API limits, the importer can't import more than 600 pages (or as few as 400 pages in the worst case) without reissuing the access token.

According to this page, a user can use the API at 400 req/hr, so if we handled the rate limit properly we could import 400 pages/hr. However, this other page says the lifetime of an access token Microsoft issues is a random value between 60 and 90 minutes. So we have to renew the token to import hundreds of pages (600+). (We may use this page as a reference, but it seems we need an additional scope, offline_access.)
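For illustration, renewing the token against the v2.0 endpoint would look roughly like this (assuming the original sign-in requested offline_access so a refresh_token was issued; the scope list and function name are just placeholders):

// Rough sketch: exchange a refresh_token for a fresh access token at the
// Microsoft identity platform v2.0 token endpoint. Requires `offline_access`
// in the original authorization request so a refresh_token is issued at all.
async function renewAccessToken(clientId: string, refreshToken: string): Promise<string> {
    const body = new URLSearchParams({
        client_id: clientId,
        grant_type: 'refresh_token',
        refresh_token: refreshToken,
        scope: 'Notes.Read offline_access',   // placeholder scope list
    });
    const res = await fetch('https://login.microsoftonline.com/common/oauth2/v2.0/token', {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body,
    });
    const json = await res.json();
    return json.access_token;   // good for roughly another 60-90 minutes
}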