obsidianmd / obsidian-importer

Obsidian Importer lets you import notes from other apps and file formats into your Obsidian vault.
https://help.obsidian.md/import

[OneNote] - HTTP 429 rate limiting during import #225

Open kquinsland opened 7 months ago

kquinsland commented 7 months ago

I am trying to import a particular OneNote notebook and am unable to do so w/o hitting rate-limit errors.

In the user interface, the rate-limit errors are manifesting themselves like so:

Failed: "Exported image 20240329083022-5.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-2.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-3.png" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-2.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083024-3.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "2019-2-21" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "Exported image 20240329083025-0.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "2019-2-22" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-23" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-24" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-25" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "2019-2-26" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "Exported image 20240329083024-5.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "2019-2-28" because TypeError: Cannot read properties of undefined (reading 'split')
Failed: "Exported image 20240329083027-0.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083027-1.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...
Failed: "Exported image 20240329083027-2.jpeg" because TypeError [ERR_INVALID_ARG_TYPE]: The first argument must be of type string or an instance of Buffer...

But in the dev-tools console, the underlying 429 shows up:

(Not sure why I can't copy stack trace)

[Screenshot 2024-03-29 083749: the 429 in the dev-tools console]


Other details:

kepano commented 7 months ago

@p3rid0t do you have an idea of what might be going on? Potentially related to #120

p3rid0t commented 7 months ago

This is weird... 🤔

The function detected the error (noticeable by the message 'An error has occurred while fetching [...]'), but the if statement right after it, which checks for error code 20166, is not executed. I'll try to follow up soon.

kquinsland commented 7 months ago

The function detected the error (noticeable by the message 'An error has occurred while fetching [...]'), but the if statement right after it, which checks for error code 20166, is not executed. I'll try to follow up soon.

Take a closer look at the object in the console. I hot-patched it:

                    s.error.code === "20166") {
                        let l = +!o.headers.get("Retry-After") * 1e3 || 15e3;
                        if (console.log(`Rate limit exceeded, waiting for: ${l} ms`),
                        r < a1)
                            return await new Promise(c=>setTimeout(c, l)),
                            this.fetchResource(e, i, r + 1);
                        throw new Error("Exceeded maximum retry attempts")
                    }

I'm not a JS dev nor do I have much time to set up an env so I'm just patching the minified file in .obsidian/plugins... and reloading obsidian

After the patch, console looks promising:

(why the hell can't I right-click, copy in the Obsidian dev tools console!? I can do that in the FF/Chrome dev tools console :( )

[screenshot: dev-tools console output after the patch]

kquinsland commented 7 months ago

Hmm. I was a bit too optimistic after the change and managed to get myself put back in the cool-down bucket.

I can't even list notebooks now (this is the same behavior I was getting last night when trying to import everything...)

[screenshot: notebook list failing to load]

And here's the kicker:

[screenshot: the 429 response headers]

Note that there is no Retry-After header sent by Microsoft?

And if I'm reading this right:

let l = +!o.headers.get("Retry-After") * 1e3 || 15e3;

A missing Retry-After header will mean 1000 ms, because you're negating whatever .get() returns when the header isn't present (probably null or undefined)?
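To spell that out (the Fetch API's Headers.get() returns null when the header is absent), both branches of that expression evaluate like this:

// Header missing: Headers.get('Retry-After') returns null
let retryHeader = null;
let wait = +!retryHeader * 1e3 || 15e3;   // !null -> true, +true -> 1, so 1 * 1000 = 1000
console.log(wait);                        // 1000 ms

// Header present: its actual value is never used
retryHeader = '30';
wait = +!retryHeader * 1e3 || 15e3;       // !'30' -> false, +false -> 0, so 0 || 15000 = 15000
console.log(wait);                        // 15000 ms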

The docs suggest that the Retry-After header won't be present in at least some cases:

The resources listed earlier do not return a Retry-After header on 429 Too Many Requests responses.

So the fix here is to bump up the number of max retry attempts and to exponentially wait longer when retry-after is not present?

kquinsland commented 7 months ago

I figured out how to get the un-minified JS code into the plugin dir so I could start hacking on it, and came up with this:

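// For context below: `s` is the parsed error body, `o` the fetch Response, `r` the current retry attempt, and `a1` the bundle's max-retry constant.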
if (s.error.code === "20166") {
    let retryAfter = o.headers.has("Retry-After") ? +o.headers.get("Retry-After") * 1000 : null;

    if (!retryAfter) {
        console.warn("Rate limit exceeded. Retry-After header not present.");
    }

    // Exponential backoff
    let exponentialBackoff = Math.pow(2, r) * 1000; // Exponential backoff in milliseconds

    let waitTime = retryAfter || exponentialBackoff;

    console.log(`Rate limit exceeded, waiting for: ${waitTime} ms`);

    if (r < a1) {
        await new Promise(resolve => setTimeout(resolve, waitTime));
        return await this.fetchResource(e, i, r + 1);
    } else {
        throw new Error("Exceeded maximum retry attempts");
    }
}

but it looks like I need to spend more time cooling down as I can't even finish logging in / enumerating stuff to fetch:

[screenshot: another 429 while logging in / enumerating]

kquinsland commented 7 months ago

After cooling down a bit, I was able to complete the login (side note: is there an open issue to improve that functionality so auth can persist?) and found that I never got a Retry-After header... which makes sense in light of the docs linked above.

Given that the header in question likely won't be present for the majority of requests / cases, I'm at a bit of a loss as to how to "fix" this:

I've not looked at the API too closely, but it seems like the correct way would be to fetch a full section/sub-section at a time instead of making one request per page in a given section/sub-section. Assuming there's a reliable delimiter for each "page" in that single response, it should be easy-ish to split the payload on that delimiter and feed it into the current parse/extract/save code?
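(Another thought, completely unverified: Microsoft Graph also has a JSON batching endpoint that packs up to 20 sub-requests into a single POST to /$batch. I don't know whether the OneNote content endpoints play nicely with it, or whether the sub-requests still count individually against the throttling bucket, but it would at least cut the number of HTTP round trips. Rough sketch, with made-up names:)

// Unverified sketch: request several page bodies in one round trip via Graph JSON batching.
async function fetchPagesBatched(token: string, pageIds: string[]) {
    const requests = pageIds.slice(0, 20).map((id, i) => ({
        id: String(i + 1),
        method: 'GET',
        url: `/me/onenote/pages/${id}/content`,
    }));
    const res = await fetch('https://graph.microsoft.com/v1.0/$batch', {
        method: 'POST',
        headers: { 'Authorization': `Bearer ${token}`, 'Content-Type': 'application/json' },
        body: JSON.stringify({ requests }),
    });
    return (await res.json()).responses;   // one entry per sub-request, matched back up by id
}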

Alternatively, persist auth details so it's (slightly) less of a PITA to launch the import flow and select a sub-section at a time?

Not clear to me if the plugin checks whether a file already exists on disk before making the request for a given page. E.g.: let's say I select sub-sections "08" and "09" from the "2005" section of a particular notebook and start hitting rate limits on 2005/09/20.md. I'll power down the computer for a while to let the request buckets empty before trying again. When I'm finally able to start again, will I end up burning some of the precious requests on files that already exist?
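If it doesn't, something like this would avoid re-spending requests on pages that are already on disk (rough sketch; importPage, fetchPage and targetPath are made-up names, and the HTML-to-Markdown conversion step is omitted):

// Rough sketch: only hit the Graph API when the note isn't already in the vault.
import { Vault } from 'obsidian';

async function importPage(vault: Vault, targetPath: string, fetchPage: () => Promise<string>) {
    // getAbstractFileByPath() returns null when nothing exists at that path
    if (vault.getAbstractFileByPath(targetPath) !== null) {
        console.log(`Skipping ${targetPath}: already exists`);
        return;
    }
    const content = await fetchPage();        // the one Graph request for this page
    await vault.create(targetPath, content);  // write the note
}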

RazgovovEvgen22 commented 4 months ago

Not clear to me if the plugin checks whether a file already exists on disk before making the request for a given page.

I haven’t looked at the code, but when I resynchronize, duplicate files are created with an additional number at the end of the name

s4kutai commented 3 months ago

Hello. I ran into the same situation: I tried importing 600+ pages from OneNote, but the 429 error occurred. I looked for a solution and there are two things I'd like to say as a result.

First, as @kquinsland hot-patched, the importer's rate-limit handling seems broken. The importer expects an error response like:

{
    "code": "20166",
    ...
}

However, the actual response is like:

{
    "error": {
        "code": "20166",
        ...
    }
}

So we can hot-patch this behavior with a patch like this:


diff --git a/src/formats/onenote.ts b/src/formats/onenote.ts
index fefc278..31ecbc6 100644
--- a/src/formats/onenote.ts
+++ b/src/formats/onenote.ts
@@ -697,12 +697,13 @@ export class OneNoteImporter extends FormatImporter {
                                }
                        }
                        else {
-                               const err: PublicError = await response.json();
+                               const err_raw = await response.json();
+                               const err: PublicError | undefined = err_raw.error;

                                console.log('An error has occurred while fetching an resource:', err);

                                // We're rate-limited - let's retry after the suggested amount of time
-                               if (err.code === '20166') {
+                               if (err?.code === '20166') {
                                        let retryTime = (+!response.headers.get('Retry-After') * 1000) || 15000;
                                        console.log(`Rate limit exceeded, waiting for: ${retryTime} ms`);

However, this patch won't fix the entire problem.

Second, due to the API limits, the importer can't import more than 600 pages (or as few as 400 pages in the worst case) without reissuing the access token.

According to this page, a user can use the API at 400 req/hr, so if we handled the rate limit properly we could import 400 pages/hr. However, this other page says the lifetime of an access token Microsoft issues is a random value between 60 and 90 minutes. So we have to renew the token to import hundreds of pages (600+). (We may use this page as a reference, but it seems we need an additional scope, offline_access.)
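For illustration, renewing the token against the v2.0 endpoint would look roughly like this (assuming the original sign-in requested offline_access so a refresh_token was issued; the scope list and function name are just placeholders):

// Rough sketch: exchange a refresh_token for a fresh access token at the
// Microsoft identity platform v2.0 token endpoint. Requires `offline_access`
// in the original authorization request so a refresh_token is issued at all.
async function renewAccessToken(clientId: string, refreshToken: string): Promise<string> {
    const body = new URLSearchParams({
        client_id: clientId,
        grant_type: 'refresh_token',
        refresh_token: refreshToken,
        scope: 'Notes.Read offline_access',   // placeholder scope list
    });
    const res = await fetch('https://login.microsoftonline.com/common/oauth2/v2.0/token', {
        method: 'POST',
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
        body,
    });
    const json = await res.json();
    return json.access_token;   // good for roughly another 60-90 minutes
}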