subject-f / cubarimoe


Support for OneDrive folders #19

Open BrutuZ opened 1 year ago

BrutuZ commented 1 year ago

I currently have a stand-alone script to parse a OneDrive folder (via a share link like https://1drv.ms/...) and generate a Cubari JSON, however it has a couple of downsides compared to a native proxy:

- The JSON often becomes huge since it has to list each image individually, and the OneDrive URLs are not exactly short. With multiple chapters (each sub-folder is assumed to be a chapter) the line count can escalate quickly.
- It's not "live". If I modify something in the folder I have to rerun the script to manually update the URLs in the JSON.

I'll leave the script below; it should be short and simple enough to be easily understood. Any chance OneDrive support could be integrated into Cubari based on it?

Script code:

```python
from base64 import urlsafe_b64encode
from datetime import datetime, timezone
from json import dumps as json_string
from sys import argv, stderr
import re

from requests import get as get_url

FOLDER_CONTENTS_URL = 'https://api.onedrive.com/v1.0/shares/u!{}/driveItem?$expand=children'
FILE_CONTENTS_URL = 'https://api.onedrive.com/v1.0/shares/u!{}/root/content'


def b64(onedrive_link: str) -> str:
    # The shares API takes the share URL as unpadded URL-safe base64, prefixed with "u!"
    return str(urlsafe_b64encode(onedrive_link.encode()), 'utf-8').rstrip('=')


def parse_folder(url: str) -> dict:
    folder = get_url(FOLDER_CONTENTS_URL.format(b64(url))).json()
    if not folder.get('children', []):
        print(f'Not a OneDrive folder - {url}', file=stderr)
        return None
    try:
        ctime = int(
            datetime.fromisoformat(
                folder.get('createdDateTime', '').replace('Z', '+00:00')
            ).timestamp()
        )
    except ValueError:
        ctime = int(datetime.now(timezone.utc).timestamp())
    files = []
    folders = []
    for file in folder.get('children', []):
        if 'folder' in file:
            folders.append(file.get('webUrl'))
        elif 'file' in file and 'image' in file.get('file', {}).get('mimeType', ''):
            if file.get('webUrl'):
                files.append(FILE_CONTENTS_URL.format(b64(file['webUrl'])))
            else:
                # No share URL on the item: fall back to the direct (expiring) download URL
                files.append(file.get('@content.downloadUrl'))
    return {'title': folder.get('name'), 'date': ctime, 'files': files, 'folders': folders}


if __name__ == '__main__':
    url = argv[1] if len(argv) > 1 else input('Folder share URL: ')
    print(url, file=stderr)  # echo to stderr so stdout stays valid JSON
    api = parse_folder(url)
    if not api:
        raise SystemExit(1)  # parse_folder already printed why
    gist = {
        'title': api.get('title', ''),
        'description': '',
        'artist': '',
        'author': '',
        'cover': '',
        'pages': 0,
        'chapters': {},
    }
    if api.get('folders'):
        print("It's a folder! Recursing...", file=stderr)
        # Pull the chapter number out of folder names like "Ch. 12 - Title"
        exp = re.compile(r'^(?:Ch\.? ?|Chapter )?0?([\d\.,]{1,5})(?: - )?', re.IGNORECASE)
        for index, folder in enumerate(api['folders'], start=1):
            recurse = parse_folder(folder)
            if not recurse:
                continue
            search = re.search(exp, recurse['title'])
            if search:
                chapter = search.group(1)
                title = recurse['title'].replace(search.group(), '')
            else:
                # No number in the name: fall back to the sub-folder's position
                chapter = str(index)
                title = recurse['title']
            gist['chapters'][chapter] = {
                'title': title,
                'last_updated': recurse['date'],
                'groups': {'OneDrive': recurse['files']},
            }
            gist['pages'] += len(recurse['files'])
            if not gist['cover'] and recurse['files']:
                gist['cover'] = recurse['files'][0]
    else:
        gist['chapters']['1'] = {
            'title': api.get('title', ''),
            'last_updated': api.get('date'),
            'groups': {'OneDrive': api.get('files')},
        }
        gist['pages'] = len(api.get('files', []))
        gist['cover'] = gist['cover'] or (api.get('files') or [''])[0]
    print(json_string(gist, indent=4))
```
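
For reference, this is roughly how it's run, assuming the script is saved as onedrive_to_cubari.py (the filename and share link are placeholders):

```
python onedrive_to_cubari.py "https://1drv.ms/f/s!EXAMPLE" > series.json
```

The printed JSON then goes into a gist (or any raw-accessible host) for Cubari to read.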
funkyhippo commented 1 year ago

I took a quick look at the OneDrive API and I was surprised there weren't any strict rate limits for unauthenticated calls. It's possible they exist but are hidden from the caller (how do they expect us to gracefully handle 429s?).

Regarding your points:

The JSON often becomes huge since it has to list each image individually, and the OneDrive URLs are not exactly short. With multiple chapters (each sub-folder is assumed to be a chapter) the line count can escalate quickly.

Our API responses are compressed, so it's not so terrible for our users. For example, the OPM gist is 1.4 MB raw but the API response is 123 KB.
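
As a quick illustration of the wire savings; a rough sketch (the gist ID is a placeholder, and Content-Length may be missing when the server chunks the response):

```python
import requests

# Placeholder gist ID; any large JSON endpoint behaves the same way
url = 'https://api.github.com/gists/GIST_ID'

resp = requests.get(url, headers={'Accept-Encoding': 'gzip'})
wire = resp.headers.get('Content-Length')  # compressed bytes on the wire, if reported
body = len(resp.content)                   # requests transparently decompresses the payload
print(f'transferred ~{wire} B for {body} B of JSON')
```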

It's not "live". If I modify something in the folder I have to rerun the script to manually update the URLs in the JSON.

I agree that this isn't ideal, but a workaround could be to write a cron job or some other recurring script that checks and updates your gists. You can easily do this for free with GitHub Actions in public repos, and it's also how I keep the OPM gist up to date automatically.
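
Not the exact setup behind the OPM gist, but as a sketch of the update step (the env var, gist ID, and filename below are placeholders; the token needs the gist scope), the gist API lets a script overwrite a file in place:

```python
import json
import os

import requests

GH_TOKEN = os.environ['GH_TOKEN']  # placeholder: a PAT with the "gist" scope
GIST_ID = 'your-gist-id'           # placeholder
FILENAME = 'series.json'           # placeholder


def update_gist(payload: dict) -> None:
    # PATCH /gists/{id} replaces the named file's content
    resp = requests.patch(
        f'https://api.github.com/gists/{GIST_ID}',
        headers={'Authorization': f'token {GH_TOKEN}'},
        json={'files': {FILENAME: {'content': json.dumps(payload, indent=4)}}},
    )
    resp.raise_for_status()
```

Run that on a schedule against your parser's output and the gist stays current without hand-editing.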

I also have some concerns about PII leaking from OneDrive share URLs: the API response includes the user's full name, and there's no easy way for us to warn the user of this fact.

BrutuZ commented 1 year ago

Beats me; they do own the infrastructure, so it's not that shocking. As long as the rate limit is reasonable, like for Gists (which they also own), it should be fine.

I'm not fond of the idea of using a public repo, and since Cubari can't deal with private repos and GH doesn't allow secondary accounts (I tried), I resorted to a secret gist as the best compromise. However, it doesn't even allow folders, let alone GH Actions 😅. I also don't change files that often; it's just that when I do, it takes longer to edit the JSON since I have to manually copy the image arrays. If I did it more frequently, I would probably bother to write something to automate that too.

As for the PII concerns, those are valid. I believe an announcement on the page, like the one for the git.io deprecation, would be a visible enough disclaimer.

funkyhippo commented 1 year ago

GH doesn't allow secondary accounts (I tried)

Interesting, I haven't come across any issues having multiple GH accounts (this isn't my main account).

...However, it doesn't even allow folders...

You can have multiple root-level files in a gist at least, so you can add to the same secret gist rather than creating a new one for each series.

It probably wouldn't be too difficult to set up a template repo that runs your script as a scheduled action, since you can control the inputs through repo secrets. That way, anybody can fork it and change a couple of secrets to bootstrap some sort of automation.
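
Roughly what I have in mind for the workflow file; a sketch only, with placeholder script and secret names:

```yaml
# .github/workflows/update.yml - sketch, not a published template
name: Update Cubari gist
on:
  schedule:
    - cron: '0 */6 * * *'  # every six hours
  workflow_dispatch:        # allow manual runs too
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - run: pip install requests
      - run: python onedrive_to_cubari.py "$FOLDER_URL"
        env:
          FOLDER_URL: ${{ secrets.FOLDER_URL }}  # placeholder secret names
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
```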

BrutuZ commented 1 year ago

Interesting, I haven't come across any issues having multiple GH accounts (this isn't my main account).

When I created a secondary account, it was instantly flagged as limited, even before I created the first repo. Upon contacting GH support to have it unlocked, I was informed that they couldn't, because having multiple accounts was against the ToS. That made sense at the time, since extra accounts could be used to dodge limitations on free accounts, such as how many private repos one could create.

You can have multiple root-level files in a gist at least, so you can add to the same secret gist rather than creating a new one for each series

I do have several files in that same gist. Not having folders just makes it "ugly" once you have a few script files sitting alongside hundreds of JSON files.

I took a quick look at the OneDrive API and I was surprised there weren't any strict rate limits for unauthenticated calls. It's possible they exist but are hidden from the caller (how do they expect us to gracefully handle 429s?).

OneDrive provides a retry timer (a Retry-After header) in the 429 response: https://learn.microsoft.com/en-us/onedrive/developer/rest-api/?view=odsp-graph-online#throttling
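
So a proxy could honor that directly; a minimal sketch (the helper name is mine, and the exponential fallback is just a guess at sensible behavior when the header is absent):

```python
import time

import requests


def get_with_retry(url: str, max_attempts: int = 5) -> requests.Response:
    """Retry throttled OneDrive calls, honoring the Retry-After header on 429s."""
    for attempt in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # The docs say Retry-After carries a delay in seconds; back off exponentially if absent
        time.sleep(int(resp.headers.get('Retry-After', 2 ** attempt)))
    resp.raise_for_status()  # still throttled after max_attempts
    return resp
```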