jonorthwash / ud-annotatrix

GNU General Public License v3.0

Github integration #376

Open yaskevich opened 5 years ago

yaskevich commented 5 years ago

Supershort formulations of tasks:

It seems that code for interacting with GitHub is already in the app's code base, but there is no information on how a user can start using this feature, and it is not clear from which step development should continue.

jonorthwash commented 5 years ago

The first step is to implement loading of a corpus from a GitHub url.
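A minimal sketch of what accepting a GitHub URL could involve: turning a `github.com/.../blob/...` page link into a raw-content link that can be downloaded directly. The function name `github_blob_to_raw` is hypothetical, and the sketch assumes the branch name contains no slashes; it is not taken from the Annotatrix code base.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url):
    """Convert a GitHub 'blob' page URL into a raw.githubusercontent.com URL.

    Assumption: the ref (branch, tag, or commit) is a single path segment.
    """
    parts = urlparse(url)
    if parts.netloc != "github.com":
        raise ValueError("not a github.com URL: " + url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / ref / path...
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a link to a file: " + url)
    owner, repo, _, ref = segments[:4]
    path = "/".join(segments[4:])
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"
```

For example, a link to a file under `UniversalDependencies/UD_English-EWT/blob/master/` would map to the same path under `raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/`.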

yaskevich commented 5 years ago

Ok, I tested, it works. It might be reasonable to set a size limit for downloadable corpora, since I found some issues related to big files from GitHub (#377).

jonorthwash commented 5 years ago

No, we allow whatever the user's system won't crash on, and file a bug against notatrix regarding inefficient parsing / memory usage.

jonorthwash commented 5 years ago

> Ok, I tested, it works.

Which part? Was it already implemented??

yaskevich commented 5 years ago

Uploading a corpus from Github link was implemented before, and it works.

As for big files, that is a notatrix problem (discussed in #377; the notatrix issue is at https://github.com/keggsmurph21/notatrix/issues/6).

jonorthwash commented 5 years ago

> Uploading a corpus from Github link was implemented before, and it works.

That's great, but the remaining pieces of this issue still need to be implemented.

yaskevich commented 5 years ago

I've managed to connect the app to GitHub. As far as I can see, there are some parts of the code that are related to GitHub, but they are more like a draft. When the user's credentials are put in the app config and the user clicks the login button, OAuth authentication is initiated, after which two menu items appear: "Manage repositories" and "Manage permissions" (but they don't provide an interface to manage anything). The current goal is forking; the next is pull requests. I separated the function for loading content from the one for loading and forking.

[Screenshot: 2019-07-04 17_28_42-UD Annotatrix]

As I understand, a full workflow could be like this:

yaskevich commented 5 years ago

Subtasks of GitHub integration related to downloading a corpus, editing it, pushing it back, and preparing a pull request (back end). Progress is marked with checkmarks.

Note on the previous workflow description: it is not necessary to make a local repo to push the changes.
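A rough sketch of how forking and opening a pull request can be done purely through the GitHub REST API, with no local clone. The function names and the `api(method, path, body)` callable are hypothetical (any authenticated HTTP helper returning parsed JSON would do); the two endpoints, `POST /repos/{owner}/{repo}/forks` and `POST /repos/{owner}/{repo}/pulls`, are GitHub's.

```python
def fork_repo(api, owner, repo):
    """Ask GitHub to fork owner/repo into the authenticated user's account.

    GitHub creates forks asynchronously, so the fork may not be usable
    immediately after this call returns.
    """
    fork = api("POST", f"/repos/{owner}/{repo}/forks", None)
    return fork["owner"]["login"]  # account that now holds the fork


def open_pull_request(api, owner, repo, head_user, branch, title, base="master"):
    """Open a PR against owner/repo from branch `branch` of head_user's fork.

    Assumption: the edited corpus has already been committed to that branch.
    """
    body = {"title": title, "head": f"{head_user}:{branch}", "base": base}
    pr = api("POST", f"/repos/{owner}/{repo}/pulls", body)
    return pr["html_url"]
```

The pull request can only be opened once the fork's branch actually differs from the base branch, which is why the two steps are kept as separate functions here.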

yaskevich commented 5 years ago

The previous comment is a tracker for the set of functions that interact with the GitHub API, for anyone who would like to keep track of the project's development but hasn't read this thread before.

yaskevich commented 5 years ago

Some thoughts on the challenges of working with files, after some experience with the GitHub API.

File size and GitHub limits

GitHub and its API are designed mainly for dealing with small files (less than 1 MB).

Thus, it's easy to get a small file or to push/commit it; it all goes smoothly via the Contents API. But it becomes trickier with files that are bigger than 1 MB. After I made it work, I had to switch from this way of interacting with GitHub to the Tree/Blob API: I read a directory, look for the file I need, get its metadata, and then fetch it as a blob and decode it.
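The tree-then-blob lookup described above can be sketched roughly as follows. This is an illustration only, not the code used in the app: `fetch_file_via_blob` and `http_get_json` are hypothetical names, the request is unauthenticated, and the whole tree is listed recursively rather than one directory at a time.

```python
import base64
import json
import urllib.request

API = "https://api.github.com"


def http_get_json(url):
    """GET a URL and parse the JSON response (unauthenticated sketch)."""
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_file_via_blob(owner, repo, ref, path, get_json=http_get_json):
    """Find `path` in the tree at `ref`, then fetch and decode its blob."""
    # Step 1: list the tree; each entry carries path, type, sha and size.
    tree = get_json(f"{API}/repos/{owner}/{repo}/git/trees/{ref}?recursive=1")
    entry = next(
        (e for e in tree["tree"] if e["type"] == "blob" and e["path"] == path),
        None,
    )
    if entry is None:
        # Note: very large trees come back with "truncated": true, in which
        # case the file may exist even though it was not listed.
        raise FileNotFoundError(path)
    # Step 2: fetch the blob by sha; its content is base64-encoded.
    blob = get_json(f"{API}/repos/{owner}/{repo}/git/blobs/{entry['sha']}")
    return base64.b64decode(blob["content"]).decode("utf-8")
```

Passing `get_json` as a parameter keeps the HTTP layer replaceable, for example with an authenticated client or a stub in tests.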

This API can process files bigger than 1 MB, but its documentation notes: "This API supports blobs up to 100 megabytes in size." GitHub is also unhappy when a user pushes something bigger than 50 MB.

For testing I use corpora from the Universal Dependencies repositories, and I haven't come across any corpus larger than 100 MB. In general, it seems possible to handle bigger files by means of Git Large File Storage (I haven't had any experience with it yet). However, it would be an issue for Annotatrix.

As for updating a corpus from the app, the situation is similar to data loading (easy for small files and tricky for bigger ones). In code, pushing changes to GitHub is a sequence of queries to GitHub's low-level API. Technically, it amounts to composing a commit and replacing a whole tree: not a single request, but 7 requests.
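To illustrate what such a low-level sequence can look like, here is a sketch that commits one file through GitHub's Git Data endpoints. It is one plausible ordering with six calls, not necessarily the seven requests mentioned above (those are not listed in this thread). `make_api` and `commit_file` are hypothetical names, and the token handling is an assumption.

```python
import base64
import json
import urllib.request


def make_api(token, base="https://api.github.com"):
    """Return a small api(method, path, body) helper (hypothetical)."""
    def api(method, path, body=None):
        data = None if body is None else json.dumps(body).encode("utf-8")
        req = urllib.request.Request(base + path, data=data, method=method, headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        })
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return api


def commit_file(api, owner, repo, branch, path, text, message):
    """Commit `text` as `path` on `branch` without a local clone."""
    root = f"/repos/{owner}/{repo}/git"
    # 1. Resolve the branch to its head commit.
    head = api("GET", f"{root}/ref/heads/{branch}")["object"]["sha"]
    # 2. Read that commit to learn its root tree.
    base_tree = api("GET", f"{root}/commits/{head}")["tree"]["sha"]
    # 3. Upload the new file content as a blob.
    content = base64.b64encode(text.encode("utf-8")).decode("ascii")
    blob = api("POST", f"{root}/blobs", {"content": content, "encoding": "base64"})["sha"]
    # 4. Build a new tree: the old one with this path pointing at the blob.
    entry = {"path": path, "mode": "100644", "type": "blob", "sha": blob}
    tree = api("POST", f"{root}/trees", {"base_tree": base_tree, "tree": [entry]})["sha"]
    # 5. Create a commit for the new tree on top of the old head.
    commit = api("POST", f"{root}/commits",
                 {"message": message, "tree": tree, "parents": [head]})["sha"]
    # 6. Move the branch to the new commit.
    api("PATCH", f"{root}/refs/heads/{branch}", {"sha": commit})
    return commit
```

A call would look like `commit_file(make_api(token), owner, repo, "master", "corpus.conllu", text, "Update corpus")`, where every argument value is a placeholder.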

As for files bigger than 100 MB, the only way could be programmatic interaction with LFS, which is poorly documented, since using it from code via GitHub is a very rare case.
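The size tiers discussed so far could be encoded as a small helper like the one below. It is only a sketch: the names are hypothetical, the thresholds are the ones quoted in this comment, and treating 1 MB as 1,000,000 bytes is an assumption.

```python
MB = 1000 * 1000  # assumption: decimal megabytes

CONTENTS_API_LIMIT = 1 * MB    # small files: Contents API is enough
PUSH_WARNING_LIMIT = 50 * MB   # GitHub complains about pushes above this
BLOB_API_LIMIT = 100 * MB      # documented ceiling of the Blob API


def pick_route(size_bytes):
    """Pick which GitHub mechanism a file of this size would need."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= CONTENTS_API_LIMIT:
        return "contents"
    if size_bytes <= BLOB_API_LIMIT:
        return "blob"
    return "lfs"


def push_may_warn(size_bytes):
    """True if pushing a file of this size is likely to trigger a warning."""
    return size_bytes > PUSH_WARNING_LIMIT
```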

In general, it looks like the more common large files become among UD users, the worse GitHub suits as a storage service. E.g., on repository limits: "We recommend repositories be kept under 1GB each. Repositories have a hard limit of 100GB. If you reach 75GB you'll receive a warning from Git in your terminal when you push. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down."

We have to keep those things in mind.