Open yaskevich opened 5 years ago
The first step is to implement loading of a corpus from a GitHub url.
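A minimal sketch of what loading from a GitHub URL might involve, assuming the user pastes a regular `https://github.com/<owner>/<repo>/blob/<ref>/<path>` link and the app wants the raw file. The function name `to_raw_url` and the example repository are hypothetical, and this is not necessarily how the app does it. Branch names containing slashes would be ambiguous with this approach.

```python
from urllib.parse import urlparse


def to_raw_url(github_url):
    """Turn a github.com 'blob' page URL into a raw.githubusercontent.com URL.

    Assumes the ref (branch, tag or commit) is a single path segment.
    """
    parts = urlparse(github_url)
    if parts.netloc != "github.com":
        raise ValueError("not a github.com URL")
    segments = parts.path.strip("/").split("/")
    # Expected shape: owner / repo / "blob" / ref / path...
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError("expected a /owner/repo/blob/ref/path URL")
    owner, repo, _, ref = segments[:4]
    file_path = "/".join(segments[4:])
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}"


# Hypothetical repository and file, for illustration only.
raw = to_raw_url("https://github.com/someuser/some-corpus/blob/master/data/test.conllu")
```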
Ok, I tested, it works. It may be reasonable to set a size limit for downloadable corpora, since I found some issues related to big files from GitHub (#377).
No, we allow whatever the user's system won't crash on, and file a bug against notatrix regarding inefficient parsing / memory usage.
Ok, I tested, it works.
Which part? Was it already implemented??
Uploading a corpus from Github link was implemented before, and it works.
As to big files, this is a notatrix problem (discussed in #377; the notatrix issue is here: https://github.com/keggsmurph21/notatrix/issues/6).
Uploading a corpus from Github link was implemented before, and it works.
That's great, but the remaining pieces of this issue still need to be implemented.
I've managed to connect the app to GitHub. As far as I can see, some parts of the code are related to GitHub, but they are more of a draft. When the user's credentials are put in the app config and the user clicks the login button, OAuth authentication is initiated, which makes two menu items appear: "Manage repositories" and "Manage permissions" (but they don't provide an interface to manage anything). The current goal is forking; the next is PRs. I separated the functions for loading content from those for loading and forking.
As I understand, a full workflow could be like this:
Subtasks of the GitHub integration related to corpus downloading, editing, pushing back, and preparing a pull request (back end). Completed items are checked.
request → axios (actively maintained, supports async/await and promise interfaces)
Note on the previous workflow description: it is not necessary to make a local repo to push the changes.
The previous comment is a tracker for the set of functions that interact with the GitHub API, for anyone who would like to keep track of the project's development but hasn't read this thread before.
Some thoughts on the challenges of working with files, after some experience with the GitHub API.
GitHub and its API are designed mainly for small files (less than 1 MB).
So getting a small file or pushing/committing it is easy; it all goes smoothly via the Content API. It becomes trickier with files bigger than 1 MB. After I made the small-file case work, I had to switch to the Tree/Blob API: I read a directory, look for the file I need, get its metadata, and then fetch it as a blob and decode it.
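The Tree/Blob steps described above can be sketched roughly as follows. This is only an illustration: the network calls are left out, and the two helpers (`find_blob_entry`, `decode_blob`, both hypothetical names) work on the JSON that the Git trees and Git blobs endpoints return, assuming the usual `tree`, `path`, `type`, `sha`, `content` and `encoding` fields.

```python
import base64


def find_blob_entry(tree_response, filename):
    """Look through a directory listing (a Git tree response) for a file.

    Returns the matching entry (with its sha and size), or None.
    """
    for entry in tree_response.get("tree", []):
        if entry.get("type") == "blob" and entry.get("path") == filename:
            return entry
    return None


def decode_blob(blob_response):
    """Decode the body of a Git blob response into text.

    Blob content normally arrives base64-encoded and split over several
    lines; b64decode ignores the line breaks by default.
    """
    if blob_response.get("encoding") == "base64":
        return base64.b64decode(blob_response["content"]).decode("utf-8")
    return blob_response["content"]


# Made-up responses standing in for the two API calls.
sample_tree = {
    "tree": [
        {"path": "README.md", "type": "blob", "sha": "aaa", "size": 120},
        {"path": "corpus.conllu", "type": "blob", "sha": "bbb", "size": 2500000},
    ]
}
entry = find_blob_entry(sample_tree, "corpus.conllu")
sample_blob = {
    "sha": entry["sha"],
    "encoding": "base64",
    "content": base64.encodebytes(b"# sent_id = 1\n").decode("ascii"),
}
text = decode_blob(sample_blob)
```

In the real flow, the `sha` from the tree entry is what gets passed to the blob request, so the file's size never has to go through the Content API.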
This API can process files bigger than 1 MB, but the documentation notes: "This API supports blobs up to 100 megabytes in size." However, GitHub is unhappy when a user pushes something bigger than 50 MB.
For testing I use corpora from the Universal Dependencies repositories, and I haven't come across any corpus larger than 100 MB. In general, it seems possible to handle bigger files by means of Git Large File Storage (I haven't had any experience with it yet). However, it would be an issue for Annotatrix.
As to updating a corpus from the app, the situation is similar to data loading (easy for small files and tricky for bigger ones). Pushing changes to GitHub from code is a sequence of queries to GitHub's low-level API. Technically, it is like assembling a commit and replacing a whole tree: not a single request, but 7 requests.
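For readers unfamiliar with the low-level API, here is a rough sketch of the usual Git Data API sequence for committing one file. It lists six calls and may not match the exact seven requests mentioned above; the function name `plan_commit_requests` and the example owner/repo are hypothetical, and no request is actually sent.

```python
def plan_commit_requests(owner, repo, branch):
    """Return the (method, path, purpose) steps of a typical low-level commit."""
    base = f"/repos/{owner}/{repo}/git"
    return [
        ("GET", f"{base}/ref/heads/{branch}", "find the commit the branch points to"),
        ("GET", f"{base}/commits/{{commit_sha}}", "find that commit's tree"),
        ("POST", f"{base}/blobs", "upload the new file content as a blob"),
        ("POST", f"{base}/trees", "create a tree that references the new blob"),
        ("POST", f"{base}/commits", "create a commit pointing at the new tree"),
        ("PATCH", f"{base}/refs/heads/{branch}", "move the branch to the new commit"),
    ]


# Hypothetical repository, for illustration only.
steps = plan_commit_requests("someuser", "some-corpus", "master")
for method, path, purpose in steps:
    print(f"{method:5} {path}  # {purpose}")
```

The `{commit_sha}` placeholder is filled from the response to the first call; each later step likewise depends on a sha returned by the one before it, which is why the push cannot be collapsed into a single request.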
As to files bigger than 100 MB, the only way could be programmatic interaction with LFS, which is poorly documented, since using it from code is a very rare case.
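The size thresholds discussed above could be turned into a simple dispatch, sketched here. The function name `choose_transport` and the labels it returns are hypothetical, and treating "1 MB" and "100 MB" as binary megabytes is an assumption; the exact cut-offs should be checked against the GitHub documentation.

```python
MB = 1024 * 1024  # assumption: limits interpreted as binary megabytes

CONTENT_API_LIMIT = 1 * MB   # small files go smoothly via the Content API
BLOB_API_LIMIT = 100 * MB    # "This API supports blobs up to 100 megabytes in size."


def choose_transport(size_bytes):
    """Pick which GitHub mechanism a file of the given size would need."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= CONTENT_API_LIMIT:
        return "content-api"
    if size_bytes <= BLOB_API_LIMIT:
        return "tree-blob-api"
    return "lfs"


# A 2.5 MB corpus is too big for the Content API but fine for blobs.
transport = choose_transport(2_500_000)
```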
In general, it looks like the more common large files become for UD users, the less suitable GitHub is as a storage service. E.g., on limits for repos: "We recommend repositories be kept under 1GB each. Repositories have a hard limit of 100GB. If you reach 75GB you'll receive a warning from Git in your terminal when you push. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down."
We have to keep those things in mind.
Supershort formulations of tasks:
It seems that code to interact with GitHub is already in the app code base. But there is no information on how a user could start using this feature, and it is not clear from which step development should continue.