latitudegames / AIDungeon


High hosting cost - 5 GB model #41

Open ghost opened 4 years ago

ghost commented 4 years ago

Making an issue for this. I'm busy at the moment and I'm fighting with getting the Python dependencies installed on the Ubuntu subsystem on Windows, but I can spare a few minutes to create this issue.

Some thoughts -

  1. Is the notebook downloading the 5 GB model every time it runs, or does it cache the download for the user? I.e., is the download a one-time thing or an every-time thing? Sorry, I'm not familiar with how those work. If it's doing it every run, I'd take that down until a better solution is presented for mainstream users. (A download-cache sketch follows this list.)
  2. One option might be to shard the file up, put the shards onto their own GitHub repos, then combine them back after downloading. Most other hosting solutions charge for egress, but GitHub does not. I'm sure there are other options, but this is certainly something I could throw together. GitHub has a size cap of 1 GB, and they will complain after your repo hits 100 MB, so it'd be quite a few shards, but that would be manageable with some scripting. Not sure if it violates any TOS.
  3. Could also host using BitTorrent.
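
On point 1, the usual fix is to make the notebook's download cell idempotent, i.e. skip the fetch when the file already exists. A minimal sketch; the URL and local path here are hypothetical placeholders, not the notebook's actual values:

import os
import urllib.request

# Hypothetical placeholders -- substitute the notebook's real bucket URL and model path.
MODEL_URL = 'https://storage.googleapis.com/example-bucket/model-550.tar'
MODEL_PATH = 'models/model-550.tar'

def download_model_once():
    # Only hit the bucket if a previous run hasn't already fetched the model,
    # so re-running the notebook costs no extra egress.
    if os.path.exists(MODEL_PATH):
        print('Model already downloaded, skipping.')
        return
    os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
    urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)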
nickwalton commented 4 years ago

It should only download it once. I think this issue should be mostly resolved now, though, as I set up Google Cloud CDN, which should (hopefully) get rid of the high international egress costs from North America to Colab servers in other countries.

nickwalton commented 4 years ago

Nevermind. The costs are still super high. I'm going to have to shut off bucket access for now.

kylemiller3 commented 4 years ago

Would it be possible to host the file via BitTorrent and download with the tools available on Google's end?

Akababa commented 4 years ago

Is there a way to use "Mount drive" in Colab to help? https://stackoverflow.com/questions/53576555/share-a-part-of-google-drive-on-colab
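
For reference, mounting your own Drive from a notebook cell is a one-liner:

from google.colab import drive

# Prompts for OAuth on first run, then exposes the user's Drive under /content/drive.
drive.mount('/content/drive')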

nickwalton commented 4 years ago

Apparently it might be possible to download from my Drive to another person's without worrying about download fees. That might work as a temporary solution.
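
If the model is shared publicly on Drive, something like this should fetch it from a Colab cell (FILE_ID is a placeholder, not the real ID; gdown handles Drive's large-file confirmation page):

import gdown  # pip install gdown

# 'FILE_ID' is a hypothetical placeholder for the shared model's Drive file ID.
gdown.download('https://drive.google.com/uc?id=FILE_ID', 'model-550.tar', quiet=False)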


JeffreyBenjaminBrown commented 4 years ago

The first time it borked for me, I closed the page and reopened it. A few other times I force-restarted. Later I discovered that I could select "restart and run all" and it would go a lot faster, not having to re-download everything.

If that was in the README it might save you some money and/or your users some time.

ghost commented 4 years ago

I don't see anything in the GitHub TOS that says we can't just shard the file up and host each shard in its own repo.

synap5e commented 4 years ago

GitHub sets Content-Type: text/plain to prevent being used as a CDN. I know that doesn't prevent this use case, but I'd be careful about using them as a CDN regardless: even if it's not in the TOS, they may not be happy with this use and could ask you to stop. There are sites set up to act as a CDN in front of GitHub, such as https://raw.githack.com/, although I believe those are intended for .js, .html, images, etc., not bulk data.

If the gdrive option doesn't work out, the cheapest option (without running afoul of GitHub or other services) might be to run a bunch of VPSes, or a dedicated server with lots of (ideally unlimited) bandwidth, and set up a caching layer on top. It's not too hard to find servers under $50/mo with unlimited bandwidth.

ghost commented 4 years ago

I was thinking of the sharding solution as temporary. That said, this whole thing of running out of a Colab notebook might be temporary as well, since I'm sure Google didn't consider the use case of someone using it as a game engine.

I think putting this on BitTorrent makes a heck of a lot of sense and is a fantastic use case for it (legal file hosting... waaaaah?). Until there are enough seeds, though, it wouldn't hurt to have an alternative. Also, I'm sure many universities block BitTorrent, or at least they did when I was in college.

ghost commented 4 years ago

Alright, I wrote a script to upload the shards to github and it's running now. Here's the first shard: https://github.com/JamesHutchison/aidungeon2-model-550-1-of-84

Just change the 1 to whatever number to get the remaining shards. The upload is done once shard 84 has a file in it. You can recombine them by checking out all the repos into their own directories and running cat */x* > model-550.data-00000-of-00001 (assuming the only files/directories in the cwd are the repo directories). Note that this is only the model data; the other files that are downloaded could simply be committed to this repo, as they aren't that big.

I haven't tested the files to see if they get corrupted in the process. I don't think my git setup is configured to change line endings, but it's possible that might happen. I'm going to bed and will probably test sometime tomorrow, if someone doesn't beat me to it.

As of this post it's on shard 10 of 84.

nepeat commented 4 years ago

Just found the torrent you have hosted in the latest commit. Currently seeding, but for others reading this, the magnet is below so you don't have to download it from S3 and burn bandwidth.

Torrent

model_v5.torrent.zip

Magnet

magnet:?xt=urn:btih:b343b83b35bff774dab13e0281ce13b3daf37d3e&dn=model%5Fv5&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.pomf.se%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Fp4p.arenabg.com%3A1337%2Fannounce&tr=udp%3A%2F%2F9.rarbg.me%3A2710%2Fannounce&tr=udp%3A%2F%2F9.rarbg.to%3A2710%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.tiny-vps.com%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Fdenis.stalker.upeer.me%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce&tr=udp%3A%2F%2Ftracker.moeking.me%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.cyberia.is%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.si%3A1337%2Fannounce&tr=udp%3A%2F%2Fipv4.tracker.harry.lu%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker3.itzmx.com%3A6961%2Fannounce&tr=udp%3A%2F%2Fzephir.monocul.us%3A6969%2Fannounce&tr=udp%3A%2F%2Fxxxtor.com%3A2710%2Fannounce
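
For anyone pulling this from inside Colab, a cell along these lines should work (a sketch, assuming the VM allows BitTorrent traffic; Colab instances are Ubuntu, so apt is available):

import subprocess

# Trackers omitted from the magnet for brevity; aria2c will still find peers over DHT.
MAGNET = 'magnet:?xt=urn:btih:b343b83b35bff774dab13e0281ce13b3daf37d3e&dn=model%5Fv5'

subprocess.run(['apt-get', '-qq', 'install', '-y', 'aria2'], check=True)
# --seed-time=0 exits as soon as the download completes instead of staying to seed.
subprocess.run(['aria2c', '--seed-time=0', '--dir=/content', MAGNET], check=True)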
louisgv commented 4 years ago

I think it should be fine to show a prompt telling the user to grab the model via torrent instead of trying to download it in the installation script. It will also help with seeding, since volunteers will keep it seeded forever. Then they can make a copy of the Google Colab notebook, upload the model to their personal Drive, and play it from there. It would be a bit of work, but more sustainable IMO.

MrKrzYch00 commented 4 years ago

I will keep seeding the torrent on a 200/200 Mbps connection. I know a lot of people may be turned off by the waiting time, so I hope it helps at least a bit. This masterpiece deserves people's attention.

EDIT: Almost 4 hours later, U/L ratio: 41.

ekmett commented 4 years ago

@JamesHutchison Your cat instructions will break because you didn't prefix the single digit numbers with a 0.

$ ls */x*
aidungeon2-model-550-1-of-84/xaa
aidungeon2-model-550-10-of-84/xaj
aidungeon2-model-550-11-of-84/xak
aidungeon2-model-550-12-of-84/xal
aidungeon2-model-550-13-of-84/xam
aidungeon2-model-550-14-of-84/xan
aidungeon2-model-550-15-of-84/xao
aidungeon2-model-550-16-of-84/xap
aidungeon2-model-550-17-of-84/xaq
aidungeon2-model-550-18-of-84/xar
aidungeon2-model-550-19-of-84/xas
aidungeon2-model-550-2-of-84/xab
aidungeon2-model-550-20-of-84/xat
aidungeon2-model-550-21-of-84/xau
aidungeon2-model-550-22-of-84/xav
...

means cat will assemble them in the wrong order.

Some judicious moving will fix the assembly order:

$ for f in aidungeon2-model-550-?-of-84; do mv "$f" "aidungeon2-model-550-0${f#aidungeon2-model-550-}"; done

(I'm sure there is a better syntax for that, but I'm tired.)

ghost commented 4 years ago

Ah, good point. I was just checking on this and the md5 wasn't matching; that might be why. Sorry, I was throwing together instructions without the time to test them.

ghost commented 4 years ago

And TIL that on GitHub, to rename a repo you have to both click the rename button and hit Enter for the rename to actually happen.

ghost commented 4 years ago

Alright, here's a script that pulls from the repos and rebuilds the model file. I'm pretty busy today and can't really spend time to clean it up. This just generates the model file; copying it to the correct location is missing, and the script won't work on Windows as-is. If you are a Windows user, you can either execute this in Cygwin or the Ubuntu subsystem, or update the script so the call to cat is replaced with (I would imagine) a glob followed by reading the files and writing them to an output file. The md5 calculation would need to be moved to a pure-Python implementation, using hashlib I would imagine. A sketch of that portable variant is appended below.

Edit: updated to use a zip file that now contains all the model files.

import os
import subprocess

FILENAME = 'model-550.zip'
repo_template = 'aidungeon2-model-550-zip-{shard}-of-78'

def clone_repos():
    # 78 zero-padded shard repos (01..78), so the cat glob below sorts correctly.
    for i in range(78):
        shard = "%02d" % (i + 1)
        repo_name = repo_template.format(shard=shard)
        url = "https://github.com/JamesHutchison/{repo_name}".format(repo_name=repo_name)
        os.system('git clone %s' % url)

def rebuild_model():
    # Concatenate the shard files (split(1) names them xaa, xab, ...) in glob order.
    os.system('cat */x* > %s' % FILENAME)

def check_md5():
    expected_md5 = 'cb07f8fcecea5c3a418533296cbd088d'
    output = subprocess.check_output(['md5sum', FILENAME])
    actual_md5 = output.decode().strip().split()[0]
    print("Expected md5 of %s, got %s" % (expected_md5, actual_md5))
    assert actual_md5 == expected_md5

clone_repos()
rebuild_model()
check_md5()

I'm skeptical this is going to be the preferred method but at least we have another alternative if the other methods aren't working for some reason
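
For reference, here is the pure-Python variant mentioned above (stdlib only, so it also runs on Windows); it replaces the cat call with glob and md5sum with hashlib:

import glob
import hashlib

def rebuild_model_portable(filename='model-550.zip'):
    # Pure-Python replacement for `cat */x* > model-550.zip`; sorted glob order
    # matches the shell glob because the repo names are zero-padded.
    with open(filename, 'wb') as out:
        for shard in sorted(glob.glob('*/x*')):
            with open(shard, 'rb') as f:
                out.write(f.read())

def check_md5_portable(filename='model-550.zip',
                       expected='cb07f8fcecea5c3a418533296cbd088d'):
    # Pure-Python replacement for md5sum; hash in 1 MiB chunks so the multi-GB
    # file never has to fit in memory.
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    assert md5.hexdigest() == expected, md5.hexdigest()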

TheReal1604 commented 4 years ago

@nickwalton I'd also like to support this project with free bandwidth. Seeding the magnet link @nepeat posted at 500 Mbit/s.

EDIT: @MrKrzYch00 I'm at a ratio of ~342 now. :grin:

ghost commented 4 years ago

When I get time later, I'm going to redo the GitHub-sharded file as a zip containing the same files as the torrent download. Better to keep things consistent, and the file size will be a little smaller.

MrKrzYch00 commented 4 years ago

I think the situation with the torrent overload has been resolved. Back then there were almost 700 peers, which slowly dropped to ~400-450 before @TheReal1604's help (and anyone else on non-aria clients who kept seeding). Most likely the aria command-line update helped as well. We are down to ~80 peers and download speeds are very good, mostly ~5-25 MiB/s.

ghost commented 4 years ago

Updated the code block above to point to the new zip repos, which contain the same files as the torrent. You just need to unzip to the right place after getting the file.

sbrichardson commented 4 years ago

Seeding the torrent files currently; I saw about 25-50 MB/s download. I'm seeding on a 1 Gb/s link. Appreciate your work!

arshem commented 4 years ago

Seeding on my seedbox, 1 Gbps link as well. This is absolutely awesome! Thanks for sharing!

szepeviktor commented 4 years ago

@nickwalton Please see https://www.feralhosting.com/pricing

ZerxXxes commented 4 years ago

I added the v5 model to IPFS so that it now can be reached via all IPFS public gateways such as Cloudflares: https://cloudflare-ipfs.com/ipfs/QmRkuYGhAcNFz9FZq3xEFduXyihbUwmbhPbuakjhn9SRVQ
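
If the gateway route works for you, fetching is just an HTTP GET (assuming the CID resolves to a single file; if it is a directory listing, you would append the filename to the path):

import urllib.request

CID = 'QmRkuYGhAcNFz9FZq3xEFduXyihbUwmbhPbuakjhn9SRVQ'
# Any public IPFS gateway serves the same content-addressed data.
urllib.request.urlretrieve('https://cloudflare-ipfs.com/ipfs/' + CID, 'model_v5')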

exotime commented 4 years ago

Seeding on gigabit for the foreseeable future too.

JKamsker commented 4 years ago

I am also seeding at 1 Gbit, 8 hours straight now. (Ratio is x200 now :P)

If I'm allowed to ask: what were the costs of the hosted storage? @JamesHutchison @nickwalton

nickwalton commented 4 years ago

We still owe Google $30k...


karibuTW commented 4 years ago

Hello guys, do you need to host these on different servers? I have a few dedicated servers in France and Vietnam. I'd be happy to provide space and bandwidth to support the project.

JKamsker commented 4 years ago

> Hello guys, do you need to host these on different servers? I have a few dedicated servers in France and Vietnam. I'd be happy to provide space and bandwidth to support the project.

Just seed the torrent, I think that might help the most :)

JKamsker commented 4 years ago

@JamesHutchison Another idea that came to my mind is to upload the model to a Google Drive and share the link. I don't know exactly if it's possible via the API, but there is a function which copies public shared files to your own Drive. That bypasses the download limit on shared files.
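
For the record, Drive API v3 does expose this as files().copy; a minimal Colab sketch (SHARED_FILE_ID is a placeholder for the public model's file ID, and google-api-python-client ships with Colab):

from google.colab import auth
from googleapiclient.discovery import build

auth.authenticate_user()  # one-time OAuth prompt inside Colab
drive_svc = build('drive', 'v3')

# 'SHARED_FILE_ID' is a hypothetical placeholder for the publicly shared model.
copied = drive_svc.files().copy(fileId='SHARED_FILE_ID',
                                body={'name': 'model_v5_copy'}).execute()
print('Copied into your Drive as file', copied['id'])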

karibuTW commented 4 years ago

> Hello guys, do you need to host these on different servers? I have a few dedicated servers in France and Vietnam. I'd be happy to provide space and bandwidth to support the project.

> Just seed the torrent, I think that might help the most :)

Okay, I have added 3 dedicated servers to it: 200M in Vietnam, 100M in France, and 1G in France, seeding 24/7.

I am actually surprised to see my 1G server uploading at 50 MB/s at the moment. Quite some demand indeed! I have 2 more dedicated servers on 100M, but I feel users expect gigabit connections now haha

louisgv commented 4 years ago

> @JamesHutchison Another idea that came to my mind is to upload the model to a Google Drive and share the link. I don't know exactly if it's possible via the API, but there is a function which copies public shared files to your own Drive. That bypasses the download limit on shared files.

Yup, the notebook on the develop branch has a utility function that will let you do just that!

ben-bay commented 4 years ago

> @JamesHutchison Another idea that came to my mind is to upload the model to a Google Drive and share the link. I don't know exactly if it's possible via the API, but there is a function which copies public shared files to your own Drive. That bypasses the download limit on shared files.

> Yup, the notebook on the develop branch has a utility function that will let you do just that!

Hoping to move it all over to master soon!

taliptako commented 4 years ago

@JamesHutchison Why not just share it as a release? Releases don't have a bandwidth limit: https://help.github.com/en/github/administering-a-repository/about-releases#limitations-on-binary-files

> We don't limit the total size of your binary release files, nor the bandwidth used to deliver them. However, each individual file must be under 2 GB in size.

We'd just have to split the files and upload them; people who download would then just need to join them before use.
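
A sketch of that split/join step (the part size is arbitrary, just kept under the 2 GB per-file cap; nothing here is specific to this repo):

import glob

PART_SIZE = 1900 * 1024 * 1024  # stay safely under GitHub's 2 GB release limit
BUF = 64 * 1024 * 1024          # stream in 64 MiB pieces to keep memory flat

def split_for_release(path):
    # Writes path.part000, path.part001, ... each no larger than PART_SIZE.
    with open(path, 'rb') as src:
        part = 0
        data = src.read(min(BUF, PART_SIZE))
        while data:
            written = 0
            with open('%s.part%03d' % (path, part), 'wb') as dst:
                while data:
                    dst.write(data)
                    written += len(data)
                    if written >= PART_SIZE:
                        break
                    data = src.read(min(BUF, PART_SIZE - written))
            part += 1
            data = src.read(min(BUF, PART_SIZE))

def join_from_release(path):
    # Concatenates the parts back in zero-padded sorted order.
    with open(path, 'wb') as dst:
        for part in sorted(glob.glob(path + '.part*')):
            with open(part, 'rb') as src:
                for chunk in iter(lambda: src.read(BUF), b''):
                    dst.write(chunk)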

MrKrzYch00 commented 4 years ago

I will keep the model hosted at http://virtual.4my.eu/AIdungeon2/ for people having trouble getting the torrent (blocked ports or other reasons). The uplink may not be that great, so for speed the torrent may still be the better option.