pykeen / benchmarking

📊 Results from the reproducibility and benchmarking studies presented in "Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework" (http://arxiv.org/abs/2006.13365)
MIT License

Upload model files via git LFS #5

Closed: mberr closed this issue 2 years ago

mberr commented 4 years ago

We can use git-lfs [0] to include the weights of the trained models in this repository.

For HPO, I would suggest saving only the best model; for the reproducibility study, all models are of interest.

[0] https://help.github.com/en/github/managing-large-files/configuring-git-large-file-storage
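
Concretely, tracking could look roughly like this: run `git lfs install` once, then `git lfs track "*.pt"` (the `*.pt` pattern is an assumption; adjust it to however the weights are actually serialized), which records the rule in `.gitattributes`:

```
# .gitattributes: the rule below is what `git lfs track "*.pt"` writes
# (*.pt is an assumed extension for the serialized models)
*.pt filter=lfs diff=lfs merge=lfs -text
```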

mberr commented 4 years ago

@mali-git (and anyone running experiments): what do you think about that?

mali-git commented 4 years ago

Sounds good! I will create an issue.

mberr commented 4 years ago

This is an issue for exactly this reason :sweat_smile:

mali-git commented 4 years ago

I created a related one in the other repo, so that we don't forget to extend the current implementation ;)

cthoyt commented 4 years ago

I think we won't be able to store all of the artifacts in git, even with LFS and compression :/

Will try and solve with mali-git/POEM_develop#476 and mali-git/POEM_develop#477

mberr commented 4 years ago

@mali-git @cthoyt a general question:

Should we really make all of the storage options part of the library? This will blow up the dependencies, and we won't be able to cover all special cases anyway.

For FTP, I suppose you can set up a mount that exposes an FTP folder somewhere in the file system, or just move the file after it has been saved to disk. This would also avoid the credential problem, since e.g. for SFTP the .ssh/config is used.
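
Roughly, the "move after save" variant could look like this (just a sketch: it assumes the model is a `torch.nn.Module` saved via `torch.save`, and that the remote share is already mounted locally, e.g. via sshfs at a made-up `/mnt/remote`):

```python
import shutil
from pathlib import Path

import torch


def save_then_move(model, filename: str, mount: str = "/mnt/remote") -> Path:
    """Save a model to the local disk first, then move it onto a mounted share.

    Credentials are handled by the mount itself (e.g. sshfs reads
    ~/.ssh/config), so the library never needs to know about them.
    """
    local = Path(filename)
    torch.save(model.state_dict(), local)  # ordinary local save
    target = Path(mount) / local.name
    shutil.move(str(local), str(target))  # the mounted share behaves like a normal folder
    return target
```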

mali-git commented 4 years ago

> @mali-git @cthoyt a general question:
>
> Should we really make all of the storage options part of the library? This will blow up the dependencies, and we won't be able to cover all special cases anyway.
>
> For FTP, I suppose you can set up a mount that exposes an FTP folder somewhere in the file system, or just move the file after it has been saved to disk. This would also avoid the credential problem, since e.g. for SFTP the .ssh/config is used.

I actually have the same concerns as @mberr.

cthoyt commented 4 years ago

We shouldn't underestimate the ubiquity of cloud-based services in machine learning. Saying we support working with AWS out of the box is a huge selling point for PyKEEN that other KGE (and in general, ML) libraries don't have.

We could even consider other options later. I already have a way in mind to generalize this through classes (adapter pattern).
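
Something like the following, just as a sketch: none of these class names exist in PyKEEN, and the S3 variant assumes `boto3` plus configured AWS credentials.

```python
import os
import shutil
from abc import ABC, abstractmethod


class ArtifactStore(ABC):
    """Hypothetical adapter interface for persisting trained model files."""

    @abstractmethod
    def upload(self, local_path: str, key: str) -> None:
        """Store the file at ``local_path`` under ``key``."""


class LocalStore(ArtifactStore):
    """Copies artifacts into a local (or mounted) directory."""

    def __init__(self, root: str):
        self.root = root

    def upload(self, local_path: str, key: str) -> None:
        destination = os.path.join(self.root, key)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.copy(local_path, destination)


class S3Store(ArtifactStore):
    """Uploads artifacts to an S3 bucket."""

    def __init__(self, bucket: str):
        import boto3  # imported lazily, so boto3 stays an optional dependency

        self.client = boto3.client("s3")
        self.bucket = bucket

    def upload(self, local_path: str, key: str) -> None:
        self.client.upload_file(local_path, self.bucket, key)
```

The lazy import would also address the dependency concern above: only users who actually pick the S3 backend need boto3 installed.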

cthoyt commented 4 years ago

Before we publish the paper, I think archiving everything to Zenodo would be a good idea. Where are all of the trained models living right now?

Also, how big are they? Hundreds of GB, or not that much? If we can fit them all on my computer, which has 560 GB free, then I can take care of the Zenodo upload.
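
For the Zenodo part, the upload could go through Zenodo's REST deposit API; a sketch (untested, assuming `requests` and a personal access token):

```python
import os

import requests

ZENODO = "https://zenodo.org/api/deposit/depositions"


def create_deposit(token: str) -> dict:
    """Create an empty deposition and return its metadata, incl. the bucket link."""
    response = requests.post(ZENODO, params={"access_token": token}, json={})
    response.raise_for_status()
    return response.json()


def upload_file(deposit: dict, path: str, token: str) -> None:
    """Stream a single file into the deposition's bucket."""
    bucket = deposit["links"]["bucket"]
    with open(path, "rb") as stream:
        response = requests.put(
            f"{bucket}/{os.path.basename(path)}",
            data=stream,
            params={"access_token": token},
        )
    response.raise_for_status()
```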

mberr commented 4 years ago

Some of them are living in exile on our chair's webserver.

mberr commented 4 years ago

@cthoyt I currently have the following directories hosted via our webserver:

2020-02-24-01-42-11_trouillon2016_complex_wn18
2020-02-24-06-43-53_trouillon2016_complex_fb15k
2020-02-24-07-30-01_yang2014_distmult_wn18
2020-02-24-07-31-17_yang2014_distmult_fb15k
2020-02-24-08-14-24_nickel2016_hole_wn18
2020-02-25-18-22-08_he2015_kg2e_wn18
2020-02-26-13-37-04_bordes2013_transe_fb15k
2020-02-27-00-15-32_bordes2013_transe_wn18
2020-02-27-11-42-02
2020-02-27-14-15-22_he2015_kg2e_fb15k
2020-02-29-14-11-53_sun2019_rotate_wn18
2020-03-01-10-18-31_sun2019_rotate_fb15k
2020-03-01-13-41-37_kazemi2018_simple_wn18
2020-03-01-17-43-25_kazemi2018_simple_fb15k
2020-03-02-00-24-39_ji2015_transd_wn18
2020-03-02-09-32-29_wang2014_transh_wn18
2020-03-02-10-57-11_wang2014_transh_fb15k
2020-03-03-05-40-01_ji2015_transd_fb15k
2020-03-03-09-02-22_li2015_transr_wn18
2020-03-03-10-08-46_li2015_transr_fb15k
2020-03-04-13-19-43
2020-03-05-02-49-56
2020-03-05-05-49-39
2020-03-05-10-29-05
2020-03-05-18-44-47
2020-03-06-03-26-03
2020-03-06-15-39-44
2020-03-09-14-02-15_convkb_wn18rr
2020-03-09-15-43-24

They consume 53 GiB in total.