@mali-git (and anyone running experiments): what do you think about that?
Sounds good! I will create an issue.
This is an issue for exactly this reason :sweat_smile:
I created a related one in the other repo, so that we don't forget to extend the current implementation ;)
I think we won't be able to store all of the artifacts on git, even with LFS and compression :/
Will try to solve this with mali-git/POEM_develop#476 and mali-git/POEM_develop#477
@mali-git @cthoyt a general question:
Should we really make all of the storage options part of the library? This will blow up the dependencies, and we won't be able to cover all special cases anyway.
For FTP I suppose you can set up your mounts to mount an FTP folder somewhere, or just move the file after it has been saved to disk. This would also avoid the credential problems, since e.g. for SFTP the .ssh/config will be used.
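A minimal sketch of the "move after saving" idea, assuming hypothetical local and mount paths (these are not part of the library):

```python
import shutil
from pathlib import Path

# Hypothetical paths: the training run first writes its artifacts locally ...
results_dir = Path("~/pykeen_results/2020-03-09-14-02-15").expanduser()
# ... and the FTP/SFTP target is mounted somewhere on the local file system.
mounted_share = Path("/mnt/results-share")

# Moving the finished directory keeps credentials out of the library;
# authentication is handled by the mount itself (e.g., .ssh/config for sshfs).
shutil.move(str(results_dir), str(mounted_share / results_dir.name))
```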
I actually have the same concerns as @mberr.
ftplib is part of the standard library, so that's not really an issue either way. boto3 can be given as an optional requirement because it's imported inside the functions where it's used. Usually this is bad practice, but the need to make some imports optional is a good exception.

We shouldn't underestimate the ubiquity of cloud-based services in machine learning. Saying we support working with AWS out of the box is a huge selling point for PyKEEN that other KGE (and, in general, ML) libraries don't have.

We could even consider other options later. I already have in mind a way to generalize this through classes (adapter pattern).
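To make the adapter idea concrete, here is a minimal sketch of what such a generalization could look like; all class and method names are hypothetical and not the actual PyKEEN API:

```python
import pathlib
import shutil
from abc import ABC, abstractmethod


class ResultStorage(ABC):
    """Hypothetical adapter interface for storing trained model artifacts."""

    @abstractmethod
    def save(self, source: pathlib.Path, destination: str) -> None:
        """Copy or upload the artifact at ``source`` to ``destination``."""


class LocalStorage(ResultStorage):
    """Store artifacts on the local file system (e.g., a mounted FTP folder)."""

    def save(self, source: pathlib.Path, destination: str) -> None:
        shutil.copy(source, destination)


class S3Storage(ResultStorage):
    """Store artifacts in an S3 bucket; boto3 is imported lazily so it stays optional."""

    def __init__(self, bucket: str):
        self.bucket = bucket

    def save(self, source: pathlib.Path, destination: str) -> None:
        import boto3  # optional dependency, only needed when S3 storage is used

        boto3.client("s3").upload_file(str(source), self.bucket, destination)
```

New backends could then be added without touching the training loop, and heavy dependencies stay optional because each one is imported only inside the backend that needs it.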
Before we publish the paper, I think archiving everything to Zenodo would be a good idea. Where are all of the trained models living right now?
Also, how big are they? Like hundreds of GB, or not that much? If we can get them all fitting on my computer, which has 560 GB free, then I can take care of the Zenodo stuff.
Some of them are living in exile on our chair's webserver.
@cthoyt I currently have the following directories hosted via our webserver:
2020-02-24-01-42-11_trouillon2016_complex_wn18
2020-02-24-06-43-53_trouillon2016_complex_fb15k
2020-02-24-07-30-01_yang2014_distmult_wn18
2020-02-24-07-31-17_yang2014_distmult_fb15k
2020-02-24-08-14-24_nickel2016_hole_wn18
2020-02-25-18-22-08_he2015_kg2e_wn18
2020-02-26-13-37-04_bordes2013_transe_fb15k
2020-02-27-00-15-32_bordes2013_transe_wn18
2020-02-27-11-42-02
2020-02-27-14-15-22_he2015_kg2e_fb15k
2020-02-29-14-11-53_sun2019_rotate_wn18
2020-03-01-10-18-31_sun2019_rotate_fb15k
2020-03-01-13-41-37_kazemi2018_simple_wn18
2020-03-01-17-43-25_kazemi2018_simple_fb15k
2020-03-02-00-24-39_ji2015_transd_wn18
2020-03-02-09-32-29_wang2014_transh_wn18
2020-03-02-10-57-11_wang2014_transh_fb15k
2020-03-03-05-40-01_ji2015_transd_fb15k
2020-03-03-09-02-22_li2015_transr_wn18
2020-03-03-10-08-46_li2015_transr_fb15k
2020-03-04-13-19-43
2020-03-05-02-49-56
2020-03-05-05-49-39
2020-03-05-10-29-05
2020-03-05-18-44-47
2020-03-06-03-26-03
2020-03-06-15-39-44
2020-03-09-14-02-15_convkb_wn18rr
2020-03-09-15-43-24
They consume 53 GiB in total.
We can use git-lfs [0] to include the weights of the trained models in this repository.
For HPO, I would suggest saving only the best model; for reproducibility, all models are of interest.
[0] https://help.github.com/en/github/managing-large-files/configuring-git-large-file-storage