Closed petebankhead closed 1 year ago
The LFS pointer file containing the SHA256 is already stored locally, would just need to cache the model zoo file.
The question is when should we update it? Should we store a version with the extension, and add a button to fetch the online one? Should we use the cached version, but grab the online one if we have an internet connection?
good question... i personally lean towards the second option -- use the cached option but grab the online one if internet is available. this is an aggressive update strategy, but i would prefer it because models may be updated to that zoo at any time. with an aggressive strategy, new models would be available asap.
what do others think?
I think that's by far and away the most sensible option, that way we're prioritising updates but still allowing people to use the extension offline as much as possible.
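A minimal sketch of the "use the cache, but refresh when online" behaviour discussed above. This is illustrative only: `load_zoo` and the `fetch` callable are hypothetical names, not part of the extension or of WSInfer, and the real extension would do this in Java.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_zoo(cache_path: Path, fetch: Callable[[], Optional[str]]) -> dict:
    """Return the newest zoo available, falling back to the cached copy.

    `fetch` is a hypothetical callable that returns the zoo JSON text,
    or returns None / raises OSError when there is no connection.
    """
    try:
        text = fetch()
    except OSError:
        text = None

    if text is not None:
        zoo = json.loads(text)  # parse first so a bad download never replaces the cache
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(text, encoding="utf-8")
        return zoo

    if cache_path.exists():
        # Offline: fall back to whatever was cached last time.
        return json.loads(cache_path.read_text(encoding="utf-8"))

    raise FileNotFoundError("No cached zoo file and no connection to fetch one")
```

With this shape, the extension could still start offline as long as a zoo file was cached earlier. A `fetch` built on `urllib.request.urlopen` would fit, since `urllib.error.URLError` is a subclass of `OSError`.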
Actually, we're currently a little defensive about allowing model updates. If there's a valid cached version, we don't check whether it matches upstream. This means that if a new version of one of the models is pushed to main, users would have to manually delete the cached version and re-download.
Would be better to flag that updates are available. Maybe go back to checking SHAs match when we're online?
Would it be straightforward to check whether the SHAs match when we’re online? I would prefer that because, as you wrote as well, a model could be updated in the meantime. As an aside, this is one limitation of the “main” revision: it’s a moving target. But one can also tag a static revision and set that static revision in the model zoo JSON file. Best, Jakub
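For reference, a rough sketch of what the SHA check could look like, assuming the locally stored pointer file follows the standard Git LFS pointer layout (a `version` line, an `oid sha256:<hex>` line, and a `size` line). The function names are made up for illustration and are not taken from the extension.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> str:
    """Extract the hex SHA-256 from the `oid` line of a Git LFS pointer file."""
    prefix = "oid sha256:"
    for line in text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    raise ValueError("No 'oid sha256:' line found in pointer text")


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pointer(model_path: Path, pointer_text: str) -> bool:
    """True if the cached model file has the SHA-256 recorded in the pointer."""
    return sha256_of(model_path) == parse_lfs_pointer(pointer_text).lower()
```

When online, the same comparison could be run against a freshly downloaded pointer file, and a mismatch could be flagged to the user rather than triggering an automatic re-download.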
Need to caveat this with the fact I don't know enough about the hugging face API and didn't write the part that queries it...
My guess is that people will want/need control over which version of the model is being run. If there's an update, they mustn't receive that automatically and without warning - rather, they ought to be able to select which revision is used. Otherwise they risk running multiple models across large datasets, while thinking that they are always running the same one.
With that in mind, I was kind of assuming that, if you've got a model and the SHA matches, then that is permanently relevant - and any other revision needs to be recognizably a different model (either via the name, or some other version identifier).
I guess this relates to the 'static version' thing, but I don't have a clear picture in my head how this all does/should fit together... I suspect that the QuPath extension (at least) isn't handling model versioning in any explicit way - although @alanocallaghan may know otherwise.
As far as I can tell, Hugging Face is basically a git repo. At the moment we point to main, which is a moving target, so technically tomorrow all the models could be updated, and the versions that I downloaded yesterday could be completely different to the versions downloaded by a new user, which seems like a problem.
Using a commit SHA or tag in the zoo JSON would definitely be easier to deal with - at least then we could just cache the Zoo file, download a new one if we can, and flag any new models or new versions to users
I think our current behaviour is fine until the models are updated. Whenever that happens, I think the model Zoo on main should get tags rather than using a moving target. Then, we can offer users an updated Zoo file if one is available, but allow people to stay on the old version if that's what they prefer.
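To make the "flag any new models or new versions" idea concrete, here is a small sketch that compares a cached zoo with a freshly downloaded one. It assumes the zoo is a flat mapping from model name to an entry with an `hf_revision` field, as in the snippets quoted in this thread. The real zoo file may be structured differently, and `diff_zoo` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple


def diff_zoo(cached: dict, fresh: dict) -> Tuple[List[str], Dict[str, Tuple[Optional[str], Optional[str]]]]:
    """Compare two zoo mappings.

    Returns (new_models, changed), where `new_models` lists names that appear
    only in `fresh`, and `changed` maps a model name to its
    (cached_revision, fresh_revision) pair when the two differ.
    """
    new_models = sorted(name for name in fresh if name not in cached)
    changed = {}
    for name, entry in fresh.items():
        if name not in cached:
            continue
        old_rev = cached[name].get("hf_revision")
        new_rev = entry.get("hf_revision")
        if old_rev != new_rev:
            changed[name] = (old_rev, new_rev)
    return new_models, changed
```

The result could drive a notice in the UI that an updated zoo is available, while leaving the user free to stay on the cached one.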
How will that be presented to the user? Will the choice list just grow indefinitely, or do we have/need some other way?
Interesting question. Maybe the zoo files could also be released under tags, and then we populate a dropdown of available tags? Because it'd be best to ensure people can reproduce the versions other people are using. In that context the choice list would just grow indefinitely, but I think that's probably fine, as long as it doesn't get too long
Well, I've just tried and it looks like SearchableComboBox may work as a drop-in replacement for ChoiceBox, making it not too bad even if it does get very long.
I updated the fxml and controller class and it all looks well (so far anyway).
Made PR in case you want to check it out.
So there are (at least) two ways of handling model versions:
"pancancer-lymphocytes-inceptionv4.tcga": { "description": "Pancancer lymphocytes", "hf_repo_id": "kaczmarj/pancancer-lymphocytes-inceptionv4.tcga", "hf_revision": "main" }
"pancancer-lymphocytes-inceptionv4.tcga": { "description": "Pancancer lymphocytes", "hf_repo_id": "kaczmarj/pancancer-lymphocytes-inceptionv4.tcga", "hf_revision": "v1.0.0" }
Models are never removed from this zoo. So users get one model zoo that contains all versions of all models. In my view,
@kaczmarj thoughts?
Maybe the zoo files could also be released under tags, and then we populate a dropdown of available tags?
yes, i absolutely agree! i will create tags in the repos and update the model-zoo json file to use the tags.
i would prefer option 1, where we point to every new version of a model. and to support multiple versions, perhaps we could do something as simple as append a v1 suffix to the name of the model (or v2, v3, etc).
for example:
"pancancer-lymphocytes-inceptionv4.tcga-v1.0.0": { "description": "Pancancer lymphocytes", "hf_repo_id": "kaczmarj/pancancer-lymphocytes-inceptionv4.tcga", "hf_revision": "v1.0.0" }
is that hideous? i would want to avoid having multiple zoo files, but all of the models should be available as well. if changing the json keys to include versions isn't good, perhaps we can update the schema to support multiple versions per model. changing the keys would be the simpler option, but what do you think?
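If the version does end up in the JSON key, a client could split it back out for display. This is a sketch only: the `-v<digits>` suffix convention is assumed from the example key above, and `split_versioned_key` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "-v1", "-v1.0", "-v1.0.0", ... at the end of the key.
_VERSIONED_KEY = re.compile(r"^(?P<name>.+)-v(?P<version>\d+(?:\.\d+)*)$")


def split_versioned_key(key: str) -> Tuple[str, Optional[str]]:
    """Split 'name-v1.0.0' into ('name', '1.0.0'); unversioned keys give (key, None)."""
    match = _VERSIONED_KEY.match(key)
    if match is None:
        return key, None
    return match.group("name"), match.group("version")
```

This would let a UI group entries by base name and list the versions under each one, without changing the zoo schema.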
another helpful thing for reproducibility, a DOI can be minted for huggingface model repositories. once a version is tagged, i can also mint a DOI for each repository and add that to the wsinfer json zoo information. that will greatly promote reproducibility because users can specify the model by DOI, which is more or less permanent.
perhaps we can update the schema to support multiple versions per model
Could let hf_revision be an array, which would maybe be more logical, though would obviously need a UI tweak on our end.
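A sketch of how a reader of the zoo file might accept both forms, so that existing entries with a single string keep working if `hf_revision` later becomes an array. Falling back to "main" when the field is missing is an assumption made here for illustration, not something the schema is known to specify.

```python
from typing import List


def revisions_of(entry: dict) -> List[str]:
    """Return the revisions for a zoo entry, whether hf_revision is a string or a list."""
    rev = entry.get("hf_revision", "main")  # assumed default, see note above
    if isinstance(rev, str):
        return [rev]
    return [str(r) for r in rev]
```

A dropdown of revisions per model could then be populated from the returned list in either case.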
Currently (v0.1.0) the extension always requires an internet connection. If it is unable to query the models online, the stage can't even be initialized.
This requirement should be relaxed so that the extension works when offline, provided models have been downloaded and are available.
My suggestion would be to cache any relevant files near to the models. This can also be used for the SHA checking (storing the relevant file with the model, rather than relying upon timestamps).