gigascience / gigadb-website

Source code for running GigaDB
http://gigadb.org
GNU General Public License v3.0

Snorkel for ML-training data #384

Closed (pli888 closed this issue 3 years ago)

pli888 commented 4 years ago

User Story


@ScottBGI says:

I don't know if you've come across Snorkel, but one of the main downstream use cases for a lot of the imaging (medical and camera trap), chemoinformatics and modelling data is machine learning, and Snorkel seems to be a new method that uses labeling functions (LFs) to label the training examples programmatically:

https://www.snorkel.org/features/

Is this something we want to adopt for datasets that are obviously intended as training data? Either as a curation step, or as something we or the reviewers need to encourage the authors to do during the review/revision process?

The tool seems to be open source (Apache licensed):

https://github.com/snorkel-team/snorkel

There are some examples in this pre-print, including medical images:

https://arxiv.org/abs/1711.10160

The imaging series Chris A has been proposing potentially had this angle, so it could be a nice area to try to get an exemplar/showcase paper to start with.

ChrisArmit commented 3 years ago

Snorkel would be quite cumbersome for labelling image training data. A user would have to write labelling functions rather than click on objects (see the visual relation tutorial): https://www.snorkel.org/use-cases/visual-relation-tutorial
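To make the point about writing labelling functions concrete, here is a minimal pure-Python sketch of the idea. It does not use Snorkel's actual API: the record fields (`trigger`, `caption`), the heuristics, and the label names are all hypothetical, and the simple majority vote stands in for the learned label model that Snorkel itself uses to combine LF outputs.

```python
from collections import Counter

# Hypothetical label values; ABSTAIN means "this function has no opinion".
ABSTAIN, BACKGROUND, ANIMAL = -1, 0, 1


def lf_motion_triggered(record):
    # Assumed heuristic: a motion-triggered frame probably shows an animal.
    return ANIMAL if record.get("trigger") == "motion" else ABSTAIN


def lf_timelapse(record):
    # Assumed heuristic: scheduled time-lapse frames are mostly empty.
    return BACKGROUND if record.get("trigger") == "timelapse" else ABSTAIN


def lf_caption_mentions_animal(record):
    # Assumed heuristic: look for species names in a free-text caption.
    caption = record.get("caption", "").lower()
    species = ("deer", "fox", "boar")
    return ANIMAL if any(name in caption for name in species) else ABSTAIN


LABELLING_FUNCTIONS = [
    lf_motion_triggered,
    lf_timelapse,
    lf_caption_mentions_animal,
]


def majority_vote(record):
    """Combine LF outputs by majority vote; abstain on ties or no votes."""
    votes = [lf(record) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # conflicting heuristics, no clear winner
    return counts[0][0]


# Made-up image metadata records, for illustration only.
records = [
    {"trigger": "motion", "caption": "Fox crossing the trail"},
    {"trigger": "timelapse", "caption": "Empty clearing"},
    {"trigger": "timelapse", "caption": "Deer at dusk"},
    {"trigger": "manual", "caption": ""},
]
labels = [majority_vote(r) for r in records]
print(labels)  # [1, 0, -1, -1]
```

Even in this toy case, every heuristic has to be expressed as code over metadata or extracted features, and conflicting or missing signals leave images unlabelled, which is a lot more work for a curator or author than clicking on objects in an annotation tool.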

The BioImage Model Zoo, which is a database of pretrained Deep Learning models, is closer to what I have in mind to accompany the Digital Pathology Thematic Series. More details here: https://bioimage.io/docs/#/

only1chunts commented 3 years ago

Do we still need this ticket then? @ChrisArmit, you can create a backlog ticket for BioImage Model Zoo if you think that's a useful tool to consider integrating. Then we can close this ticket, as we have no intention of implementing Snorkel.