nnf-cbn / 2019-unconference

Organisation of the 2019 unconference

Birds of a Feather: How can we best ensure integrity in training data for machine learning / deep learning / AI solutions? #15

Open tilmweber opened 5 years ago

tilmweber commented 5 years ago

Due to their predictive power, AI/deep learning methods are increasingly used in many prediction tools. The availability of libraries like TensorFlow or scikit-learn, which allow relatively easy implementation of these methods, further contributes to their popularity.

As an experimental microbiologist by training, without a specific statistics or machine learning background, I sometimes get concerned about the quality and the often VERY limited amount of (experimentally obtained) training data that is used to make very complex predictions.

In this proposed Birds of a Feather session, I would like to discuss with other experimental biologists from various fields whether they observe similar issues, and with AI specialists how we can best address this challenge.

tgardner4 commented 5 years ago

NOTES FROM THE SESSION:

Intro: We at the Inst. for Biosustainability have built popular free software for predicting enzymes from gene sequences. We've tried to apply ML and AI algorithms to improve these predictions, but have been hindered by the low quality and limited availability of training data for the algorithms.

tilmweber: also struggling with understanding how the new AI/ML algorithms work. Not as clear as the Support Vector Machines (SVM) of the previous era.

simon: hard to judge whether the data are good enough to use for a prediction, i.e., data sets are of unknown quality - or may not be suitable for the problem

where are the lines that define whether a data set is adequate for solving a problem? is there enough data to draw a conclusion? this scoring method might be an interesting place to start: https://www.ncbi.nlm.nih.gov/pubmed/20825684
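One quick empirical probe of "is there enough data" (separate from the scoring method above) is a learning curve: if held-out accuracy is still climbing as training samples are added, the set is probably too small. A minimal sketch with scikit-learn, using synthetic data as a stand-in for real measurements:

```python
# Minimal learning-curve sketch (synthetic data stands in for real measurements).
# If the validation score is still rising at the full data size, more data would help.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)

for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} training samples -> mean validation accuracy {s:.3f}")
```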

cross-validation?
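As a baseline, a minimal cross-validation sketch with scikit-learn (an SVM on synthetic stand-in data); a large spread between folds, or scores near chance level, is itself a warning sign about the data:

```python
# Basic k-fold cross-validation sketch with scikit-learn (synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # scale, then SVM
scores = cross_val_score(model, X, y, cv=5)
print(f"fold accuracies: {scores.round(3)}, mean: {scores.mean():.3f}")
```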

hard to validate predictions because it's probably a PhD's worth of work to assess whether a prediction is true

another problem with ML - it's a black box
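Black boxes can at least be probed from the outside. One option is permutation importance: shuffle one input at a time on held-out data and see how far the score drops. A sketch with scikit-learn (synthetic data, feature indices instead of real feature names):

```python
# Permutation importance: shuffle one feature at a time on held-out data and
# measure how much the score drops. Big drops = features the model relies on.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean score drop {result.importances_mean[i]:.3f}")
```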

false positives are often OK in ML predictions - better to surface interesting candidates for further study.
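If false positives are tolerable, the decision threshold is a concrete knob: lower it until a recall target is met and accept the resulting precision. A sketch with scikit-learn; the 0.95 recall target is made up:

```python
# If false positives are tolerable, lower the decision threshold to miss fewer
# true hits: pick the largest threshold that still reaches a target recall.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, probs)

idx = np.flatnonzero(recall[:-1] >= 0.95)[-1]  # 0.95 is a made-up recall target
print(f"threshold {thresholds[idx]:.2f} -> recall {recall[idx]:.2f}, "
      f"precision {precision[idx]:.2f}")
```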

there is a "minimum information for gene clusters" standard

we use SVMs, scikit-learn, TensorFlow

i tried ML for poetry generation as a first experiment

sometimes the value of ML is learning where the gaps in your data are

we really need the biochemical data, but justifying its value (and the effort to measure it) requires that same biochemical data - a catch-22

lab-to-lab variation is often the major driver of variation in the data values
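If lab-to-lab variation dominates, random cross-validation splits let lab-specific effects leak into the test folds. Holding out whole labs at a time (GroupKFold in scikit-learn) gives a more honest estimate of generalisation to an unseen lab; the lab assignments below are hypothetical:

```python
# Hold out whole labs at a time: random CV splits would leak lab-specific
# effects into the test folds and overestimate how well the model generalises.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
labs = np.repeat(np.arange(5), 60)  # hypothetical: 5 labs, 60 samples each

scores = cross_val_score(SVC(), X, y, groups=labs, cv=GroupKFold(n_splits=5))
print(f"accuracy per held-out lab: {scores.round(3)}")
```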

do you use only positive data?
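If only positive examples exist, one simple baseline is one-class classification: fit a boundary around the positives and flag anything far outside as "not like the training data" (positive-unlabeled learning is the fuller treatment). A sketch with scikit-learn on made-up numbers:

```python
# One-class SVM: learn the shape of the positive examples only, then flag
# new points as +1 (looks like a positive) or -1 (outlier). Data is made up.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
positives = rng.normal(loc=0.0, scale=1.0, size=(200, 5))  # known positives
queries = rng.normal(loc=3.0, scale=1.0, size=(20, 5))     # unseen candidates

model = OneClassSVM(nu=0.05, gamma="scale").fit(positives)
print(model.predict(queries))  # mostly -1 here, since queries sit far away
```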

there is pressure to use AI/ML even if you don't understand it