phac-nml / irida

Canada’s Integrated Rapid Infectious Disease Analysis Platform for Genomic Epidemiology
https://irida.ca
Apache License 2.0

Implement ID3 algorithm for data analysts #247

Open MichaelGerg opened 5 years ago

MichaelGerg commented 5 years ago

The current features give analysts diagrams from which they can observe and hypothesize relationships. To help analysts work more efficiently, quantifiable relationship data could be added to guide their research. The ID3 algorithm can be used to quantify how strongly combinations of factors are associated with an outcome.

As an example, an analyst may want to determine which factors are driving a species divergence in a phylogenomic visualization. By categorizing a set of species data entries (selecting an arbitrary range of data as the desired response variable), the algorithm could work out that particular values of the factors "phagetype" and "source" have a 90% correlation with the divergence of interest.

Algorithm showcase: gain is the information gained (the relationship strength) when combining factors. The algorithm is greedy, so it is not computationally exhaustive. It operates on the factors with the highest correlation, using decision-tree logic.

[Image: ID3 algorithm example with samples D1 to D14 (class counts 9 and 5), split on "outlook" with branches sunny, overcast, and rain]
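To make "gain" concrete, here is a minimal Python sketch of the two quantities ID3 relies on: the entropy of a set of class labels, and the information gain from splitting on one factor. The function names and the toy labels are assumptions for illustration only; the 9-versus-5 class split simply mirrors the counts in the image text above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, factor):
    """Reduction in label entropy after splitting the rows on one factor."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[factor], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy labels: 9 samples inside the class of interest, 5 outside.
labels = ["yes"] * 9 + ["no"] * 5
print(round(entropy(labels), 3))  # 0.94
```

A factor that separates the two classes perfectly has a gain equal to the full entropy (about 0.94 bits here), while a factor that tells us nothing has a gain of 0.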

apetkau commented 5 years ago

Thank you @MichaelGerg for the wonderful suggestion. We will take a look into it.

Do you know of any additional reading on the ID3 algorithm we could look at? Or existing places where it's used? I'm not as familiar with it.

What are the inputs to this algorithm? Is it just a tree and set of metadata for each sample?

How would you expect using this feature to work? That is, would you expect some way in the phylogenetic tree visualization to select groups of samples in a tree and ask the question "what entries in the existing metadata best correlate with the clustering defined by the tree"?

MichaelGerg commented 5 years ago

This is a simple introduction to how the algorithm works: https://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.html

This is a Python implementation. Modifications would be necessary. https://github.com/tofti/python-id3-trees

An example where it might be used would be "determine whether a particular nucleotide pair within a pre-mRNA sequence corresponds to an mRNA splice site". The algorithm would take in sequences, along with the known results for cases with those sequences, to build a tree and make a prediction for a newly entered sequence.

ID3 would be a reasonable algorithm for building a prediction model whose accuracy improves over time as it receives more and more data, which I believe is what the IRIDA system intends to do. Essentially, this is a machine learning algorithm that could later be extended, with modification, to identify various cases and relationships.

At the moment, though, a single use case on each run makes the most sense. The inputs would be just the metadata. The desired samples would need to be classified together. An input option that lets the user select the range of data entries to classify would be ideal for choosing the desired clustering. Then, as you have suggested, the metadata with the best correlation would be presented. I would expect this to happen by first building the tree through the algorithm and then scanning the entries with tailored logic.
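As a rough sketch of that build-then-predict flow, the Python below grows an ID3-style tree greedily and walks it for a new input. It is not taken from the linked implementation; the function names, the nested-dict tree layout, and the toy nucleotide data are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split(rows, labels, factor):
    """Group rows and labels by the value each row has for `factor`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[factor], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def build_tree(rows, labels, factors):
    """Return a nested-dict tree; leaves are plain class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not factors:
        return majority

    def gain(factor):
        parts = split(rows, labels, factor).values()
        return entropy(labels) - sum(len(l) / len(labels) * entropy(l) for _, l in parts)

    best = max(factors, key=gain)  # greedy: commit to the best factor at this node
    rest = [f for f in factors if f != best]
    branches = {value: build_tree(r, l, rest)
                for value, (r, l) in split(rows, labels, best).items()}
    return {"factor": best, "branches": branches, "default": majority}


def predict(tree, row):
    """Walk the tree; unseen values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row.get(tree["factor"]), tree["default"])
    return tree


# Hypothetical toy data: a nucleotide pair and whether it marks a splice site.
rows = [
    {"left": "G", "right": "T"},
    {"left": "G", "right": "T"},
    {"left": "G", "right": "C"},
    {"left": "A", "right": "T"},
    {"left": "A", "right": "G"},
]
labels = ["site", "site", "none", "none", "none"]

tree = build_tree(rows, labels, ["left", "right"])
print(predict(tree, {"left": "G", "right": "T"}))  # site
```

Because the choice at each node is greedy, the tree is cheap to build but is not guaranteed to be the smallest possible tree for the data.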
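The single-run use case could be sketched as follows: label each sample as inside or outside the user's selection, then rank every metadata field by information gain against that label. This is only an illustration under assumed names; the sample layout (`id` plus a `metadata` dict), the `rank_metadata` function, and the toy values are hypothetical and do not reflect IRIDA's actual data model or API.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def rank_metadata(samples, selected_ids):
    """Rank metadata fields by information gain for selected-vs-rest."""
    labels = [s["id"] in selected_ids for s in samples]
    fields = sorted({key for s in samples for key in s["metadata"]})
    scores = {}
    for field in fields:
        groups = {}
        for sample, label in zip(samples, labels):
            # Missing values are grouped together under None.
            groups.setdefault(sample["metadata"].get(field), []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        scores[field] = entropy(labels) - remainder
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples; the field names "phagetype" and "source" come from the
# example earlier in this thread, everything else is made up.
samples = [
    {"id": "s1", "metadata": {"phagetype": "PT4", "source": "poultry", "year": "2018"}},
    {"id": "s2", "metadata": {"phagetype": "PT4", "source": "poultry", "year": "2019"}},
    {"id": "s3", "metadata": {"phagetype": "PT4", "source": "human", "year": "2018"}},
    {"id": "s4", "metadata": {"phagetype": "PT8", "source": "human", "year": "2019"}},
    {"id": "s5", "metadata": {"phagetype": "PT8", "source": "poultry", "year": "2018"}},
]

ranking = rank_metadata(samples, selected_ids={"s1", "s2", "s3"})
for field, gain in ranking:
    print(f"{field}: {gain:.3f}")
```

In this toy data "phagetype" separates the selected samples perfectly, so it ranks first. Ranking single fields is only the first level of the tree; building the full tree would then surface combinations of fields.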

apetkau commented 5 years ago

Thanks for the excellent resources @MichaelGerg (the first link I think should be https://www.cise.ufl.edu/~ddd/cap6635/Fall-97/Short-papers/2.htm). We will keep this issue in mind.

Our current list of tasks is given in https://github.com/phac-nml/irida/projects/3. We may revisit this after we finish all those tasks.