Exporting the cluster tree structure to Artemis to support automatic feedback generation

CanArisan commented 4 years ago

Problem:

Currently, cluster assignments and distances between text blocks in a cluster are imported from Athene, to be used for automatic assessment of text exercises. We want to additionally import the hierarchy between the clusters (a tree structure) and extend the imported distances within a cluster to distances between all blocks. The goal is to increase the amount of automatic feedback we can offer by analysing the hierarchy in Artemis.

Solution:

To improve the system as mentioned, we will be adding 2 new database tables. We will extend the current computation and export of the distance matrix from Athene to include all distances. We will also export the cluster tree in JSON format. The exported tree table looks like this:

We will store the tree nodes (each line in the above table) in the database. The columns in the table can be described as:

id: Unique key of the table entry
parent: treeId of the parent node of the edge in the cluster tree
child: treeId of the child node of the edge in the cluster tree
lambda_val: Inverted distance between two ends of the edge
child_size: Number of text blocks contained in the child node
exercise_id: Foreign key of the exercise to which the tree belongs
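As a rough illustration of the columns above, the following Python sketch loads an exported tree into typed rows. The JSON layout (a list of objects keyed by column name), the names TreeNodeRow and parse_tree, and the sample values are assumptions made for this example; the actual export format is not specified in this issue.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeNodeRow:
    """One edge of the cluster tree, mirroring the columns described above."""
    parent: int        # treeId of the parent node of the edge
    child: int         # treeId of the child node of the edge
    lambda_val: float  # inverted distance between the two ends of the edge
    child_size: int    # number of text blocks contained in the child node


def parse_tree(exported: str) -> list[TreeNodeRow]:
    """Parse an export, ASSUMED to be a JSON list of objects keyed by column name."""
    return [
        TreeNodeRow(
            parent=int(entry["parent"]),
            child=int(entry["child"]),
            lambda_val=float(entry["lambda_val"]),
            child_size=int(entry["child_size"]),
        )
        for entry in json.loads(exported)
    ]


# Invented sample: root 3 contains cluster 4 (two blocks) and block 0.
SAMPLE_EXPORT = """
[
  {"parent": 3, "child": 4, "lambda_val": 0.5, "child_size": 2},
  {"parent": 3, "child": 0, "lambda_val": 0.8, "child_size": 1},
  {"parent": 4, "child": 1, "lambda_val": 1.6, "child_size": 1},
  {"parent": 4, "child": 2, "lambda_val": 1.6, "child_size": 1}
]
"""

tree_rows = parse_tree(SAMPLE_EXPORT)
```

The id and exercise_id columns are left out here because they would be assigned on the database side rather than carried in the export.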

The distance matrix is currently stored as a binary-encoded 2D array, which is not efficient. We will also store the pairwise distances in a database table that can be queried efficiently (table below). These two tables are the only new ones needed for our proposed solution.

The columns in the table above can be described as:

id: Unique key of the table entry
block_i, block_j: treeIds of the two text blocks, where block_i <= block_j always holds
distance: Distance between block_i and block_j
exercise_id: Foreign key of the exercise to which the tree belongs
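The block_i <= block_j constraint means each unordered pair is stored exactly once. A minimal Python sketch of that idea follows; the helper names, the in-memory dict standing in for the table, and the assumption that matrix index k corresponds to treeId k are all illustrative and not part of the proposal.

```python
def pair_key(block_a: int, block_b: int) -> tuple[int, int]:
    """Order two treeIds so that block_i <= block_j, as the table requires."""
    return (block_a, block_b) if block_a <= block_b else (block_b, block_a)


def matrix_to_rows(matrix: list[list[float]]) -> dict[tuple[int, int], float]:
    """Flatten a symmetric distance matrix into one entry per unordered pair.

    ASSUMPTION: row/column index k belongs to the block with treeId k.
    """
    rows = {}
    for i in range(len(matrix)):
        for j in range(i, len(matrix)):  # upper triangle, diagonal included
            rows[(i, j)] = matrix[i][j]
    return rows


def distance(rows: dict[tuple[int, int], float], block_a: int, block_b: int) -> float:
    """Look up a distance regardless of the order in which the blocks are given."""
    return rows[pair_key(block_a, block_b)]


# Invented 3x3 symmetric matrix.
sample_matrix = [
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.4],
    [0.7, 0.4, 0.0],
]
pairwise_rows = matrix_to_rows(sample_matrix)
```

For n blocks this yields n * (n + 1) / 2 rows instead of n * n matrix cells, and a lookup such as distance(pairwise_rows, 2, 0) resolves to the stored pair (0, 2).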

The class diagram of the proposed system:

database_class_diagram_v2

The classes TextExercise, TextBlock and TextCluster already exist as entities. The diagram shows only the methods and attributes that we will be adding to these classes. treeId is added to TextBlock and TextCluster to make them identifiable in the cluster tree structure. TextCluster currently has an attribute distanceMatrix, which we will extract into its own class so that it can include distances between all blocks. The attributes of TextPairwiseDistance and TextTreeNode are the same as the identically named columns of the database tables above. The lists of TextPairwiseDistances and TextTreeNodes will be stored in the TextExercise class. After the changes, the client application will still be exposed only to the entities text exercise and text block and will not need to know anything about the other entities.

Alternative solution:

We could also store the hierarchy and the distance matrix in Athene and make network calls whenever they are needed. A call is needed in two cases: the block was initially not assigned to a cluster, or the block is in a cluster but lacks feedback at that time.

Disadvantages:

This would result in a large number of network calls (61% of text blocks are unassigned in the sample dataset). Additionally, at the beginning of the assessment phase this number would be even higher, as there are not enough manual assessments yet. Exporting the data to Artemis just once at the beginning would therefore make more sense.
All data is currently stored in Artemis; nothing is stored in Athene.

jpbernius commented 4 years ago

@krusche can we get your opinion on this suggested database change?

krusche commented 4 years ago

I will look into this asap, but I won't be able to fully assess this before Friday. It looks rather complex to be honest. Is this really needed? Would it potentially make more sense to save this in an easier way? Potentially in the file system?

Can you add more rationale to the discussion? This would help to understand the consequences. I do not really understand the statement "This would however result in a network call each time a block lacks automatic feedback (which is currently a high amount)". It is also vague. When exactly is a network call needed?

CanArisan commented 4 years ago

Let me explain the rationale of the changes.

Currently, Athene behaves as follows: after computing the clusters for the text blocks of a submission, Athene returns the block ids, the probabilities of the assignment and the pairwise distance matrix between blocks for each cluster. All of these are saved in Artemis, and after the export they are discarded in Athene. The probabilities and the distance matrix are both stored as blobs in the database table text_cluster and as byte arrays in the class TextCluster.

The main reason we want to make these changes is to provide automatic feedback to blocks which were initially not assigned to a cluster, or to blocks in a cluster which lacks feedback at that time. We will do this (it is also the focus of my thesis) by utilising the tree structure of the HDBSCAN algorithm used in the clustering step. The clusters resulting from this algorithm are connected to each other in a tree hierarchy, which gives us the opportunity to traverse the tree to look for neighbouring clusters which already have feedback. This way, we will be able to offer feedback to blocks in the two situations mentioned above. That is why we have to store the tree structure somewhere.
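To make the traversal idea more concrete, here is a small Python sketch. The search strategy (climb one ancestor at a time and search that ancestor's subtree breadth-first), the function name and the sample tree are assumptions for illustration only; the actual traversal is the subject of the thesis work and may well differ.

```python
from collections import deque


def find_cluster_with_feedback(edges, block_id, clusters_with_feedback):
    """Return the treeId of a nearby tree node that already has feedback, or None.

    `edges` is a list of (parent, child) treeId pairs taken from the tree table.
    ASSUMED strategy: start at the block's parent, search its subtree
    breadth-first, then climb to the next ancestor and repeat.
    """
    parent_of = {child: parent for parent, child in edges}
    children_of = {}
    for parent, child in edges:
        children_of.setdefault(parent, []).append(child)

    visited = set()  # avoids re-searching subtrees covered at a lower level
    current = parent_of.get(block_id)
    while current is not None:
        queue = deque([current])
        while queue:
            node = queue.popleft()
            if node in visited:
                continue
            visited.add(node)
            if node in clusters_with_feedback:
                return node
            queue.extend(children_of.get(node, []))
        current = parent_of.get(current)
    return None


# Invented tree: root 6 has clusters 7 (blocks 0, 1) and 8 (blocks 2, 3) plus block 4.
sample_edges = [(6, 7), (6, 8), (7, 0), (7, 1), (8, 2), (8, 3), (6, 4)]

# Block 0 sits in cluster 7, which has no feedback; its sibling cluster 8 does.
neighbour = find_cluster_with_feedback(sample_edges, 0, {8})
```

Because the whole tree is available locally, a lookup like this needs no round trip to Athene.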

Over the past weeks, @jpbernius and I discussed how to store it and concluded that it would be most logical to store it on the Artemis side, as everything regarding the clustering results is saved there (basically nothing is saved on the Athene side). Regarding "When exactly is a network call needed?": if we store the tree in Athene, we would need to make a network call to Athene each time a block lacks automatic feedback (the two situations described in the paragraph above). For reference, in the sample EIST data that I worked on, there were 5918 text blocks, of which 3616 (about 61%) were not assigned to a cluster. This means that, to perform the traversal I mentioned, we would need to make a network call to Athene and wait for its response for most of the blocks in an exercise. Additionally, at the beginning of the assessment phase this number would be even higher, as there are not enough manual assessments yet. To avoid this network communication, it would make more sense to store both the cluster assignments and the tree in the same system (Artemis).

On how the proposed changes would affect the entities and the database in Artemis: TextCluster already has its own table in the database. As mentioned, probabilities and distance_matrix are two of its columns. probabilities currently has no use in the existing system and can be dropped entirely. distance_matrix is currently only needed within a cluster, but with our extension distances between clusters would also be needed. Therefore, storing it in its own table would make sense. Apart from this, the way the matrix is stored now is not optimal, as there is an overhead of converting it to a 2D float array. This refactoring would also solve this issue. Lastly, the ClusterTree needs to be stored. For a reference to its size, my sample data had ~7600 tree nodes for an exercise with 5918 blocks. What would be stored in the database are the TreeNode objects. Each of those can be uniquely identified by its "child" attribute, so retrieving them from the table would be easy. ClusterNode and BlockNode also do not require additional columns in the cluster tree table, as a tree node can be identified as a block node by checking "child_size == 1".
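The two properties mentioned last (lookup by "child" and the "child_size == 1" check) can be sketched in a few lines of Python. The tuple layout and the sample values are invented for this example.

```python
# (parent, child, lambda_val, child_size) tuples; values are invented.
tree_nodes = [
    (3, 4, 0.5, 2),
    (3, 0, 0.8, 1),
    (4, 1, 1.6, 1),
    (4, 2, 1.6, 1),
]

# Each row is uniquely identified by its child treeId, so index on it.
# The root (3 here) never appears as a child and therefore has no row.
node_by_child = {node[1]: node for node in tree_nodes}


def is_block_node(child_id: int) -> bool:
    """A tree node is a single text block exactly when child_size == 1."""
    return node_by_child[child_id][3] == 1


block_ids = sorted(c for c in node_by_child if is_block_node(c))
cluster_ids = sorted(c for c in node_by_child if not is_block_node(c))
```

No separate type column is needed: the block/cluster distinction falls out of child_size alone.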

I would appreciate any other feedback or suggestions on this issue and would be happy to answer any other questions. @krusche

krusche commented 4 years ago

Just discussed this with @jpbernius. We discussed the following points:

The UML class diagram is inconsistent and incorrect => Please redraw the UML class diagram and make sure it is correct and consistent.
The database table examples are incomplete. => Please provide complete examples and highlight in the initial post that only 2 new tables are necessary
Make sure to highlight that the client application is still only exposed to the entities text exercise and text block and does not need to know anything about the other entities
For the implementation, please make sure to define Cascade, OrphanRemoval and the corresponding delete operations in a way that we cannot get any foreign key exceptions or any Uninitialized Proxy Exceptions (make sure to cover those cases in server integration tests).
Make sure to move all text exercise related domain classes into a sub package
Make sure to prefix all domain Java classes with Text and all database tables with text_

CanArisan commented 4 years ago

I updated the issue and the implementation according to the points above. In the upcoming days, I will write some tests to ensure consistency in the database.

krusche commented 4 years ago

Please also update this issue as requested and make sure that you cover everything!
