Massive Label Leakage in GAM/GAM* Implementation

VijayLingam95 commented 3 years ago

Hi GAM Authors,

I have noticed a massive label leakage bug in your implementation of GAM/GAM*. Both edge_iterator and pair_iterator are using true labels instead of predicted labels. Below are more details on label leakage bug in each of these iterators.

bug in edge_iterator (defined in trainer_classification_gcn.py) Source of label leakage: elif labeling == 'lu': edges = ( data.get_edges(src_labeled=True, tgt_labeled=False) + // Adds edges where src node is labeled and tgt node is unlabeled data.get_edges(src_labeled=False, tgt_labeled=True)) // Adds edges where src node is unlabeled and tgt node in unlabeled

I have bold the line of concern.

In line: 692, LU_edges and UL_edges are concatenated. Note that unlabeled edges are added as source nodes in the bolded line. We also see this by printing the edges variable in line::710.

While iterating through edges in Line::718, true labels are assigned to unlabeled indices. I have pasted the lines below and have highlighted the line of concern.

for edge in iterator: indices_src = edge[:, 0] indices_tgt = edge[:, 1] features_src = data.get_features(indices_src) features_tgt = data.get_features(indices_tgt) labels_src = data.get_labels(indices_src) labels_tgt = data.get_labels(indices_tgt) yield (indices_src, indices_tgt, features_src, features_tgt, labels_src, labels_tgt)

data.get_edges() returns true labels for unlabeled indices, thus showing massive improvements as reported in the paper.

Post fixing this label-leakage bug (either by remove UL edges and reversing UL edges), we can only observe marginal improvements over baselines.

bug in pair_iterator:

The way pair-iterator is defined in trainer_classification_gcn.py::633 printing variables labels_src, labels_tgt in line 668, 669, we can see that true label instead of predicted labels are assigned for LU and UU pair iterators. the _select_from_pool() method invoked by pair_iterator assigns labels by using data.get_labels(indices_batch). This call returns true labeles for unlabeled indices instead of predicted labels.

otiliastr commented 3 years ago

Hi Vijay,

Thanks for discovering this issue! I have made a pull request with a fix for the edge_iterator. See pull request #82. Given this change, we need to tune again the hyperparameters, so we are reruning some experiments and will post updates on the GAM repository.

However, I am not sure I understand the issue with "pair_iterator". You are right that it returns the true labels for unlabeled nodes, but these are not used. If you check the function "_construct_feed_dict" line 610 in "trainer_classification_gcn.py", we do not use the labels of the targets of LU edges (which are the unlabeled ones). Based on this, GAM* results should not be affected.

aheydon-google commented 2 years ago

Hi, Otilia and Krishna. Can this bug be closed, now that PR #82 has been merged? Thanks.

kvis-google commented 2 years ago

+Otilia Stretcu @.***>

Hi Otilia,

Can we close this?

On Mon, Jan 3, 2022 at 8:04 PM aheydon-google @.***> wrote:

Hi, Otilia and Krishna. Can this bug be closed, now that PR #82 https://github.com/tensorflow/neural-structured-learning/pull/82 has been merged? Thanks.

— Reply to this email directly, view it on GitHub https://github.com/tensorflow/neural-structured-learning/issues/81#issuecomment-1004512944, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANBFXNXNWMY2WG5VGLY4NTDUUJWTBANCNFSM42WORAPQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were assigned.Message ID: @.***>

kvis-google commented 2 years ago

Sure. I will update the readme and push a change by the end of the week.

On Tue, Jan 11, 2022 at 10:30 AM Krishnamurthy Viswanathan @.***> wrote:

+Otilia Stretcu @.***>

Hi Otilia,

Can we close this?

On Mon, Jan 3, 2022 at 8:04 PM aheydon-google @.***> wrote:

Hi, Otilia and Krishna. Can this bug be closed, now that PR #82 https://github.com/tensorflow/neural-structured-learning/pull/82 has been merged? Thanks.

— Reply to this email directly, view it on GitHub https://github.com/tensorflow/neural-structured-learning/issues/81#issuecomment-1004512944, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANBFXNXNWMY2WG5VGLY4NTDUUJWTBANCNFSM42WORAPQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were assigned.Message ID: @.***>

tensorflow / neural-structured-learning

Massive Label Leakage in GAM/GAM* Implementation #81