tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Apache License 2.0

Investigate model output failures in SD experimental roberta multiple choice test #5943

Open tt-rkim opened 7 months ago

tt-rkim commented 7 months ago

https://github.com/tenstorrent-metal/tt-metal/actions/runs/8131617987/job/22221127254

Output

2024-03-03T18:42:34.9174562Z models/experimental/roberta/tests/test_roberta_for_multiple_choice.py::test_roberta_for_multiple_choice                   Metal | INFO     | Initializing device 0
2024-03-03T18:42:34.9176350Z                  Device | INFO     | Opening user mode device driver
2024-03-03T18:42:34.9949834Z 2024-03-03 18:42:34.994 | INFO     | SiliconDriver   - Detected 1 PCI device : {0}
2024-03-03T18:42:35.0127914Z 2024-03-03 18:42:35.012 | WARNING  | SiliconDriver   - init_detect_tt_device_numanodes(): Could not determine NumaNodeSet for TT device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2024-03-03T18:42:35.0130331Z 2024-03-03 18:42:35.012 | WARNING  | SiliconDriver   - Could not find NumaNodeSet for TT Device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2024-03-03T18:42:35.0143258Z 2024-03-03 18:42:35.014 | WARNING  | SiliconDriver   - bind_area_memory_nodeset(): Unable to determine TT Device to NumaNode mapping for physical_device_id: 0. Skipping membind.
2024-03-03T18:42:35.0145893Z ---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2024-03-03T18:42:35.1393052Z                   Metal | INFO     | AI CLK for device 0 is:   1202 MHz
2024-03-03T18:42:36.1006741Z Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
2024-03-03T18:42:36.1009189Z You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2024-03-03T18:42:39.1004944Z 2024-03-03 18:42:39.099 | INFO     | models.experimental.roberta.tests.test_roberta_for_multiple_choice:test_roberta_for_multiple_choice:51 - Running torch model...
2024-03-03T18:42:39.1712990Z 2024-03-03 18:42:39.170 | INFO     | models.experimental.roberta.tests.test_roberta_for_multiple_choice:test_roberta_for_multiple_choice:62 - Running tt model ...
2024-03-03T18:42:55.1174731Z 2024-03-03 18:42:55.116 | INFO     | models.experimental.roberta.tests.test_roberta_for_multiple_choice:test_roberta_for_multiple_choice:74 - Torch Predicted 0
2024-03-03T18:42:55.1176520Z 2024-03-03 18:42:55.117 | INFO     | models.experimental.roberta.tests.test_roberta_for_multiple_choice:test_roberta_for_multiple_choice:76 - Tt Predicted 1
2024-03-03T18:42:55.1196722Z 2024-03-03 18:42:55.119 | INFO     | models.experimental.roberta.tests.test_roberta_for_multiple_choice:test_roberta_for_multiple_choice:82 - Max ATOL Delta: 0.010220680385828018, Max RTOL Delta: 0.288716584444046
2024-03-03T18:42:55.1198585Z 2024-03-03 18:42:55.119 | INFO     | models.experimental.roberta.tests.test_roberta_for_multiple_choice:test_roberta_for_multiple_choice:83 - PCC: -1.0
2024-03-03T18:42:55.1200320Z 2024-03-03 18:42:55.119 | WARNING  | models.experimental.roberta.tests.test_roberta_for_multiple_choice:test_roberta_for_multiple_choice:88 - RobertaForMultipleChoice Failed!
2024-03-03T18:42:55.2065063Z torch.Size([1, 2, 35])
2024-03-03T18:42:55.2065476Z Shape([1, 1, 2, 35])
2024-03-03T18:42:55.2065922Z tensor([[-0.0451, -0.0456]])
2024-03-03T18:42:55.2068674Z Tensor([ [[[-0.036377, -0.0354004]]]], dtype=bfloat16 )
2024-03-03T18:42:55.2069108Z 
2024-03-03T18:42:55.2069831Z FAILED                  Metal | INFO     | Closing device 0
2024-03-03T18:42:55.2390798Z                      Op | INFO     | Program Cache: disabled and cleared.
2024-03-03T18:42:55.2410535Z 
2024-03-03T18:42:55.2410701Z 
2024-03-03T18:42:55.2410914Z =================================== FAILURES ===================================
2024-03-03T18:42:55.2411559Z _______________________ test_roberta_for_multiple_choice _______________________
2024-03-03T18:42:55.2411984Z 
2024-03-03T18:42:55.2415014Z device = <tt_lib.device.Device object at 0x7f37e9d9a4f0>
2024-03-03T18:42:55.2415556Z 
2024-03-03T18:42:55.2415833Z     def test_roberta_for_multiple_choice(device):
2024-03-03T18:42:55.2416352Z         """
2024-03-03T18:42:55.2417201Z         RoBERTa for multiple choice is loading roberta-base pre-trained model,
2024-03-03T18:42:55.2418202Z         because there are no official weights for RobertaForMultipleChoice
2024-03-03T18:42:55.2418905Z         """
2024-03-03T18:42:55.2419245Z         torch.manual_seed(1234)
2024-03-03T18:42:55.2419734Z         base_address = ""
2024-03-03T18:42:55.2420122Z     
2024-03-03T18:42:55.2420434Z         with torch.no_grad():
2024-03-03T18:42:55.2421286Z             tokenizer = AutoTokenizer.from_pretrained("roberta-base")
2024-03-03T18:42:55.2422444Z             model = RobertaForMultipleChoice.from_pretrained("roberta-base")
2024-03-03T18:42:55.2422993Z             model.eval()
2024-03-03T18:42:55.2423292Z     
2024-03-03T18:42:55.2423868Z             prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
2024-03-03T18:42:55.2431152Z             choice0 = "It is eaten with a fork and a knife."
2024-03-03T18:42:55.2432633Z             choice1 = "It is eaten while held in the hand."
2024-03-03T18:42:55.2433068Z     
2024-03-03T18:42:55.2433758Z             encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
2024-03-03T18:42:55.2434393Z     
2024-03-03T18:42:55.2434637Z             # Tt roberta
2024-03-03T18:42:55.2435230Z             tt_model = TtRobertaForMultipleChoice(
2024-03-03T18:42:55.2435688Z                 config=model.config,
2024-03-03T18:42:55.2436082Z                 base_address=base_address,
2024-03-03T18:42:55.2436463Z                 device=device,
2024-03-03T18:42:55.2436830Z                 state_dict=model.state_dict(),
2024-03-03T18:42:55.2437239Z                 reference_model=model,
2024-03-03T18:42:55.2437597Z             )
2024-03-03T18:42:55.2437875Z             tt_model.eval()
2024-03-03T18:42:55.2438195Z     
2024-03-03T18:42:55.2438441Z             # Run torch model
2024-03-03T18:42:55.2438837Z             logger.info("Running torch model...")
2024-03-03T18:42:55.2439459Z             torch_outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()})
2024-03-03T18:42:55.2440141Z             torch_predicted_class = torch_outputs.logits.argmax().item()
2024-03-03T18:42:55.2440626Z     
2024-03-03T18:42:55.2440871Z             # Run tt model
2024-03-03T18:42:55.2441324Z             inputs_dict = {k: v.unsqueeze(0) for k, v in encoding.items()}
2024-03-03T18:42:55.2441891Z             print(inputs_dict["attention_mask"].shape)
2024-03-03T18:42:55.2442538Z             inputs_dict["attention_mask"] = torch.unsqueeze(inputs_dict["attention_mask"], 0)
2024-03-03T18:42:55.2443347Z             inputs_dict["attention_mask"] = torch2tt_tensor(inputs_dict["attention_mask"], device)
2024-03-03T18:42:55.2444018Z             print(inputs_dict["attention_mask"].shape())
2024-03-03T18:42:55.2444429Z     
2024-03-03T18:42:55.2444709Z             logger.info("Running tt model ...")
2024-03-03T18:42:55.2445157Z             tt_output = tt_model(**inputs_dict)
2024-03-03T18:42:55.2445655Z             tt_output_to_torch = tt2torch_tensor(tt_output.logits)
2024-03-03T18:42:55.2446194Z             tt_output_to_torch = tt_output_to_torch.squeeze(0)
2024-03-03T18:42:55.2446719Z             tt_output_to_torch = tt_output_to_torch.squeeze(0)
2024-03-03T18:42:55.2447269Z             tt_predicted_class = tt_output_to_torch.argmax().item()
2024-03-03T18:42:55.2447719Z     
2024-03-03T18:42:55.2447981Z             print(torch_outputs.logits)
2024-03-03T18:42:55.2448375Z             print(tt_output.logits)
2024-03-03T18:42:55.2448749Z             # Compare outputs
2024-03-03T18:42:55.2449058Z     
2024-03-03T18:42:55.2449303Z             # Torch output
2024-03-03T18:42:55.2449723Z             logger.info(f"Torch Predicted {torch_predicted_class}")
2024-03-03T18:42:55.2450182Z     
2024-03-03T18:42:55.2450544Z             logger.info(f"Tt Predicted {tt_predicted_class}")
2024-03-03T18:42:55.2450991Z     
2024-03-03T18:42:55.2451455Z             does_pass, pcc_message = comp_pcc(torch_outputs.logits, tt_output_to_torch, 0.98)
2024-03-03T18:42:55.2452022Z     
2024-03-03T18:42:55.2452514Z             # Temporarily change passing codition to allclose until layernorm accuracy is updated
2024-03-03T18:42:55.2453412Z             does_pass, allclose_message = comp_allclose(torch_outputs.logits, tt_output_to_torch, 0, 0.0081)
2024-03-03T18:42:55.2454088Z             logger.info(allclose_message)
2024-03-03T18:42:55.2454482Z             logger.info(pcc_message)
2024-03-03T18:42:55.2454827Z     
2024-03-03T18:42:55.2455066Z             if does_pass:
2024-03-03T18:42:55.2455467Z                 logger.info("RobertaForMultipleChoice Passed!")
2024-03-03T18:42:55.2455911Z             else:
2024-03-03T18:42:55.2456292Z                 logger.warning("RobertaForMultipleChoice Failed!")
2024-03-03T18:42:55.2456740Z     
2024-03-03T18:42:55.2456983Z >           assert does_pass
2024-03-03T18:42:55.2457307Z E           assert False
2024-03-03T18:42:55.2457608Z 
2024-03-03T18:42:55.2458005Z models/experimental/roberta/tests/test_roberta_for_multiple_choice.py:90: AssertionError
2024-03-03T18:42:55.2458742Z =========================== short test summary info ============================
2024-03-03T18:42:55.2459972Z FAILED models/experimental/roberta/tests/test_roberta_for_multiple_choice.py::test_roberta_for_multiple_choice - assert False
2024-03-03T18:42:55.2460934Z ====================== 1 failed, 19 deselected in 21.57s =======================

PCC is -1.0, but this is because it's a binary output: with only two logits, PCC can only be +1 or -1. What should we do here? @boris-drazic
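
To illustrate the point about PCC on a two-element output, here is a minimal sketch in plain Python. The `pearson` helper is hypothetical and is not the repository's `comp_pcc`. The logits are the rounded values printed in the log above, and `made_up` is an invented vector used only to show the opposite case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return cov / norm


# Rounded values as printed in the CI log.
torch_logits = [-0.0451, -0.0456]
tt_logits = [-0.036377, -0.0354004]

# Invented values, unrelated in magnitude but ordered like torch_logits.
made_up = [10.0, -3.0]

pcc_flipped = pearson(torch_logits, tt_logits)  # ordering differs
pcc_same = pearson(torch_logits, made_up)       # ordering matches

print(f"torch vs tt:      {pcc_flipped:+.6f}")
print(f"torch vs made_up: {pcc_same:+.6f}")
```

With two points the correlation only encodes whether the ordering of the two logits agrees, so it comes out at about -1 for the failing run and about +1 for any vector with the same ordering, however far apart the actual values are. That makes PCC a poor pass criterion for this test.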

boris-drazic commented 7 months ago

This test starts failing at commit 9d198cf. The cause appears to be the update to the PyTorch version introduced in this commit.

If, at this commit, we change torch==2.2.1+cpu to torch==1.13.1+cpu in pyproject.toml and requirements-dev.txt (and change torchvision back to torchvision==0.14.1+cpu in requirements-dev.txt to match the torch version), the test passes.

With the updated PyTorch version, the output of the TT model changes from Tensor([ [[[-0.0371094, -0.0390625]]]], dtype=bfloat16 ) to Tensor([ [[[-0.036377, -0.0354004]]]], dtype=bfloat16 ), so the model selects the wrong answer and the test fails.
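
As a rough check of the numbers quoted here, the sketch below recomputes the predicted class and the maximum absolute difference for both TT outputs. It assumes the last argument to `comp_allclose` in the test (0.0081) is the absolute tolerance, and it uses the rounded reference logits printed in the log, so the deltas are approximate. The helper names are hypothetical.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


# Assumption: 0.0081 in the test is the absolute tolerance (rtol is 0).
ATOL = 0.0081

torch_logits = [-0.0451, -0.0456]   # rounded reference logits from the log
tt_old = [-0.0371094, -0.0390625]   # TT output with torch==1.13.1+cpu
tt_new = [-0.036377, -0.0354004]    # TT output with torch==2.2.1+cpu

old_delta = max_abs_diff(torch_logits, tt_old)
new_delta = max_abs_diff(torch_logits, tt_new)

for name, tt, delta in (("old", tt_old, old_delta), ("new", tt_new, new_delta)):
    verdict = "pass" if delta <= ATOL else "fail"
    print(f"{name}: predicted class {argmax(tt)}, max abs diff {delta:.6f} -> {verdict}")
```

Under these assumptions the old output sits just inside the tolerance and the new one just outside it, which suggests the test was passing with very little headroom even before the PyTorch update.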

boris-drazic commented 7 months ago

The divergence in produced values between PyTorch versions starts in models/experimental/roberta/tt/roberta_intermediate.py at these lines:

torch_hidden_states = tt2torch_tensor(hidden_states)
torch_hidden_states = torch.nn.functional.gelu(torch_hidden_states)
hidden_states = torch2tt_tensor(torch_hidden_states, self.device)

where we do a TT->Torch tensor conversion, torch.nn.functional.gelu, and a Torch->TT tensor conversion.

tt-rkim commented 7 months ago

Hmm, any particular lines stand out? I would assume we should check the expected value of the tensor after tt2torch_tensor?

Unless it's difficult to see what the value should be there. I'm assuming torch.nn.functional.gelu isn't wrong, so I'm wondering if the conversion is going wrong. What kind of tensor are you using?

boris-drazic commented 7 months ago

The TT tensor is bfloat16. Yes, I am assuming this is an issue with the conversion of floats between TT and Torch. It is less likely, but not impossible, that the implementation of Torch gelu changed between versions. Both need to be checked.
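
One way to look at the two suspects separately, without a device, is a pure-Python reference. This is only a sketch: `to_bfloat16`, `bf16_ulp`, and `gelu_erf` are hypothetical helpers, not tt-metal or PyTorch functions. It assumes round-to-nearest-even when narrowing to bfloat16 (the actual behaviour of `torch2tt_tensor` is not confirmed here) and uses the erf form of GELU, which is the documented default of `torch.nn.functional.gelu`.

```python
import math
import struct


def to_bfloat16(x):
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_ulp(x):
    """Spacing between adjacent bfloat16 values near a nonzero, normal x."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - 7)


def gelu_erf(x):
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Suspect 1: conversion. Widening bfloat16 to float32 is exact, so a value
# that is already bfloat16 should survive a round trip unchanged.
sample = to_bfloat16(-0.036377)
round_trip_ok = to_bfloat16(sample) == sample

# Suspect 2: gelu. Values computed here can be compared against whatever
# each PyTorch version returns for the same bfloat16-representable inputs.
reference = [(x, to_bfloat16(gelu_erf(x))) for x in (-2.0, -0.5, 0.25, 1.0)]

# Context: the reference logits differ by about 0.0005, only a couple of
# bfloat16 steps at this magnitude.
margin_in_ulps = abs(-0.0451 - (-0.0456)) / bf16_ulp(-0.0451)

print(round_trip_ok, reference, round(margin_in_ulps, 2))
```

If the round trip is bit-exact and the gelu outputs match this reference under both PyTorch versions, the divergence would have to come from somewhere else in the model. The small margin between the two logits also means that a one-step bfloat16 difference anywhere upstream could be enough to flip the predicted class.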

tt-rkim commented 7 months ago

Can someone actively work on this, and should we skip this test?

boris-drazic commented 7 months ago

I think the best option is to skip the test for now. We have plenty of other roberta tests that cover the base model, as well as tests for applications other than multiple choice. This test was written a long time ago and has a bunch of fallback ops and conversions between TT and Torch. I have tried running gelu on device, thus avoiding the conversions mentioned in the previous comment, but the test still fails at the end with the new version of PyTorch. Debugging this test and getting it to pass will take a while.

tt-rkim commented 7 months ago

Should we delete this test?

boris-drazic commented 7 months ago

Yes, I will remove it