Closed nv-jinhosuh closed 1 year ago
Hi, the main reason, as far as I recall, was to make the test behave like Server mode: the BERT inputs were preprocessed and partitioned at a certain size, so sending the full input and partitioning it at the server produced different inputs/outputs. If we have the same issue with GPT-J, then I think it makes sense. The issue at the time was https://github.com/mlcommons/inference/issues/1291. As far as I know from that issue, Gavin is https://github.com/G4V
Thanks Lior! Yes, I see the GPT-J tensors are preprocessed similarly. Would you help bring this up in the WG? I guess we may need to touch up the policy doc (https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#benchmarks-and-qsl-preprocessing)?
Yes, but isn't https://github.com/mlcommons/inference_policies/pull/275 already covering it?
Oh, I didn't know about that policy update. :) Looks like it already covers this! Thank you @liorkhe!
Closing this.
Hi,
In v3.0, for BERT, we allowed the LoadGen node to send only the InputTokenIDs tensors and let the SUT reconstruct the SegmentIDs/AttentionMask tensors.
Do we allow the same thing for GPT-J in v3.1? I can see that the same approach could be used.
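For context, here is a minimal sketch of the kind of reconstruction meant above: rebuilding the AttentionMask and SegmentIDs for one padded BERT-style sequence from the token IDs alone. The `pad_id=0` and `sep_id=102` defaults assume a bert-base-uncased vocabulary; this is an illustration, not the actual MLPerf SUT code.

```python
def reconstruct_aux_tensors(input_ids, pad_id=0, sep_id=102):
    """Rebuild attention_mask and segment_ids for one padded token-ID sequence.

    Assumes bert-base-uncased conventions ([PAD]=0, [SEP]=102); purely
    illustrative, not taken from the MLPerf inference repo.
    """
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [0 if t == pad_id else 1 for t in input_ids]
    # Segment IDs: 0 up to and including the first [SEP], 1 for the
    # second segment (e.g. the question/context split in SQuAD inputs),
    # and 0 again on padding positions.
    sep_pos = input_ids.index(sep_id) if sep_id in input_ids else len(input_ids)
    segment_ids = [1 if (i > sep_pos and m == 1) else 0
                   for i, m in enumerate(attention_mask)]
    return attention_mask, segment_ids

# Example: [CLS] tok tok [SEP] tok [SEP] [PAD] [PAD]
mask, segs = reconstruct_aux_tensors([101, 7, 8, 102, 9, 102, 0, 0])
```

Since both auxiliary tensors are fully determined by the token IDs, sending only InputTokenIDs loses no information; the SUT can regenerate the rest on its side.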
@liorkhe FYI; also, Lior, do you happen to know the GitHub handle of Gavin from KRAI?