getting low results on MNLI

I have tried to use the bert-small-mnli model hosted on HF, and seems I have a couple of problems:

The tokenizer doesn't seem to have a mxlen provided, so no padding happens AFAICT. Is that correct?
There's a difference in the results I get from the HF API and when I locally run the inference code. Which or all of these should I have in the batch? input_ids, token_type_ids and attention_mask? If I just use input_ids, the results seem to match with the HF API (there's still a small difference, but I am tokenizing a bit differently), but otherwise, there's a huge difference.
In any case, the HF API (and my local inference code) results are surprisingly low: for a sample of 100 validation_matched instances, I get an acc score of 21%. Do you have any insight on this? I am using hf datasets to download mnli data, so don't know if that is responsible for something. I saw something to that effect in the README, but quite not sure how should I change the labels in the dataset, if needed.

prajjwal1 / generalize_lm_nli