lujiaying / MUG-Bench

Data and code of the Findings of EMNLP'23 paper MuG: A Multimodal Classification Benchmark on Game Data with Tabular, Textual, and Visual Fields
https://aclanthology.org/2023.findings-emnlp.354/

mm_predictor.extract_embedding() returning single "text_image" feature #9

Closed: nashapir closed this issue 9 months ago

nashapir commented 9 months ago

Hello,

I am trying to run the MuGNet portion of the run_baselines.sh script and I'm running into an error inside the generate_text_image_feature_by_pretrained_CLIP() function, specifically with the output of the CLIP model, clip_embeddings. From the code, it appears that clip_embeddings is expected to contain two separate feature outputs, 'text' and 'image'. However, when I run the code I see a single feature under the key 'image_text'. I stepped through the autogluon code that is invoked here and confirmed that the input is two discrete columns named 'text' and 'image', and that the output of inference, 'outputs', contains only a single feature called 'image_text'. I'm using the same version of autogluon (0.5.2), so I'm not sure why the behavior would differ.
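
To make the mismatch concrete, here is a minimal sketch of what I am running (the inputs are illustrative; the real call sits inside generate_text_image_feature_by_pretrained_CLIP()):

from autogluon.multimodal import MultiModalPredictor

# illustrative inputs with the same structure used in exec.py
image_paths = ['./datasets/Pokemon_PrimaryType/test_images/altaria.jpeg']
text_cols_raw = ['name: altaria']

mm_predictor = MultiModalPredictor(hyperparameters={"model.names": ["clip"]}, problem_type="zero_shot")
clip_embeddings = mm_predictor.extract_embedding({'image': image_paths, 'text': text_cols_raw})

# expected by the baseline code: two keys, 'text' and 'image'
# what I actually see: a single key, 'image_text'
print(clip_embeddings.keys())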

Thanks so much!

lujiaying commented 9 months ago

Hi nashapir,

Can you please take a look at the versions of the autogluon-related packages?

https://github.com/lujiaying/MUG-Bench/blob/d1796b1d09ca4dcde8b50a7cdce04448b80b7a93/conda_env.yml#L43-L52C10

    - autogluon==0.5.2
    - autogluon-common==0.5.2
    - autogluon-contrib-nlp==0.0.1b20220208
    - autogluon-core==0.5.2
    - autogluon-features==0.5.2
    - autogluon-multimodal==0.5.2
    - autogluon-tabular==0.5.2
    - autogluon-text==0.5.2
    - autogluon-timeseries==0.5.2
    - autogluon-vision==0.5.2

Currently I don't encounter similar errors on my end. It would also help if you could provide a screenshot of the error, so I can reproduce the issue you observed.
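
For example, something along these lines (run inside the MuG_env environment) should print the installed versions; this is just one quick way to check:

import autogluon.core
import autogluon.multimodal

# both should report 0.5.2 to match conda_env.yml
print('autogluon.core:      ', autogluon.core.__version__)
print('autogluon.multimodal:', autogluon.multimodal.__version__)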

Best, Jiaying

nashapir commented 9 months ago

Hi Jiaying,

Thank you for responding so quickly. I confirmed that all of my autogluon-related packages are at the correct versions. I can show you the values before and after the mm_predictor.extract_embedding() call on line 149, here: https://github.com/lujiaying/MUG-Bench/blob/master/baselines/MuGNet/exec.py#L149.

[Screenshot: Screen Shot 2023-11-28 at 1 47 30 AM]

Before: Here you can see that the input is two lists of strings, one of image paths (image_paths) and another of aggregated text (text_cols_raw).

[Screenshot: Screen Shot 2023-11-28 at 1 41 04 AM]

After: Here you can see the output, which is a dictionary with a single key, 'image_text', and a single embedding.

[Screenshot: Screen Shot 2023-11-28 at 1 45 21 AM]

Thanks for taking the time. Best, Nathan

lujiaying commented 9 months ago

Below is a proof-of-concept example; could you try it out to see whether you get similar output?

In [1]: from autogluon.multimodal import MultiModalPredictor

In [2]: img_paths = ['./datasets/Pokemon_PrimaryType/test_images/altaria.jpeg', './datasets/Pokemon_PrimaryType/test_images/bayleef.jpeg']

In [3]: text_cols_raw = ["name: altaria", "name: bayleef"]

In [4]: mm_predictor = MultiModalPredictor(hyperparameters={"model.names": ["clip"]}, problem_type="zero_shot")

In [5]: clip_embeddings = mm_predictor.extract_embedding({"image": img_paths, 'text': text_cols_raw})
Predicting DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.49s/it]

In [6]: print(clip_embeddings)
{'image': array([[ 0.04921521,  0.01919144,  0.02189953, ...,  0.00350772,
         0.01533658,  0.01905328],
       [ 0.04044999,  0.02161232,  0.02332607, ..., -0.00211839,
         0.01493412,  0.00631776]], dtype=float32), 
 'text': array([[ 0.03126603,  0.04315126, -0.02046786, ...,  0.01667867,
         0.00745415,  0.02018833],
       [ 0.01461247,  0.03724204, -0.01419779, ..., -0.00341122,
         0.01787065,  0.03907847]], dtype=float32)}

In [7]: text_feats = clip_embeddings['text']

In [8]: print(text_feats.shape)
(2, 768)

Also worth noting: I am using Python==3.9.12.

nashapir commented 9 months ago

Confirmed that I am also using Python==3.9.12.

Below you can see that I have a different output from the same code. Very strange...

[Screenshot: Screen Shot 2023-11-28 at 6 42 00 PM]

Inside the autogluon package, does your definition of extract_embedding() on line 1610 of autogluon/multimodal/predictor.py look the same?

def extract_embedding(
        self,
        data: Union[pd.DataFrame, dict, list],
        return_masks: Optional[bool] = False,
        as_tensor: Optional[bool] = False,
        as_pandas: Optional[bool] = False,
    ):
        """
        Extract features for each sample, i.e., one row in the provided dataframe `data`.

        Parameters
        ----------
        data
            The data to extract embeddings for. Should contain same column names as training dataset and
            follow same format (except for the `label` column).
        return_masks
            If true, returns a mask dictionary, whose keys are the same as those in the features dictionary.
            If a sample has empty input in feature column `image_0`, the sample will has mask 0 under key `image_0`.
        as_tensor
            Whether to return a Pytorch tensor.
        as_pandas
            Whether to return the output as a pandas DataFrame (True) or numpy array (False).

        Returns
        -------
        Array of embeddings, corresponding to each row in the given data.
        It will have shape (#samples, D) where the embedding dimension D is determined
        by the neural network's architecture.
        """
        turn_on_off_feature_column_info(
            data_processors=self._data_processors,
            flag=True,
        )    
        outputs = self._predict(
            data=data,
            requires_label=False,
        )    
        if self._problem_type in [ZERO_SHOT]:
            features = extract_from_output(outputs=outputs, ret_type=COLUMN_FEATURES, as_ndarray=as_tensor is False)
            if return_masks:
                masks = extract_from_output(outputs=outputs, ret_type=MASKS, as_ndarray=as_tensor is False)
        else:
            features = extract_from_output(outputs=outputs, ret_type=FEATURES, as_ndarray=as_tensor is False)

        if as_pandas:
            features = pd.DataFrame(features, index=data.index)
            if return_masks:
                masks = pd.DataFrame(masks, index=data.index)
        import pdb; pdb.set_trace()
        if return_masks:
            return features, masks
        else:
            return features

nashapir commented 9 months ago

Also, within that extract_embedding() function, I have the following values:

data is {'image': ['./datasets/Pokemon_PrimaryType/test_images/altaria.jpeg', './datasets/Pokemon_PrimaryType/test_images/bayleef.jpeg'], 'text': ['name: altaria', 'name: bayleef']}

outputs is [{'column_features': {'features': {'image_text': tensor([[ 0.0908, 0.0558, 0.0108, ..., -0.0341, 0.0165, -0.0064], [ 0.0162, 0.0580, 0.0228, ..., -0.0211, 0.0186, 0.0129]])}, 'masks': {'image_text': tensor([1., 1.], dtype=torch.float16)}}}]

It seems that the fields have already been combined as early as the inference call on line 1643.

lujiaying commented 9 months ago

I somehow remember that the CLIP model used by autogluon.multimodal depends on some third-party library. In that case, I would suggest re-installing the whole virtual environment (refer to https://github.com/lujiaying/MUG-Bench#prerequisites):

conda env create -n MuG_env --file conda_env.yml

Regarding the reference document you pasted from AutoGluon: AG has been updated continuously and is now at 0.8+.

My implementation specifically targets the versions stated in conda_env.yml; some third-party APIs may have changed since then, as it has been a while.

lujiaying commented 9 months ago

FYI, AG-MM 0.5.2 document: https://auto.gluon.ai/0.5.2/tutorials/multimodal/clip_embedding.html

nashapir commented 9 months ago

Just an update: I believe the differences we're seeing come from the version of pytorch-lightning. If you install autogluon==0.5.2 it will use pytorch-lightning==1.6.5, which is from July of last year. That package, in turn, requires pytorch<=1.12.1, which is not available for CUDA versions past 11.6. Unfortunately, NVIDIA doesn't offer support for CUDA 11.6 on Ubuntu 22.04.

All of this is just to say that I'm not sure whether the version of pytorch-lightning that generated the results you're showing is available on the new Ubuntu distribution (which is what I'm trying to run this on). Not necessarily a bug, just a dependency note to be aware of.
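
For anyone hitting the same wall, a quick way to see which pytorch-lightning / torch / CUDA combination actually ended up installed (purely diagnostic):

import pytorch_lightning
import torch

print(pytorch_lightning.__version__)  # autogluon==0.5.2 pulls in 1.6.5
print(torch.__version__)              # pytorch-lightning 1.6.5 expects torch<=1.12.1
print(torch.version.cuda)             # torch 1.12.1 wheels stop at CUDA 11.6
print(torch.cuda.is_available())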

Thanks again for your help.

lujiaying commented 9 months ago

Thanks for the explanation. If image_text returns a list of embeddings with length 2, a quick workaround could be to use one of them as the image embedding and the other as the text embedding. Hope this is somehow helpful.

nashapir commented 9 months ago

Unfortunately, it's returning just a single embedding. Does it work to separate the two processes like this, or would that create fundamentally different behavior?

mm_predictor = MultiModalPredictor(hyperparameters={"model.names": ["clip"]}, problem_type="zero_shot")

text_clip_embeddings = mm_predictor.extract_embedding({'text': text_cols_raw})
image_clip_embeddings = mm_predictor.extract_embedding({"image": img_paths})

image_feats = image_clip_embeddings['image']  # assuming the output key matches the input column name, as in your example above
text_feats = text_clip_embeddings['text']

lujiaying commented 9 months ago

I think your implementation works.

nashapir commented 9 months ago

Just following up on this: I was able to reproduce the paper's accuracy and log loss using MuGNet on the Pokemon Primary Type dataset, so I'm assuming this alternative implementation, with separate feature extraction, is legitimate.

Thanks again.