Cannot use 'Pedestrian' as text input to image_demo.py

jamesheatonrdm commented 7 months ago

I am trying to pass text inputs to image_demo.py. For some reason, when I use 'pedestrian' as a text input, it is not recognised as a noun phrase.

Example:

python demo/image_demo.py image.jpg configs/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_cap4m.py --weights groundingdino_swint_ogc_mmdet-822d7e9d.pth --texts 'bench . pedestrian .'

The word 'pedestrian' does not appear to be recognized: the value of noun_phrases is ['bench'], and the output image does not contain a detection for the pedestrian.

However, when I change the word from 'pedestrian' to 'person', everything works fine. The value of noun_phrases is ['bench', 'person'] and the person is detected in the image.

What could be causing this? Any help would be appreciated

jamesheatonrdm commented 7 months ago

What's even stranger is that I can apparently put almost anything I want, even nonsense words, into the text input, and everything is recognised except the word 'pedestrian'.

If I run the same command with --texts 'bench . pedestrian . gloop . vroom . wordwhichhasnomeaning .', the value of noun_phrases is ['bench', 'gloop', 'vroom', 'wordwhichhasnomeaning']

jamesheatonrdm commented 7 months ago

Stranger still, using 'pedestrian' as the only text input works fine; the problem only appears when it is combined with other inputs. For example, --texts 'pedestrian .' gives noun_phrases of ['pedestrian'], --texts 'pedestrian . car .' gives ['car'], and --texts 'bus . car .' gives ['bus', 'car'].

vakker commented 7 months ago

This seems to be due to how the phrases are tagged and filtered. Take a look around here.

For the caption 'bench . car . person . bicycle . pedestrian . asdasdasdas .', the pos_tag output is:

[('bench', 'NN'),
 ('.', '.'),
 ('car', 'NN'),
 ('.', '.'),
 ('person', 'NN'),
 ('.', '.'),
 ('bicycle', 'NN'),
 ('.', '.'),
 ('pedestrian', 'JJ'),
 ('.', '.'),
 ('asdasdasdas', 'NNS'),
 ('.', '.')]

I.e. 'pedestrian' is tagged as an adjective (JJ), not as a noun. This means that when the code reaches the subtree iteration, 'pedestrian' is filtered out:

for subtree in result.subtrees():
    print(subtree)

# prints:
(S
  (NP bench/NN)
  ./.
  (NP car/NN)
  ./.
  (NP person/NN)
  ./.
  (NP bicycle/NN)
  ./.
  pedestrian/JJ
  ./.
  (NP asdasdasdas/NNS)
  ./.)
(NP bench/NN)
(NP car/NN)
(NP person/NN)
(NP bicycle/NN)
(NP asdasdasdas/NNS)
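
To make the filtering easier to see without installing nltk, here is a minimal sketch that takes the tags quoted above and keeps only noun-tagged tokens. The helper keep_noun_tokens is hypothetical and is a simplified stand-in for the NP chunking step, not the code mmdet actually runs; it only happens to agree with the real output for this caption because every entity is a single word.

```python
# Tagged tokens copied from the pos_tag output quoted in this thread.
tagged = [
    ("bench", "NN"), (".", "."),
    ("car", "NN"), (".", "."),
    ("person", "NN"), (".", "."),
    ("bicycle", "NN"), (".", "."),
    ("pedestrian", "JJ"), (".", "."),
    ("asdasdasdas", "NNS"), (".", "."),
]


def keep_noun_tokens(tagged_tokens):
    """Hypothetical simplification of NP chunking: keep tokens tagged NN*."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


noun_phrases = keep_noun_tokens(tagged)
print(noun_phrases)
# prints: ['bench', 'car', 'person', 'bicycle', 'asdasdasdas']
```

Since 'pedestrian' carries the JJ tag and is separated from the next noun by a period, nothing noun-like picks it up, which matches the subtree output above.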

I'm no expert in linguistics, so I'm not sure whether this is an nltk issue or a problem with how mmdet uses nltk. In any case, "pedestrian" should be a noun, see here.

My workaround is to use 'person' instead, which seems to work.
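
If renaming the class is not an option, another idea is to skip POS tagging altogether and split the caption on the ' . ' separators yourself. The sketch below only shows that idea: split_entities is a hypothetical helper and is not part of mmdet, and wiring it into the demo would need changes that are not shown here. It may also be worth checking whether your mmdet version already offers a way to pass entity names directly.

```python
def split_entities(caption):
    """Hypothetical helper: split a 'a . b . c .' caption into entity names.

    No POS tagging is involved, so words such as 'pedestrian' are kept
    regardless of how a tagger would label them.
    """
    return [part.strip() for part in caption.split(".") if part.strip()]


entities = split_entities("bench . pedestrian . gloop .")
print(entities)
# prints: ['bench', 'pedestrian', 'gloop']
```

The trade-off is that this treats everything between two periods as one entity, so it only makes sense when the caption is a plain list of class names rather than a free-form sentence.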

open-mmlab / mmdetection

Cannot use 'Pedestrian' as text input to image_demo.py #11485