rafiibnsultan / GeoSAM

Fine-tuning SAM with Multi-Modal Prompts for Mobility Infrastructure Segmentation
https://sites.google.com/view/mlpa/mainpage
MIT License

Thank u! #2

Open AKAxiaobai-123 opened 7 months ago

AKAxiaobai-123 commented 7 months ago

Hello author, I read your article with great interest, as I am also a researcher in this field. I would like to know how the h5 file was generated. Could you please share some insights on this? Thank you for your assistance.

rafiibnsultan commented 7 months ago

Hello,

Thank you for showing interest.

The process for creating an h5 file is straightforward. Begin by selecting a domain-specific encoder from a segmentation network that you intend to use as an auxiliary model, and run it over your chosen dataset. Rather than preserving the output images, store the feature embeddings in an h5 file, typically the outputs of the layer just before the last one. For an illustrative example, you might refer to the Tile2Net folder; we've modified the original source code there to capture the embeddings. However, the approach is applicable to any domain-specific encoder.
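For a concrete picture, here is a minimal sketch (not our exact code) of capturing the penultimate-layer output with a PyTorch forward hook and writing it to an h5 file; the loader format and the layer handle are assumptions:

```python
# Hedged sketch: capture embeddings from a segmentation encoder's
# penultimate layer and store one dataset per image in an .h5 file.
import h5py
import torch

def save_embeddings(model, penultimate_layer, dataloader, out_path, device="cuda"):
    captured = {}

    def hook(module, inputs, output):
        # Keep the feature embedding instead of the final segmentation output.
        captured["emb"] = output.detach().cpu()

    handle = penultimate_layer.register_forward_hook(hook)
    model.eval().to(device)

    with h5py.File(out_path, "w") as f, torch.no_grad():
        # Assumes the loader yields (image_id, image_tensor) with batch size 1.
        for image_id, image in dataloader:
            model(image.to(device))  # forward pass fires the hook
            f.create_dataset(str(image_id), data=captured["emb"].numpy())

    handle.remove()
```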

AKAxiaobai-123 commented 7 months ago

Thank you for your prompt response. I went through the source code you modified in Tile2Net, and I only saw that you modified the network's ocrnet; is that correct? However, I noticed that this encoder is not used anywhere. How exactly do you use it? Do you take the encoder out separately, load the weights, and perform forward inference?

I'm really looking forward to your reply, as this is very important for my experiment.

rafiibnsultan commented 6 months ago

Sorry for the late reply. You are right: when we run Tile2Net on our training dataset, we save each image's embeddings to disk (the process you see in ocrnet). Later, we use those embeddings.
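And the "later, we use the embeddings" part, sketched under the same assumed file layout as above (the file name is illustrative):

```python
# Hedged sketch: read the stored embeddings back at training time.
import h5py
import torch

with h5py.File("tile2net_embeddings.h5", "r") as f:  # assumed file name
    for image_id in f.keys():
        emb = torch.from_numpy(f[image_id][()])  # e.g. a (1, C, H, W) feature map
        # ...feed `emb` into the pipeline as the auxiliary encoder's output
```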

AKAxiaobai-123 commented 6 months ago

Hello, I am very pleased to see your response. I have carefully studied it using my own road dataset. However, I found that during inference my Intersection over Union (IoU) value is quite low. After a thorough examination of your code, I discovered that the mask values are generated through Tile2Net. Could you please explain the specific process of how they are generated? I am eagerly looking forward to your reply. Thanks!

rafiibnsultan commented 6 months ago

Hi, you can check Tile2Net's official repository. With some tweaking of their code you can save their semantic segmentation results (I believe we have also uploaded our tweaked code in this repository). One other thing: you can use any other pre-trained network. For example, you can train a UNet on your own dataset to do this same thing as well, because the prompts, which are just a guide for SAM, don't have to be fully accurate.
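To make the "prompts just need to guide SAM" point concrete, here is an illustrative sketch of turning a coarse mask from any pre-trained network into point prompts; this is one common way to prompt SAM, not necessarily GeoSAM's exact scheme:

```python
# Hedged sketch: sample foreground points from a coarse binary mask
# (from Tile2Net, a UNet, etc.) to use as point prompts for SAM.
import numpy as np

def mask_to_point_prompts(coarse_mask, n_points=5, seed=0):
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(coarse_mask)  # foreground pixel coordinates
    if len(xs) == 0:
        return np.empty((0, 2)), np.empty((0,))
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)  # SAM expects (x, y) order
    labels = np.ones(len(idx))                     # 1 = foreground point
    return points, labels
```

With the official `SamPredictor`, these would go into `predictor.predict(point_coords=points, point_labels=labels)`.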

Hope this helps!

AKAxiaobai-123 commented 6 months ago

Okay, thank you for your answer. I will give it a try and let you know if I get good results.

AKAxiaobai-123 commented 5 months ago

I have trained according to your code, but the results are not satisfactory, and I do not know why. The intersection over union (IoU) I obtain with the mask_decoder I trained is only 0.14, whereas with the decoder you trained it is 0.41. Below are the result images obtained with your weights. Could you please take a look and see if there are any issues? Additionally, I have another question: are the images in the masks folder that you trained with obtained through automated segmentation by SAM? Looking forward to your reply, thank you!

[attached images: bj100001_sat_input, bj100001_mask, bj100001_sat, output_image_0]

rafiibnsultan commented 5 months ago

Hi, sorry for the late reply. Let me reopen the issue so I don't miss it next time. You might not get the best results with the decoder we trained if it doesn't match the classes you are working on. You might further fine-tune it on your own dataset.
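For reference, a rough sketch of what decoder-only fine-tuning could look like (checkpoint name, loss, and loop details are assumptions, not our exact script):

```python
# Hedged sketch: freeze SAM's encoders and fine-tune only the mask decoder.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # assumed path
for p in sam.image_encoder.parameters():
    p.requires_grad = False
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(sam.mask_decoder.parameters(), lr=1e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()
# ...in the training loop: embed each image once, run the mask decoder on the
# image embedding plus prompt embeddings, and compare its logits to your masks.
```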

I didn't get the second question, what masks?

AKAxiaobai-123 commented 4 months ago

Thanks! I am not sure how the data in your dataset's masks folder is described or how those masks are generated. Thank you once again for your answer. I am also curious whether having multiple classes, such as seven, would result in less effective training outcomes.

rafiibnsultan commented 4 months ago

We have uploaded a demo; check here: demo.ipynb. Also, the demo folder in "GeoSAM_with_text" has some examples of the dataset.

We also had multiple classes (3, to be exact). If you want 7, you just have to change the one-hot encoding accordingly.
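As a small illustration of that change (variable names and mask shape are assumptions):

```python
# Hedged sketch: switch the one-hot encoding from 3 classes to 7.
import torch
import torch.nn.functional as F

num_classes = 7  # was 3 in our setting
mask = torch.randint(0, num_classes, (512, 512))     # H x W class-index mask
one_hot = F.one_hot(mask, num_classes=num_classes)   # H x W x 7
one_hot = one_hot.permute(2, 0, 1).float()           # 7 x H x W for training
```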