niki-amini-naieni / CounTX

Includes FSC-147-D and the code for training and testing the CounTX model from the paper Open-world Text-specified Object Counting.
MIT License

Additional Files #2

Closed jwahnn closed 2 months ago

jwahnn commented 4 months ago

So far, I followed the instructions in setting up the environment and completing the preparation steps.

Now, I am trying to run inference. I assume I have to change --img_dir, --FSC147_anno_file, and --data_split_file, but I am having trouble understanding what --FSC147_anno_file and --data_split_file refer to and where I can get them.

Are these files that happen to be generated during training? I wanted to use the existing baseline model for inference rather than training.

python test.py --data_split "val" --output_dir "./test" --resume "./results/checkpoint-1000.pth" --img_dir "/scratch/local/hdd/nikian/images_384_VarV2" --FSC147_anno_file "/scratch/local/hdd/nikian/annotation_FSC147_384.json" --FSC147_D_anno_file "./FSC-147-D.json" --data_split_file "/scratch/local/hdd/nikian/Train_Test_Val_FSC_147.json"

niki-amini-naieni commented 4 months ago

Hi, thank you for your comment! These files are provided by the dataset. You can find them inside the zip folder linked in the CounTX README.md. I have pasted the link here for convenience: link to data. When you download and extract the zip folder, you will see the relevant files. Please see the image pasted below for reference. Note that their file names match the ones in the command you have included in your message. Please feel free to ask more questions as they come up, and I will do my best to help in any way I can. image

jwahnn commented 4 months ago

I see. In that case, what if I want to test how well the model counts on a different set of images, such as my own dataset? Do I have to create these files manually? What do I need to prepare?

jwahnn commented 4 months ago

Also, even after extracting, I don't see those files. Can you upload them on the repo by chance?

niki-amini-naieni commented 4 months ago

Here is a link to the zip folder (containing all the files above) that I used: FSC-147 Link. I think it would be easiest to manually create the necessary dataset files. I will reply with clearer instructions on how to do so soon.

niki-amini-naieni commented 4 months ago

@jwahnn , could you please email me directly at niki.amini-naieni@eng.ox.ac.uk about this if you want further help? Let me know if you were able to download the necessary files, and then we can go from there.

jwahnn commented 4 months ago

Hi, I was waiting for the clearer instructions that you mentioned earlier. I will send a list of questions that I have via email if that is the preferred method. Let me know :)

niki-amini-naieni commented 4 months ago

RE: "I was waiting for the clearer instructions that you mentioned earlier:" Here are some initial instructions. Please feel free to ask more questions if they are not clear.

This is the command to get the results from the paper for the test set:

python test_reproduce_paper.py --data_split "test" --output_dir "./test" --resume "paper-model.pth" 
--img_dir "/scratch/local/hdd/nikian/images_384_VarV2" --FSC147_anno_file "/scratch/local/hdd/nikian/annotation_FSC147_384.json" 
--FSC147_D_anno_file "./FSC-147-D.json" --data_split_file "/scratch/local/hdd/nikian/Train_Test_Val_FSC_147.json"

The easiest method to deploy the model on your own dataset is the following:

  1. Create a folder with all the images you want to test inside of it. Preprocess all the images to have a height of 384 and a width >= 384. The resizing is so you do not need to modify the density map averaging technique inherited from CounTR.
  2. Create a JSON file with dot annotations following this format:
    {
    "1050.jpg": {
        "points": [
            [
                247.53459230769232,
                212.71076056338032
            ],
            [
                247.14496153846153,
                194.39414084507044
            ],
            [
                229.89658461538463,
                66.2174647887324
            ],
            ...
        ]
    },
    ...
    }

    where each entry is a dictionary with the image name as the key (e.g., "1050.jpg") and another dictionary containing the dot annotations (the dictionary with entry "points" in the example above) as the value. Please see the file annotation_FSC147_384.json as an example. Note that annotation_FSC147_384.json contains more than just the points for each image. If you do not have dot annotations, you could just supply the final counts directly. I can provide a more detailed explanation on how to do this if you want. It would involve replacing this line of the code with the count. This might be relevant if you just want to use the model for inference and only have final counts for the images you are testing.

  3. Create a JSON file with the text description for each image. In the paper, we use a natural language response to the query

    what object should be counted?

But a simple class name will also do. The JSON file should have the following format:

{
    "2.jpg": {
        "data_split": "test",
        "text_description": "the sea shells"
    },
    "3.jpg": {
        "data_split": "test",
        "text_description": "the hot air balloons"
    },
...
}

where each entry is a dictionary with the image name as the key (e.g., "2.jpg," "3.jpg") and another dictionary containing the data split and text description as the value. Please see the file FSC-147-D.json as an example.

  4. Create a JSON file with the data splits (e.g., train, test, val) following this format:
    {
    "test": [
        "2.jpg",
        "3.jpg",
        "4.jpg",
        "5.jpg",
        "6.jpg",
        ...
    ],
    "val": [
        "190.jpg",
        "191.jpg",
        "192.jpg",
        ...
    ],
    "train": [
        "7.jpg",
        "9.jpg",
        "19.jpg",
        ...
    ]
    }

    where the keys are the data splits and the values are the lists of image names in the corresponding data split.

  5. Modify the previous command with the new file names (one way to generate the files from steps 2-4 is sketched after this list):
    python test_reproduce_paper.py --data_split "test" --output_dir "./test" --resume "paper-model.pth" 
    --img_dir <response to 1> --FSC147_anno_file <response to 2>
    --FSC147_D_anno_file <response to 3> --data_split_file <response to 4>

    One caveat: I have not tested running inference on datasets other than the ones in the paper, so while I think these instructions are generally correct, you might need to do some further work with the code to get the inference to run without errors on your dataset. The inference code should, however, run smoothly and exactly reproduce the results of the paper for FSC-147.
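
For reference, here is a minimal sketch of one way to generate the three JSON files from steps 2-4 for a test-only custom dataset. The folder name, example points, text description, and output file names below are placeholders, not files from this repository:

import json
import os

img_dir = "./my_images"  # folder of preprocessed images from step 1 (placeholder name)

# Placeholder dot annotations ([x, y] pairs) and text descriptions for each image.
points = {
    "0001.jpg": [[100.0, 50.0], [200.5, 80.2]],
    "0002.jpg": [[42.0, 77.0]],
}
descriptions = {
    "0001.jpg": "the cardboard boxes",
    "0002.jpg": "the cardboard boxes",
}

image_names = sorted(f for f in os.listdir(img_dir) if f.lower().endswith((".jpg", ".jpeg", ".png")))

# Step 2: dot-annotation file (same "points" structure as annotation_FSC147_384.json).
with open("my_anno_file.json", "w") as f:
    json.dump({name: {"points": points[name]} for name in image_names}, f, indent=4)

# Step 3: text-description file (same structure as FSC-147-D.json).
with open("my_descriptions.json", "w") as f:
    json.dump(
        {name: {"data_split": "test", "text_description": descriptions[name]} for name in image_names},
        f,
        indent=4,
    )

# Step 4: data-split file; here every image goes into the test split.
with open("my_data_split.json", "w") as f:
    json.dump({"train": [], "val": [], "test": image_names}, f, indent=4)

These three files would then be passed as --FSC147_anno_file, --FSC147_D_anno_file, and --data_split_file in step 5. If you only have final counts rather than dot annotations, note that at test time the points are mainly used to obtain the ground-truth count, which is why the alternative mentioned in step 2 is to substitute the count directly at that line of the code.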

RE: "I will send a list of questions that I have via email if that is the preferred method. Let me know :)": I think it would be a good idea to include your questions on this forum for documentation purposes. However, I would be careful not to disclose confidential information (such as paper ideas) on the public GitHub repository for everyone to see. If that is a concern, feel free to email me directly. Also, thank you for your interest in our work!

jwahnn commented 4 months ago

This is amazing! Thanks for sharing all this. I really appreciate it. I will check over the next few days whether this runs smoothly. In the meantime, I have three more questions:

  1. Would it be fine for me to run 'test.py' instead of 'test_reproduce_paper.py' like you did?
  2. Is there a way to get the images like in the README page to know which parts of the images the model is detecting?
  3. If we are trying to count objects with holes in the middle (e.g., donuts), where should we place the dot annotations?
niki-amini-naieni commented 4 months ago

No problem! Yes, please let me know any issues as they come up, and I will try to help.

RE: "Would it be fine for me to run 'test.py' instead of 'test_reproduce_paper.py' like you did?": Yes, it would be fine. I recall that the reason I included both files is that I was refactoring the class definition of the model for clarity after submitting the paper. I just tried using test.py instead of test_reproduce_paper.py, and I got the same results.

RE: "Is there a way to get the images like in the README page to know which parts of the images the model is detecting?": Yes, just overlay density_map / 60 from this line on top of the image.

RE: "If we are trying to count objects with holes in the middle (e.g., donuts), where should we place the dot annotations?": Place the dots roughly in the center of the object (i.e., the average image position of all pixels belonging to the object, even if that position is not on the object itself). For example, here is an image of an annotation from FSC-147 of donuts. 3765
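
For reference, here is a minimal sketch of one way to do the overlay inside test.py once density_map has been assembled. The image path and colormap below are placeholders, and it assumes density_map is the (H, W) tensor computed at that point, so treat this as a starting point rather than code from the repository:

import matplotlib.pyplot as plt
from PIL import Image

# density_map: the (H, W) torch tensor assembled by the sliding-window loop in test.py.
density = (density_map / 60).detach().cpu().numpy()

# Load the corresponding preprocessed image (placeholder path) and match its size to the density map.
image = Image.open("processed_img/example.jpg").convert("RGB")
image = image.resize((density.shape[1], density.shape[0]))

plt.figure(figsize=(8, 5))
plt.imshow(image)
# Normalize with vmin/vmax: the per-pixel density values are much smaller than 1,
# so converting the raw tensor straight to an image tends to come out almost black.
plt.imshow(density, cmap="jet", alpha=0.5, vmin=0.0, vmax=max(float(density.max()), 1e-6))
plt.axis("off")
plt.title(f"predicted count: {density.sum():.1f}")
plt.savefig("overlay_example.png", bbox_inches="tight")
plt.close()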

jwahnn commented 4 months ago

overlayed8 Is this what it is supposed to look like? It doesn't seem right to me...

Below is the code that I entered between pred_cnt = torch.sum(density_map / 60).item() and gt_cnt = gt_dots.shape[1]:

###
tens = density_map / 60
transform = T.ToPILImage()
dens = transform(tens)
dens = dens.convert("RGBA")
print("Density: ", type(dens))

dir = os.listdir("processed_img")
dir.sort()
print(dir)
background = Image.open("processed_img/" + dir[count])
background = background.convert("RGBA")
print("Image: ", type(background))

new_img = Image.blend(background, dens, 0.5)
new_img.save("overlayed" + str(count) + ".png")
count += 1
###

EDIT: I assume this is a problem with my dot annotations. In this case, do you mind guiding me on how to supply the ground truth count directly instead of dot annotations?

niki-amini-naieni commented 4 months ago

Hi, I will respond in a bit. I have to work on ECCV submissions. Stay tuned. Also, thanks for sharing the code. It really helps with diagnosing the issue!

changcongxun commented 4 months ago

RE: "I was waiting for the clearer instructions that you mentioned earlier:" Here are some initial instructions. Please feel free to ask more questions if they are not clear.

This is the command to get the results from the paper for the test set:

python test_reproduce_paper.py --data_split "test" --output_dir "./test" --resume "paper-model.pth" 
--img_dir "/scratch/local/hdd/nikian/images_384_VarV2" --FSC147_anno_file "/scratch/local/hdd/nikian/annotation_FSC147_384.json" 
--FSC147_D_anno_file "./FSC-147-D.json" --data_split_file "/scratch/local/hdd/nikian/Train_Test_Val_FSC_147.json"
  • --data_split refers to the data split of FSC-147 that you would like to evaluate. The options are "val" and "test".
  • --output_dir refers to the name of the directory where the log file for the test run will be saved.
  • --resume refers to the file name of the model checkpoint that you would like to test. To get the results from the paper, download the model checkpoint named "paper-model.pth" from this link, which is also available on the main README.md of this repository.
  • --img_dir refers to the name of the folder containing the images in FSC-147. It is available in this zip folder.
  • --FSC147_anno_file refers to the JSON file in this zip folder that mainly contains the dot annotations for the images in FSC-147 (i.e., the center coordinates of the instances of the object to count in each image).
  • --FSC147_D_anno_file refers to the JSON file in this repository available here that contains the input text descriptions for all the images in FSC-147.
  • --data_split_file refers to the JSON file with the data splits in FSC-147 (e.g., train, test, val). It is also available in this zip folder.


Hi, I followed the instructions above to test on my own dataset, but encountered the following error. May I ask what the reason might be? image

niki-amini-naieni commented 4 months ago

What is the size of your input image? It seems that the error is that your image is too large. Did you follow the preprocessing step in the code? Do you have some code for me to look at?

changcongxun commented 4 months ago

What is the size of your input image? It seems that the error is that your image is too large. Did you follow the preprocessing step in the code? Do you have some code for me to look at?

Thanks for your reply! I resized my images according to step 1, and the code runs normally now!

from PIL import Image, ImageOps
import os

# Define the input and output folder paths.
input_folder = "./images"
output_folder = "./images_384"

# Make sure the output folder exists.
os.makedirs(output_folder, exist_ok=True)

# Iterate over all image files in the input folder.
for filename in os.listdir(input_folder):
    if filename.endswith((".jpg", ".jpeg", ".png")):
        input_path = os.path.join(input_folder, filename)
        output_path = os.path.join(output_folder, filename)

        # Open the image and rescale it to a height of 384.
        image = Image.open(input_path)
        width, height = image.size
        new_height = 384
        new_width = int(width * (new_height / height))

        # If the new width is less than 384, pad first, then resize.
        if new_width < 384:
            pad_width = (384 - new_width) // 2
            padding = (pad_width, 0, 384 - new_width - pad_width, 0)
            resized_image = ImageOps.expand(image, padding).resize((384, new_height), Image.LANCZOS)
        else:
            resized_image = image.resize((new_width, new_height), Image.LANCZOS)

        # Save the resized image.
        resized_image.save(output_path)

print("All images processed and saved successfully.")

niki-amini-naieni commented 4 months ago

That is good. Do you have your main file that you are running the network on and the example image that gives you the error? I will run it on my end.

changcongxun commented 4 months ago

That is good. Do you have your main file that you are running the network on and the example image that gives you the error? I will run it on my end.

Indeed, the input size of the images was incorrect. After resizing them again, the code now runs normally. I shared my code in the comment above.

niki-amini-naieni commented 4 months ago

Awesome! Let me know if you have more questions.

niki-amini-naieni commented 4 months ago

RE: "EDIT: I assume this is a problem with my dot annotations. In this case, do you mind guiding me on how to supply the ground truth count directly instead of dot annotations?"

Yes, absolutely! I will respond soon...

changcongxun commented 4 months ago

Awesome! Let me know if you have more questions.

Thank you very much. I would like to apply CounTX to my own dataset (only 220 samples, with only one counting category per image). Currently, I have only used the inference interface test_reproduce_paper.py, with a Test MAE of 4.37 and a Test RMSE of 7.89 for 154 test images. If I retrain CounTX, is it possible to achieve better results? I hope the MAE can be close to 0, because I want an accurate object count for each image. If you have any other suggestions to improve the counting accuracy, I would greatly appreciate them! The dataset images are shown below: 230727T210045S042_000120923412_color_3

niki-amini-naieni commented 4 months ago

Do all the images look similar to the one you have shown? Do they all have the same text description?

changcongxun commented 4 months ago

Do all the images look similar to the one you have shown? Do they all have the same text description?

Yes, they are all images of stacked cardboard boxes on shelves in a logistics warehouse, and they all have the same text description: "text_description": "the box". image

niki-amini-naieni commented 3 months ago

I think the easiest approach to try first would be to add your images to FSC-147, and then use the provided training code to train on the mixed dataset composed of the FSC-147 images and the boxes images. "the boxes" is already a text description in FSC-147-D (see the image below).

image

To do this, just add your images to the folder containing the FSC-147 images and then modify the FSC-147 dataset files accordingly (see above directions for how to deploy the model on your own dataset for reference).

If that does not provide better performance, you could try finetuning the pretrained CounTX model on the new dataset following the same procedure that was used in the paper for training on only FSC-147. I have other ideas, but I think the above is probably enough to get started.

niki-amini-naieni commented 3 months ago

Another test to try is to change the text description to "the boxes" instead of "the box."

changcongxun commented 3 months ago

I think the easiest approach to try first would be to add your images to FSC-147, and then use the provided training code to train on the mixed dataset composed of the FSC-147 images and the boxes images. "the boxes" is already a text description in FSC-147-D (see the image below).

image

To do this, just add your images to the folder containing the FSC-147 images and then modify the FSC-147 dataset files accordingly (see above directions for how to deploy the model on your own dataset for reference).

If that does not provide better performance, you could try finetuning the pretrained CounTX model on the new dataset following the same procedure that was used in the paper for training on only FSC-147. I have other ideas, but I think the above is probably enough to get started.

Thanks for your detailed reply. I really appreciate it. Prior to this, I had already started training from scratch on my own dataset (155 training images, 44 validation images, and 22 test images, with text_description "the box"), and all parameters remained unchanged except for the number of epochs. The results are as follows. image

My next steps will follow your suggestions, and I will share my results later.

niki-amini-naieni commented 3 months ago

Great, I look forward to seeing your updates!

changcongxun commented 3 months ago

Great, I look forward to seeing your updates!

When reading your paper, I saw the parts marked in yellow below. image image

I would like to confirm: I should add my own training, validation, and test data to the corresponding training, validation, and test splits of FSC-147. Is this correct?

niki-amini-naieni commented 3 months ago

Yes, you could try this to see if it changes performance, but what you have already done may be better. When you add your data, I would use only either "the box" or "the boxes" instead of both (i.e., either change the text description for your dataset to "the boxes" or just change the existing text description in FSC-147-D to "the box"). Again, what you have already done may be better since now the model is specialized in counting boxes. How many images total was your model trained on?

changcongxun commented 3 months ago

Yes, you could try this to see if it changes performance, but what you have already done may be better. When you add your data, I would use only either "the box" or "the boxes" instead of both (i.e., either change the text description for your dataset to "the boxes" or just change the existing text description in FSC-147-D to "the box"). Again, what you have already done may be better since now the model is specialized in counting boxes. How many images total was your model trained on?

I have changed the text description for my dataset to "the boxes". The total number of images for the model I trained before was 220, with 155, 44, and 22 images for the training, validation, and testing sets, respectively.

niki-amini-naieni commented 3 months ago

FSC-147 has 6135 images, so it would take a lot more time to train. Are you using early stopping (i.e., checking the performance on the validation set after each epoch and keeping the model that achieves the lowest errors)? Are you using any data augmentation to artificially increase the size of the training set? Is there any chance you could increase the number of images used for training (maybe even by checking whether images in FSC-147 with the label "the boxes" could be added to your training set)? You could still try adding your images to FSC-147 and training on the mixed dataset (this worked well in the case of CARPK), but it might not help much.

niki-amini-naieni commented 3 months ago

Also, if you examine the images the model fails on (since your dataset is small, this should be easier), you might get some insights into how to improve the performance.

niki-amini-naieni commented 3 months ago

I found a dataset with images of cardboard boxes: https://app.roboflow.com/ds/ZCwOYJLruw?key=SjV4l9bmj5. If these boxes look like the ones you have in your dataset, you can convert the bounding box labels to dot annotations and train on a much larger dataset of boxes.

niki-amini-naieni commented 3 months ago

Finally, since your dataset is small, it might be better to keep the CounTX image and text encoders frozen and just finetune the feature interaction model (smaller set of parameters) on your specific dataset.
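
A minimal sketch of one way to do this in train.py, assuming the encoder parameters can be identified by name substrings (the substrings below are assumptions; check model.named_parameters() for the actual names used in models_counting_network.py):

import torch

def freeze_encoders(model: torch.nn.Module, frozen_substrings=("img_encoder", "text_encoder", "clip")):
    # Freeze any parameter whose name suggests it belongs to the CLIP image or text encoder.
    for name, param in model.named_parameters():
        if any(s in name for s in frozen_substrings):
            param.requires_grad_(False)
    # Return only the still-trainable parameters (the feature interaction module),
    # so they can be passed to the optimizer in place of the full parameter list.
    return [p for p in model.parameters() if p.requires_grad]

In train.py, this would be called right after the model is constructed, with the returned list handed to the optimizer.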

changcongxun commented 3 months ago

Finally, since your dataset is small, it might be better to keep the CounTX image and text encoders frozen and just finetune the feature interaction model (smaller set of parameters) on your specific dataset.

I appreciate your support very much. Firstly, I want to keep the CounTX image and text encoders frozen and just finetune the feature interaction model on my own dataset. The parameter counts are shown in the following figure. image

To freeze these layers, I only added two lines of code in train.py, as shown below. image

After completing these modifications, I will vary the number of epochs to observe model performance while keeping the remaining parameters unchanged. I will share my results later.

changcongxun commented 3 months ago

I didn't realize this before: this line in train.py should be changed from blank to "./paper-model.pth". image

changcongxun commented 3 months ago

Hi, I'm back with my experimental results.

image It seems that fine-tuning is not the best option.

The following graphs show how the train MAE and val MAE vary with the number of epochs for the three fine-tuning scenarios in the table above.

image image image Given the above, do you think the experimental results are reasonable? Is there anything I can do to improve my tuning? Or is increasing the number of images the only option?

niki-amini-naieni commented 3 months ago

At this point, I would just train from scratch on more data. When you train from scratch, do you freeze the text encoder and finetune the image encoder as was done in the paper? For reference, you can transform any detection dataset for boxes to one compatible with CounTX by taking the box centers as the dot annotations. The same logic is true for instance segmentation datasets.
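
As a concrete illustration, here is a minimal sketch that converts per-image bounding boxes in [x, y, width, height] format into the "points" annotation format described earlier; the box format and file names are assumptions about your detection dataset rather than anything from this repository:

import json

# Placeholder input: per-image bounding boxes in [x, y, width, height] format.
boxes_per_image = {
    "0001.jpg": [[10.0, 20.0, 50.0, 40.0], [120.0, 35.0, 48.0, 42.0]],
    "0002.jpg": [[5.0, 5.0, 30.0, 30.0]],
}

annotations = {}
for name, boxes in boxes_per_image.items():
    # The dot annotation for each object is the center of its bounding box.
    # Check annotation_FSC147_384.json to confirm the expected coordinate order before training.
    annotations[name] = {"points": [[x + w / 2.0, y + h / 2.0] for (x, y, w, h) in boxes]}

with open("my_anno_file.json", "w") as f:
    json.dump(annotations, f, indent=4)

For an instance segmentation dataset, the same idea applies with the mask centroid (the average position of the object's pixels) in place of the box center.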

changcongxun commented 3 months ago

At this point, I would just train from scratch on more data. When you train from scratch, do you freeze the text encoder and finetune the image encoder as was done in the paper? For reference, you can transform any detection dataset for boxes to one compatible with CounTX by taking the box centers as the dot annotations. The same logic is true for instance segmentation datasets.

Yes, when I train from scratch, I freeze the text encoder and fine-tune the image encoder as in the paper.

niki-amini-naieni commented 3 months ago

Yes, so at this point, I would just increase the size of the training set and train from scratch. Have you tried training on a mix of FSC-147 and your dataset (asking out of curiosity)?

changcongxun commented 3 months ago

Yes, so at this point, I would just increase the size of the training set and train from scratch. Have you tried training on a mix of FSC-147 and your dataset (asking out of curiosity)?

Thank you. I will expand the number of images in the dataset and look forward to better performance.

changcongxun commented 3 months ago

Yes, so at this point, I would just increase the size of the training set and train from scratch. Have you tried training on a mix of FSC-147 and your dataset (asking out of curiosity)?

Due to the large dataset and limited GPU computing power, it will take about a week to obtain the results. I will share the results at that time.

changcongxun commented 3 months ago

Hi, I'm back again; the experimental results are in.

image

Based on my experimental results, the following conclusions can be drawn:

  1. The performance of the mixed dataset (FSC-147 + custom dataset) is lower than that of the custom dataset trained from scratch.
  2. Increasing the number of training samples yields the best performance (box-6500images + custom dataset), but when tested only on my custom dataset, the performance is lower than that of all the other models.
  3. Perhaps better performance can only be achieved by adding samples that closely resemble the custom dataset, rather than box-6500images.

If you have any other suggestions that may improve the performance of the model, I would greatly appreciate it!

niki-amini-naieni commented 3 months ago

Hi, thanks for these results. To clarify, are the MAE values of 2.14 and 2.29 on the joint dataset or just your custom dataset? Those seem like pretty low error values. What is the average number of boxes per image in the dataset that produced those errors? You could also try data augmentation to artificially grow the size of your existing training set, but I am not sure how much that would help. Are you using any data augmentation in your current training pipeline? For example, this paper is about a data augmentation method that improves the performance of CounTR (the model CounTX is based on) on object counting. Are you using early stopping? Early stopping significantly improved CounTX's results. However, the results that you have might be the best that you can do with the existing method.

niki-amini-naieni commented 3 months ago

Hi @jwahnn, are you still having issues?

changcongxun commented 3 months ago

Hi, thanks for these results. To clarify, are the MAE values of 2.14 and 2.29 on the joint dataset or just your custom dataset? Those seem like pretty low error values. What is the average number of boxes per image in the dataset that produced those errors? You could also try data augmentation to artificially grow the size of your existing training set, but I am not sure how much that would help. Are you using any data augmentation in your current training pipeline? For example, this paper is about a data augmentation method that improves the performance of CounTR (the model CounTX is based on) on object counting. Are you using early stopping? Early stopping significantly improved CounTX's results. However, the results that you have might be the best that you can do with the existing method.

  1. The MAE values of 2.14 and 2.29 are on the joint dataset (box-6500images + custom dataset: trained on the joint dataset and tested on the joint dataset), and the MAE values of 6.10 and 5.34 are on my custom dataset (trained on the joint dataset and tested on my custom dataset).
  2. At present, I have not calculated the average number of boxes per image. It should be close to the average number of boxes per image on the box-6500images dataset, as this dataset has 3969 images, while my custom dataset only has 220 images.
  3. I did not perform any data augmentation or early stopping myself. When using the train.py you provided for training, I made no code changes other than modifying the parameters related to the dataset. Is there a data augmentation section or early stopping included in the train.py file? Could you give me some guidance on how to add data augmentation and early stopping to the original code?
niki-amini-naieni commented 3 months ago

If you use the original train.py file, it already includes data augmentation and early stopping. I think increasing the number of samples from your specific dataset would be the best option. What do you see as the differences between your dataset and the new dataset?

changcongxun commented 3 months ago

If you use the original train.py file, it already includes data augmentation and early stopping. I think increasing the number of samples from your specific dataset would be the best option. What do you see as the differences between your dataset and the new dataset?

Thank you. My specific dataset is difficult to expand due to practical constraints, and intuitively I feel there is not much difference between the two datasets. I have another idea: I want to pretrain on an extended box dataset and then finetune on my specific dataset. Do you think that is a promising approach? I can collect more box images than I have now.

niki-amini-naieni commented 3 months ago

Yes, this is a good idea to try. I will attach a file here (it is not super neat and not ready for posting to the main GitHub) that shows how I trained on CARPK. I got the best performance not when finetuning on CARPK after training on FSC-147, but when jointly training on both CARPK and FSC-147. I used a sampling procedure that controlled the fraction of training batches made up of samples from CARPK. You could use a similar approach, for example drawing roughly 60% of the training batches from your specific dataset and 40% from the other boxes dataset (or some other split). Again, we are just trying ideas at this point. I am not sure if these approaches will improve performance significantly.

import argparse
import datetime
import json
import numpy as np
import os
import time
import random
from pathlib import Path
import math
import sys
from PIL import Image

from typing import List, Union, Iterable, Iterator

from scipy.stats import bernoulli
import scipy.ndimage as ndimage

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
from torch.utils.data import Dataset, ConcatDataset, Sampler
import torchvision

from torchvision.transforms import Normalize, Compose, Resize
from torchvision.transforms.functional import InterpolationMode
import torchvision.transforms.functional as TF

import timm
import hub

# Check the correct version of [timm] is installed.
assert timm.__version__ == "0.3.2"
import timm.optim.optim_factory as optim_factory

import util.misc as misc
from util.misc import NativeScalerWithGradNormCount as NativeScaler
import util.lr_sched as lr_sched
from util.FSC147 import TransformTrain, TTensor, clip_tokenizer
import models_counting_network

import open_clip

import wandb

def get_args_parser():
    parser = argparse.ArgumentParser("Training Class-Agnostic Counting Network")

    parser.add_argument(
        "--batch_size",
        default=8,
        type=int,
    )

    parser.add_argument("--epochs", default=1000, type=int)

    parser.add_argument(
        "--model",
        default="main_counting_network",
        type=str,
        help="name of model to train",
    )

    parser.add_argument("--weight_decay", type=float, default=0.05)

    parser.add_argument(
        "--lr",
        type=float,
        default=None,
        help="learning rate (absolute lr)",
    )

    parser.add_argument(
        "--blr",
        type=float,
        default=2e-4,
        help="base learning rate: absolute_lr = base_lr * batch_size / 256",
    )

    parser.add_argument(
        "--min_lr",
        type=float,
        default=0.0,
        help="lower lr bound for cyclic schedulers that hit 0",
    )

    parser.add_argument(
        "--warmup_epochs", type=int, default=10, help="epochs to warmup lr"
    )

    parser.add_argument(
        "--output_dir",
        default="./task-3-results",
        help="path where to save model and log",
    )

    parser.add_argument("--device", default="cuda", help="device to use for training")

    parser.add_argument("--seed", default=0, type=int)

    parser.add_argument(
        "--resume",
        default="",
        help="resume from checkpoint",
    )

    parser.add_argument("--start_epoch", default=0, type=int)

    parser.add_argument("--num_workers", default=10, type=int)

    parser.add_argument(
        "--pin_mem",
        action="store_false",
        help="pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.",
    )

    parser.add_argument("--dist_on_itp", action="store_true")

    return parser

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Relevant file and directory locations.
data_path = "/scratch/local/hdd/nikian/"
anno_file = "./annotation_FSC147_384_zero_shot.json"
data_split_file = data_path + "Train_Test_Val_FSC_147.json"
class_file = data_path + "ImageClasses_FSC147.txt"
im_dir = data_path + "images_384_VarV2"
gt_dir = data_path + "gt_density_map_adaptive_384_VarV2"

with open(anno_file) as f:
    annotations = json.load(f)

with open(data_split_file) as f:
    data_split = json.load(f)

# See https://github.com/mlfoundations/open_clip/blob/37b729bc69068daa7e860fb7dbcf1ef1d03a4185/src/open_clip/transform.py
open_clip_rn50_x_16_preprocess = Compose(
    [
        Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)

open_clip_vit_b_16_preprocess = Compose(
    [
        Resize(
            size=224,
            interpolation=InterpolationMode.BICUBIC,
            max_size=None,
            antialias="warn",
        ),
        Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)

open_clip_vit_bigg_14_preprocess = Compose(
    [
        Resize(
            size=224,
            interpolation=InterpolationMode.BICUBIC,
            max_size=None,
            antialias=None,
        ),
        Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)

open_clip_vit_l_14_336_preprocess = Compose(
    [
        Resize(
            size=336,
            interpolation=InterpolationMode.BICUBIC,
            max_size=None,
            antialias="warn",
        ),
        Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)

class BatchSamplerDoubleDataset(Sampler[List[int]]):

    def __init__(self, sampler_dataset_1: Union[Sampler[int], Iterable[int]], sampler_dataset_2: Union[Sampler[int], Iterable[int]], batch_size: int, p: float) -> None:
        self.sampler_dataset_1 = sampler_dataset_1
        self.sampler_dataset_2 = sampler_dataset_2
        self.batch_size = batch_size
        self.p = p

    def __iter__(self) -> Iterator[List[int]]:
        # Implemented based on the benchmarking in https://github.com/pytorch/pytorch/pull/76951
        sample_from_2 = bernoulli(self.p)
        sampler_1_iter = iter(self.sampler_dataset_1)
        sampler_2_iter = iter(self.sampler_dataset_2)
        while True:
            try:
                sample_from_2_val = sample_from_2.rvs()
                if sample_from_2_val == 1:
                    batch = [next(sampler_2_iter) for _ in range(self.batch_size)]
                    yield batch
                else:
                    batch = [next(sampler_1_iter) for _ in range(self.batch_size)]
                    yield batch
            except StopIteration:
                try:
                    if sample_from_2_val == 0:
                        batch = [next(sampler_2_iter) for _ in range(self.batch_size)]
                        yield batch
                    else:
                        batch = [next(sampler_1_iter) for _ in range(self.batch_size)]
                        yield batch
                except StopIteration:
                    break

    def __len__(self) -> int:
        return (len(self.sampler_dataset_1) // self.batch_size) + (len(self.sampler_dataset_2) // self.batch_size)

class TrainDataFSC147(Dataset):
    def __init__(self):

        self.img = data_split["train"]
        self.img_dir = im_dir

    def __len__(self):
        return len(self.img)

    def __getitem__(self, idx):
        im_id = self.img[idx]
        anno = annotations[im_id]
        text_exemplars = anno["text_exemplars"]

        dots = np.array(anno["points"])

        image = Image.open("{}/{}".format(im_dir, im_id))
        image.load()
        density_path = gt_dir + "/" + im_id.split(".jpg")[0] + ".npy"
        density = np.load(density_path).astype("float32")
        m_flag = 0

        sample = {
            "image": image,
            "text_exemplars": text_exemplars,
            "gt_density": density,
            "dots": dots,
            "id": im_id,
            "m_flag": m_flag,
        }
        sample = TransformTrain(sample)
        return (
            open_clip_vit_b_16_preprocess(sample["image"]),
            sample["gt_density"],
            sample["text_exemplars"],
            sample["m_flag"],
        )

class TrainDataCARPK(Dataset):
    def __init__(self, data):
        self.data = data
        self.possible_queries = ["cars", "the cars", "the automobiles", "the vehicles"]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        data_item = self.data[idx]
        image = (data_item["images"].squeeze() / 255).permute((2, 0, 1))
        image = Resize((384, 683))(image)
        boxes = data_item["boxes"].squeeze()
        # Randomly pick query.
        query_ind = random.randint(0, 3)
        text = clip_tokenizer(self.possible_queries[query_ind])

        # Randomly crop image.
        start_ind = random.randint(0, 299)
        image = open_clip_vit_b_16_preprocess(image[:, :, start_ind: (start_ind + 384)])

        # Create density map.
        density_map = np.zeros((384, 683), dtype='float32')
        for box in boxes:
            box_i = [int(k) for k in box]
            x, y = int(box[0] + box[2] / 2), int(box[1] + box[3] / 2)
            x = int(x * 384 / 720)
            y = int(y * 384 / 720)
            density_map[y][x] = 1
        density_map = density_map[:, start_ind: (start_ind + 384)]
        density_map = ndimage.gaussian_filter(density_map, sigma=(1, 1), order=0) * 60
        density_map = torch.from_numpy(density_map)

        return image, density_map, text, 0

class ValData(Dataset):
    def __init__(self):

        self.img = data_split["val"]
        self.img_dir = im_dir

    def __len__(self):
        return len(self.img)

    def __getitem__(self, idx):
        im_id = self.img[idx]
        anno = annotations[im_id]
        text_exemplars = clip_tokenizer(anno["text_exemplars"])[0].unsqueeze(0)

        dots = np.array(anno["points"])

        image = Image.open("{}/{}".format(im_dir, im_id))
        image.load()
        W, H = image.size

        new_H = 16 * int(H / 16)
        new_W = 16 * int(W / 16)
        image = Resize((new_H, new_W))(image)
        image = TTensor(image)

        return image, dots, text_exemplars, im_id

def main(args):

    misc.init_distributed_mode(args)

    print("job dir: {}".format(os.path.dirname(os.path.realpath(__file__))))
    print("{}".format(args).replace(", ", ",\n"))

    device = torch.device(args.device)

    # Fix the seed for reproducibility.
    seed = args.seed + misc.get_rank()
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    cudnn.benchmark = True

    dataset_train_fsc147 = TrainDataFSC147()
    dataset_carpk = hub.load("hub://activeloop/carpk-train")
    data_loader_carpk = dataset_carpk.pytorch(num_workers=10, batch_size=1, shuffle=False)
    data_iterator_carpk = iter(data_loader_carpk)
    train_data_carpk = []
    for ind in range(len(data_loader_carpk)):
        train_data_carpk.append(next(data_iterator_carpk))
    dataset_train_carpk = TrainDataCARPK(train_data_carpk)

    dataset_train = ConcatDataset((dataset_train_fsc147, dataset_train_carpk))

    sampler_fsc147_train = torch.utils.data.RandomSampler(dataset_train_fsc147)
    sampler_carpk_train = torch.utils.data.RandomSampler(dataset_train_carpk)

    sampler_train = BatchSamplerDoubleDataset(sampler_fsc147_train, sampler_carpk_train, args.batch_size, 0.21)

    dataset_val = ValData()
    print(dataset_train)

    sampler_val = torch.utils.data.SequentialSampler(dataset_val)
    print("Sampler_train = %s" % str(sampler_train))

    data_loader_train = torch.utils.data.DataLoader(
        dataset_train,
        batch_sampler=sampler_train,
        num_workers=args.num_workers,
        pin_memory=args.pin_mem,
    )

    data_loader_val = torch.utils.data.DataLoader(
        dataset_val,
        sampler=sampler_val,
        batch_size=1,
        num_workers=args.num_workers,
        pin_memory=args.pin_mem,
        drop_last=False,
    )

    # Initialize the model.
    model = models_counting_network.__dict__[args.model]()

    model.to(device)

    print("Model = %s" % str(model))

    args.lr = args.blr * args.batch_size / 256

    print("base lr: %.2e" % (args.lr * 256 / args.batch_size))
    print("actual lr: %.2e" % args.lr)

    param_groups = optim_factory.add_weight_decay(model, args.weight_decay)
    optimizer = torch.optim.AdamW(param_groups, lr=args.lr, betas=(0.9, 0.95))
    print(optimizer)

    loss_scaler = NativeScaler()

    misc.load_model_FSC(args=args, model_without_ddp=model)

    print(f"Start training for {args.epochs} epochs")

    # Save the best MAE for the validation set.
    best_val_mae = math.inf
    best_val_epoch = 0
    start_time = time.time()
    for epoch in range(args.start_epoch, args.epochs):

        model.train(True)
        metric_logger = misc.MetricLogger(delimiter="  ")
        metric_logger.add_meter(
            "lr", misc.SmoothedValue(window_size=1, fmt="{value:.6f}")
        )
        header = "Epoch: [{}]".format(epoch)
        print_freq = 20

        train_mae = 0
        train_rmse = 0
        avg_loss = 0
        lr_to_log = 0

        optimizer.zero_grad()

        for data_iter_step, (samples, gt_density, text_exemplars, m_flag) in enumerate(
            metric_logger.log_every(data_loader_train, print_freq, header)
        ):

            lr_sched.adjust_learning_rate(
                optimizer, data_iter_step / len(data_loader_train) + epoch, args
            )

            samples = samples.to(device, non_blocking=True).half()
            gt_density = gt_density.to(device, non_blocking=True).half()
            text_exemplars = text_exemplars.to(device, non_blocking=True)

            # If there is at least one image in the batch using Type 2 Mosaic, 0-shot is banned.
            flag = 0
            for i in range(m_flag.shape[0]):
                flag += m_flag[i].item()
            if flag == 0:
                shot_num = random.randint(0, 3)
            else:
                shot_num = random.randint(1, 3)

            # Set shot number to 1 for initial experiments.
            shot_num = 1
            with torch.cuda.amp.autocast():
                output = model(samples, text_exemplars, shot_num)

            # Compute the loss.
            mask = np.random.binomial(n=1, p=0.8, size=[384, 384])
            masks = np.tile(mask, (output.shape[0], 1))
            masks = masks.reshape(output.shape[0], 384, 384)
            masks = torch.from_numpy(masks).to(device)
            loss = (output - gt_density) ** 2
            loss = (loss * masks / (384 * 384)).sum() / output.shape[0]

            loss_value = loss.item()

            # Update information on the MAE and RMSE.
            batch_mae = 0
            batch_rmse = 0
            for i in range(output.shape[0]):
                pred_cnt = torch.sum(output[i] / 60).item()
                gt_cnt = torch.sum(gt_density[i] / 60).item()
                cnt_err = abs(pred_cnt - gt_cnt)
                batch_mae += cnt_err
                batch_rmse += cnt_err**2

                if i == 0:
                    print(
                        f"{data_iter_step}/{len(data_loader_train)}: loss: {loss_value},  pred_cnt: {pred_cnt},  gt_cnt: {gt_cnt},  error: {abs(pred_cnt - gt_cnt)},  AE: {cnt_err},  SE: {cnt_err ** 2}, {shot_num}-shot "
                    )

            train_mae += batch_mae
            train_rmse += batch_rmse
            avg_loss += loss_value

            loss_scaler(
                loss,
                optimizer,
                parameters=model.parameters(),
                update_grad=True,
            )
            optimizer.zero_grad()

            metric_logger.update(loss=loss_value)

            lr = optimizer.param_groups[0]["lr"]
            metric_logger.update(lr=lr)
            lr_to_log = lr

        print("Averaged stats:", metric_logger)
        train_stats = {k: meter.global_avg for k, meter in metric_logger.meters.items()}

        # Save the model at the last epoch.
        if epoch + 1 == args.epochs:
            misc.save_model(
                args=args,
                model=model,
                model_without_ddp=model,
                optimizer=optimizer,
                loss_scaler=loss_scaler,
                epoch=epoch,
            )

        curr_train_mae = train_mae / len(dataset_train)
        curr_train_rmse = (train_rmse / len(dataset_train)) ** 0.5
        avg_loss = avg_loss / len(data_loader_train)

        # Calculate the MAE and RMSE for the validation set for each epoch.
        val_mae = 0
        val_rmse = 0
        model.eval()
        for data_iter_step, (samples, gt_dots, text_exemplars, im_id) in enumerate(
            iter(data_loader_val)
        ):

            samples = samples.to(device, non_blocking=True)
            gt_dots = gt_dots.to(device, non_blocking=True).half()
            text_exemplars = text_exemplars.to(device, non_blocking=True)

            _, _, h, w = samples.shape

            density_map = torch.zeros([h, w])
            density_map = density_map.to(device, non_blocking=True)
            start = 0
            prev = -1
            with torch.no_grad():
                while start + 383 < w:
                    (output,) = model(
                        open_clip_vit_b_16_preprocess(
                            samples[:, :, :, start : start + 384]
                        ),
                        text_exemplars,
                        1,
                    )
                    output = output.squeeze(0)
                    b1 = nn.ZeroPad2d(padding=(start, w - prev - 1, 0, 0))
                    d1 = b1(output[:, 0 : prev - start + 1])
                    b2 = nn.ZeroPad2d(padding=(prev + 1, w - start - 384, 0, 0))
                    d2 = b2(output[:, prev - start + 1 : 384])

                    b3 = nn.ZeroPad2d(padding=(0, w - start, 0, 0))
                    density_map_l = b3(density_map[:, 0:start])
                    density_map_m = b1(density_map[:, start : prev + 1])
                    b4 = nn.ZeroPad2d(padding=(prev + 1, 0, 0, 0))
                    density_map_r = b4(density_map[:, prev + 1 : w])

                    density_map = (
                        density_map_l + density_map_r + density_map_m / 2 + d1 / 2 + d2
                    )

                    prev = start + 383
                    start = start + 128
                    if start + 383 >= w:
                        if start == w - 384 + 128:
                            break
                        else:
                            start = w - 384

            pred_cnt = torch.sum(density_map / 60).item()

            gt_cnt = gt_dots.shape[1]
            cnt_err = abs(pred_cnt - gt_cnt)
            val_mae += cnt_err
            val_rmse += cnt_err**2

        curr_val_mae = val_mae / len(dataset_val)
        curr_val_rmse = (val_rmse / len(dataset_val)) ** 0.5

        # Save the model if it achieves the best MAE on the validation set.
        if curr_val_mae < best_val_mae:
            # Update the best MAE on the validation set and the epoch that achieved that MAE.
            best_val_mae = curr_val_mae
            best_val_epoch = epoch
            # The model will be saved in the output directory with the file name "checkpoint-[args.epochs].pth".
            misc.save_model(
                args=args,
                model=model,
                model_without_ddp=model,
                optimizer=optimizer,
                loss_scaler=loss_scaler,
                epoch=args.epochs,
            )

        # Log metrics to wandb.
        wandb.log(
            {
                "train_mae": curr_train_mae,
                "train_rmse": curr_train_rmse,
                "average_loss": avg_loss,
                "val_mae": curr_val_mae,
                "val_rmse": curr_val_rmse,
                "best_val_mae": best_val_mae,
                "best_val_epoch": best_val_epoch,
                "lr": lr_to_log,
            }
        )

        log_stats = {
            **{f"train_{k}": v for k, v in train_stats.items()},
            "Current MAE": curr_train_mae,
            "RMSE": curr_train_rmse,
            "epoch": epoch,
        }

        print(
            "Current MAE: {:5.2f}, RMSE: {:5.2f} ".format(
                curr_train_mae,
                curr_train_rmse,
            )
        )

        with open(
            os.path.join(args.output_dir, "log.txt"), mode="a", encoding="utf-8"
        ) as f:
            f.write(json.dumps(log_stats) + "\n")

    total_time = time.time() - start_time
    total_time_str = str(datetime.timedelta(seconds=int(total_time)))
    print("Training time {}".format(total_time_str))

if __name__ == "__main__":
    args = get_args_parser()
    args = args.parse_args()

    # Start a new wandb run to track this script.
    wandb.init(
        project="mini-project-1",
        config={
            "image_encoder_backbone": "ViT-B-16",
            "image_encoder_frozen": False,
            "text_encoder_frozen": True,
            "image_resolution": 224,
            "spatial_feature_map_dims": 14,
            "batch_size": args.batch_size,
            "epochs": args.epochs,
            "weight_decay": args.weight_decay,
            "blr": args.blr,
            "warmup_epochs": args.warmup_epochs,
            "random_loss_masking": True,
            "github_sha": "ac88d287e6f69eb1175cb5257537929885408b10",
            "dataset": "CARPK & FSC-147"
        },
    )

    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    main(args)

    # Finish the wandb run.
    wandb.finish()