nadeemlab / DeepLIIF

Deep Learning Inferred Multiplex ImmunoFluorescence for IHC Image Quantification (https://deepliif.org) [Nature Machine Intelligence'22, CVPR'22, MICCAI'23, Histopathology'23, MICCAI'24]

Select evaluation metrics for generated IHC images from H&E images #39

Closed: linhduongtuan closed this issue 3 months ago

linhduongtuan commented 3 months ago

Dear Authors,

Thank you for sharing your open-source code. I am particularly interested in your paper titled 'Deep Learning-Inferred Multiplex Immunofluorescence for Immunohistochemical Image Quantification.'

However, I noticed that the evaluation relies on segmentation-oriented metrics such as Intersection over Union (IoU), pixel accuracy, Dice score, and the aggregated Jaccard index (AJI), while image-quality metrics such as the Structural Similarity Index Measure (SSIM), Multi-Scale SSIM (MS-SSIM), Peak Signal-to-Noise Ratio (PSNR), and error measures (MSE, MAE, PCC) are not included.

In the context of evaluating generated IHC images from H&E images, metrics like SSIM and PSNR seem intuitively relevant. Could you elaborate on the rationale behind using IoU, pixel accuracy, and other segmentation-focused metrics in your work?
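
For concreteness, this is roughly how I would compute the image-quality side of such a comparison with scikit-image (a minimal sketch; the file names are placeholders, not actual DeepLIIF outputs):

```python
# Minimal sketch: image-quality metrics between a real and a generated IHC tile.
# Assumes scikit-image is installed; the paths below are placeholders.
from skimage.io import imread
from skimage.metrics import (
    structural_similarity,
    peak_signal_noise_ratio,
    mean_squared_error,
)

real = imread("real_ihc.png")        # reference IHC tile, uint8 RGB
fake = imread("generated_ihc.png")   # model-generated IHC tile, same shape

# channel_axis=-1 tells SSIM that the last axis holds the RGB channels
ssim = structural_similarity(real, fake, channel_axis=-1)
psnr = peak_signal_noise_ratio(real, fake)
mse = mean_squared_error(real, fake)
print(f"SSIM={ssim:.4f}  PSNR={psnr:.2f} dB  MSE={mse:.2f}")
```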

Thanks in advance, Linh

sanadeem commented 3 months ago

Thanks, Linh, for your interest in our work. We reported MSE, SSIM, Inception Score, and FID in Extended Data Figure 4 (https://rdcu.be/cKSBz) to evaluate image generation. In the end, image generation is only useful insofar as it serves some purpose in diagnostics or other downstream analyses, hence our stronger focus on segmentation-related metrics. Happy to answer any other questions/concerns.
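
If you want to reproduce an FID-style comparison on your own real vs. generated tiles, a minimal sketch with torchmetrics could look like the following (illustrative only; the random tensors are placeholders, and this is not the exact evaluation pipeline from the paper):

```python
# Minimal FID sketch; requires torchmetrics with the image extras
# (pip install "torchmetrics[image]"), which pulls in torch-fidelity.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# By default FrechetInceptionDistance expects uint8 images of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)

# Placeholders: stand-ins for batches of real and generated IHC tiles.
real_imgs = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_imgs = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_imgs, real=True)
fid.update(fake_imgs, real=False)
print(f"FID: {fid.compute().item():.2f}")
```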

Best Saad

linhduongtuan commented 3 months ago

Dear Dr. Nadeem,

I am Linh Duong. I created an issue yesterday on your DeepLIIF repository.

Thank you for clarifying my concerns.

The SSIM and MSE values you report for the generated images are promising. However, the segmentation metrics seem lower. It would be helpful to know exactly which segmentation metrics were used, for a clearer comparison.

On a related note, I'm also working on a project that involves training models and evaluating them with multiple metrics. My current results are above 0.8 for IoU, SSIM, and MS-SSIM. It's important to acknowledge that various factors can influence evaluation metric outcomes.
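
For reference, the IoU figure above is computed from binary masks along these lines (a minimal numpy sketch; the function and array names are my own placeholders):

```python
# Minimal sketch: per-image IoU and Dice between two binary masks.
import numpy as np

def iou_and_dice(pred_mask: np.ndarray, true_mask: np.ndarray):
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    iou = inter / union if union else 1.0    # two empty masks count as a match
    dice = 2 * inter / total if total else 1.0
    return iou, dice
```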

However, based on my experience, the balance between the metrics reported in your paper doesn't seem to align with typical benchmarks. Could you elaborate on the reasoning behind the choice of evaluation metrics in the paper?

Best regards,

Linh


sanadeem commented 3 months ago

Hi Linh

All metrics are useless, some are useful. It all depends on your end goal. In our case, the image generation is only a conduit to cell segmentation/classification which in turn gives us the clinically reported scores. These scores should visually/quantitatively align with the pathologist manual scoring reference standard. This is something we have tested extensively on large external cohorts (several studies published from independent groups, more coming).

In the end, you will go back and forth with reviewers over which metrics everyone can converge on, again depending on your application and end goal. For clinical deployment, in our experience most of these metrics are useless. You will have to look at individual images and make your judgement on a case-by-case basis; these are not just images, these are patients who need to be given the best care possible, whether with algorithmic assistance or without.

Happy to answer any DeepLIIF-specific questions; otherwise this is not the best platform for a more general discussion on metrics.
