Thank you very much for your contributions; they have been a great help to my current research. While training with this framework, I observed a sudden increase in GPU memory usage when validation runs after the first training epoch. In my case, about 10 GB was used while training the first epoch and 17 GB during the validation that followed, and usage did not change after that. My guess is that training takes about 10 GB and validation about 7 GB, but the memory used for training is not released during validation inference, so the two add up. I haven't studied the source code thoroughly, so I'm not sure whether this behavior is necessary for the framework, but it may lead to wasted memory (for example, a 24 GB graphics card can effectively only be treated as a 17 GB card under this training/validation scheme). I also found that the training script doesn't seem to support splitting training across multiple graphics cards, which could to some extent solve the problem of insufficient graphics card memory.
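To help narrow this down, here is a minimal sketch of how the memory could be inspected between the training and validation phases. It assumes the framework is built on PyTorch, which I have not confirmed, and `gpu_memory_report` and `release_cached_memory` are hypothetical helper names of my own, not part of the framework. The distinction it tries to show is between memory held by live tensors ("allocated") and memory kept by PyTorch's caching allocator ("reserved"), since tools such as nvidia-smi also count the cached part.

```python
try:
    import torch  # assumption: the framework is PyTorch-based
except ImportError:
    torch = None


def _cuda_ready():
    """True only if PyTorch is installed and a CUDA device is usable."""
    return torch is not None and torch.cuda.is_available()


def gpu_memory_report(tag):
    """Print and return current GPU memory figures in GB.

    Returns None when PyTorch or CUDA is unavailable.
    """
    if not _cuda_ready():
        return None
    gb = 1024 ** 3
    report = {
        "tag": tag,
        # memory occupied by tensors that are still referenced
        "allocated_gb": torch.cuda.memory_allocated() / gb,
        # memory held by the caching allocator (includes freed blocks)
        "reserved_gb": torch.cuda.memory_reserved() / gb,
        # highest allocated value seen so far in this process
        "peak_allocated_gb": torch.cuda.max_memory_allocated() / gb,
    }
    print(report)
    return report


def release_cached_memory():
    """Return unused cached blocks to the driver; live tensors are unaffected."""
    if not _cuda_ready():
        return False
    torch.cuda.empty_cache()
    return True
```

One way to use this would be to call `gpu_memory_report("after train epoch")` at the end of the training loop, then `release_cached_memory()`, then `gpu_memory_report("before validation")`. If "allocated" stays near 10 GB, some training tensors are probably still referenced; if only "reserved" is high, the extra usage may just be the allocator's cache rather than memory that is truly unavailable.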