nv-tlabs / ATISS

Code for "ATISS: Autoregressive Transformers for Indoor Scene Synthesis", NeurIPS 2021

Easy overfitting to the dataset? #20

Closed Jingyu6 closed 1 year ago

Jingyu6 commented 1 year ago

Hi,

I've been trying to replicate the results on the new dataset, since the old version is no longer accessible. However, when I train the model on the bedroom scenes, the validation loss drops only for the first several epochs and starts to increase thereafter.

Could you share some info, such as the validation loss range you achieved during your experiments? My runs fairly consistently reach the training loss range reported in #9, but still end up overfitting the training examples.

Thanks in advance.

Best, Jingyu

Jingyu6 commented 1 year ago

Hi @paschalidoud,

It seems that my overfitted model produces much worse quantitative metrics than the ones reported in the paper, so I'd really appreciate it if you could share some info on the validation stats, despite the dataset difference. Thanks!

Best, Jingyu

shanqiiu commented 1 year ago

Hi @Jingyu6,

I just started running the code from this paper. How long did it take you to train on the bedroom dataset? By my rough estimate, it would take about a month. Is that abnormal? Thanks!

Best, BoweiJiang

Jingyu6 commented 1 year ago

Hi @shanqiiu,

I think it depends on the number of epochs you're aiming for. In my case, I first ported the code to PyTorch Lightning and ran it on 4 GPUs with the same effective batch size. It takes roughly 6 hours to train the model to the loss range reported in #9.
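For reference, the setup looked roughly like the sketch below. It is illustrative rather than my actual code, and `compute_loss` is a hypothetical stand-in for the model's per-term loss computation:

```python
# Minimal sketch of a PyTorch Lightning port (illustrative, not my actual code).
# With DDP, each of the 4 GPUs gets its own slice of the batch, so the
# per-process batch size is the effective batch size divided by the GPU count.
import pytorch_lightning as pl
import torch

EFFECTIVE_BATCH_SIZE = 128
NUM_GPUS = 4
PER_GPU_BATCH_SIZE = EFFECTIVE_BATCH_SIZE // NUM_GPUS  # 32 per GPU

class ATISSModule(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        # Hypothetical interface: the model returns a dict of per-term losses
        # (label, translation, size, angle), which are summed for the update.
        losses = self.model.compute_loss(batch)
        self.log_dict({f"train/{k}": v for k, v in losses.items()}, sync_dist=True)
        return sum(losses.values())

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)

# trainer = pl.Trainer(accelerator="gpu", devices=NUM_GPUS, strategy="ddp",
#                      max_epochs=8000)
# trainer.fit(ATISSModule(model), train_loader)  # loader uses PER_GPU_BATCH_SIZE
```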

Best, Jingyu

shanqiiu commented 1 year ago

Hi @Jingyu6,

It's a great honor to get your answer. I checked the ATISS code and found that the training script is set to 10000 epochs. After 4000 epochs, my loss is already far lower than the range you gave, so I don't know how many epochs are appropriate. I'd like to ask your advice. Thanks!

Best, BoweiJiang

Jingyu6 commented 1 year ago

Hi @shanqiiu,

From what I have tried so far, runs are fairly consistent in terms of training losses. Here's one example after 8000 epochs with effective batch size 128:

angle loss: 0.44

label loss: 1.03

size loss: 1.60

translation loss: 6.00

which takes roughly 8.5 hours on 4 GPUs.

However, the validation loss starts to increase fairly early (the minimum is around 2k epochs), and I don't know what the validation loss range was for the model used in the paper. Judging by the test losses, the label prediction overfits the most (it starts to increase almost from the beginning), while the angle prediction seems to be the least overfitted.
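Given that, if I were rerunning this I would checkpoint at the validation minimum instead of training the full schedule. With a Lightning setup like the one above, that is just two callbacks; note that `val/total_loss` is a metric name you would have to log yourself in a `validation_step`, not something ATISS logs out of the box:

```python
# Sketch: keep the best weights and stop once validation stops improving.
# Assumes a validation_step that logs "val/total_loss" (not part of ATISS).
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # save only the checkpoint with the lowest validation loss
    ModelCheckpoint(monitor="val/total_loss", mode="min", save_top_k=1),
    # stop after 200 validation checks without improvement (patience is a guess)
    EarlyStopping(monitor="val/total_loss", mode="min", patience=200),
]
# trainer = pl.Trainer(..., callbacks=callbacks)
```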

What losses did you get for training?

Best, Jingyu

shanqiiu commented 1 year ago

Hi @Jingyu6,

Thank you very much for your reply. My situation is basically the same as yours: overfitting also sets in early, and my loss is even lower.

Best, BoweiJiang

paschalidoud commented 1 year ago

Hi @Jingyu6 and @shanqiiu.

First of all I apologize for my super late reply.

@shanqiiu regarding training time: training should take 1 or 2 days, which was approximately 2000 epochs on an NVIDIA GeForce GTX 1080 Ti GPU. However, if I remember correctly, the model would normally start generating plausible scenes after a couple hundred training epochs, which took only a couple of hours.

@Jingyu6 regarding overfitting: we also experienced some overfitting (the validation losses were going up while the training loss was going down), but it did not happen right from the beginning, and it wasn't affecting the downstream tasks we wanted to demonstrate. If you experience overfitting right from the beginning, I think it might be related to the new data. In case the overfitting affects the downstream task, I recommend trying to alter the configuration of the transformer encoder; I think even a smaller transformer would work fine.
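For instance, something along the lines of the sketch below, written in plain PyTorch purely for illustration (it is not the encoder builder the repo actually uses, and the sizes are guesses you would need to tune):

```python
# Illustrative only: a reduced-capacity transformer encoder in plain PyTorch.
# Fewer layers, fewer heads, and a narrower feed-forward block all shrink
# capacity, which tends to delay the onset of overfitting on small datasets.
import torch.nn as nn

small_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=256,           # hidden size of each object-token embedding
        nhead=4,               # fewer attention heads
        dim_feedforward=512,   # narrower feed-forward block
        dropout=0.1,           # dropout is a further regularizer
        batch_first=True,
    ),
    num_layers=2,              # shallower stack
)
```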

Best, Despoina

Jingyu6 commented 1 year ago

Hi @paschalidoud,

Thanks for the reply. In my case, it doesn't affect the downstream task either. I'll close the issue for now.

Best, Jingyu

Caesarr007 commented 8 months ago

Hi, could you share your trained model? If you can, please email me at caesar007@sjtu.edu.cn. Thanks!