tdlib / td

Cross-platform library for building Telegram clients
https://core.telegram.org/tdlib
Boost Software License 1.0
7.11k stars 1.44k forks source link

updateFile - some files may get stuck in `is_downloading_active` in poor network conditions #2585

Open vincentneo opened 1 year ago

vincentneo commented 1 year ago

Observed on watchOS.

I will send a log to the bot, verbosity level of 10 shortly after submitting this, which the relevant ids shared below, corresponds to this log.

Using downloadFile, synchronous = false, priority 30, to download a file of size 23520 bytes, seems to not be downloaded at all, after waiting for about 5 minutes, with download_size of localFile being 0, throughout all updateFile calls. This problematic file's id is 1376.

I believe poor network conditions that is specifically used in this testing, is partly a cause to this, but what seem odd to me, is when, after 5 minutes, I tried to download another different file, this time, of size 19364 bytes, which successfully downloaded in about a minute, with the same settings and same network conditions as above. This file's id is 1369.

Both files are from 2 different messagePhoto, type m, width 320 photo files.

levlam commented 1 year ago

Your network connection seems to be extremely poor. The connections are closed by timeout all the time, because they can't receive request response in 10-30 seconds.

vincentneo commented 1 year ago

Ah. In that case where connections are timed out, am I right to say that there will not be any new updateFile emitted, and is_downloading_active will be true for the remainder of the entire session? Regardless, if I call downloadFile again after x seconds, will that override the stalled download and try again? Is this yet another side effect from #2538 and its relevant shortcomings?

I am using the macOS's network link conditioner to simulate slow networks, which was done so, because I have personally frequently observed stalled image downloads on the watch, when I am out, which I guess is with the watch networking being always much much slower than on phone for whatever reasons, even both phone and watch are on the same cellular network or when the watch is sharing the phone's connectivity.

levlam commented 1 year ago

is_downloading_active will be true for the remainder of the entire session?

It will be true until the file is successfully downloaded. There are no "stailed" downloads. The download is in progress and it should eventually succeed, but currently it can't succeed due to extemely poor network. It fails to download 20 kilobytes of data in 10 seconds.

vincentneo commented 1 year ago

There are no "stailed" downloads. The download is in progress and it should eventually succeed

The simulated network, as far as I know limits the bandwidth in a fixed manner, with the following setting:

network link conditioner, very bad network profile

If the same network is able to download 19kb in a minute (file 1369), wouldn't it signify that something is wrong with 1376 only?

levlam commented 1 year ago

No, it only means that 1376 was very lucky to have no lost packets, despite each packet is lost with 10% probability. I wouldn't expect file download to work with 10% packet loss. It is almost a broken network for TCP.

vincentneo commented 1 year ago

I have re-tested downloading of the same file (of 23520 bytes), same message. This time, the file id is 5237. Downloading did complete, but it took about 37 minutes to finish!

To eliminate the variable of packet loss, and the theory that id 1369 was about to download within a minute, I have set 0% possibility of packet loss, as well as 0ms of artificial delay, while maintaining the same bandwidth.

network link conditioner, set to 1mbps bandwidth, 0 drop packets, 0ms delay, up/down

Outside of simulation, the network where this is held, is a stable 1gbps fiber network, connected via ethernet, average 4ms ping to telegram server (149.154.171.5)

This proves the following:

What is odd:

What is interesting:

I am seeing the following in the log, that I don't recall seeing the last time: Repair file reference?

[ 3][t 4][1692783654.816176176][FileManager.cpp:2536][#1][!FileManager][&file_references]   Do not have valid file_reference for file 5237(0)
[ 3][t 4][1692783654.816184282][FileReferenceManager.cpp:432][#1][!FileReferenceManager][&file_references]  Repair file reference for file 5237(0)/5237(0)
[ 3][t 4][1692783654.816191196][FileReferenceManager.cpp:439][#1][!FileReferenceManager][&file_references]  Create new file reference repair query with generation 3
[ 3][t 4][1692783654.816197156][FileReferenceManager.cpp:263][#1][!FileReferenceManager][&file_references]  Trying to repair file reference for file 5237(0)
[ 3][t 4][1692783654.816203117][FileReferenceManager.cpp:293][#1][!FileReferenceManager][&file_references]  Send file reference repair query for file 5237(0) with generation 3 from FileSourceId(99)
[ 3][t 4][1692783658.020383119][FileReferenceManager.cpp:394][#1][!FileReferenceManager][&file_references]  Receive result of file reference repair query for file 5237(0) with generation 3 from FileSourceId(99): OK 0
[ 3][t 4][1692783658.020387172][FileManager.cpp:2552][#1][!FileReferenceManager][&file_references]  Receive result from FileSourceManager for file 5237(0): [Error : 0 : FILE_DOWNLOAD_RESTART_WITH_FILE_REFERENCE]

Log will be provided, shortly after posting this post.

levlam commented 1 year ago

The connections are closed and reopened by TDLib, because no data has been received for 7-44 seconds through them.

vincentneo commented 1 year ago

@levlam, I am very sorry for the waste of resources and the opening of this issue.

I had another look into the log, namely, the last log that I've sent to the bot, and I realised, that the testing was in fact, flawed, as there was actually a messageAnimation, 4.83MB file (id 5820), downloading as well without my knowledge at the time of conducting the tests, which I assume, used basically all of the already limited bandwidth. That was, in fact, the main problem.

However, before I close this issue, may I ask, to verify my latest understanding of how downloadFile's priority works? As a accidental discovery, I realised id 5820 is downloading with priority of 5, while the earlier file id of 5237 is downloading with priority of 30.

Am I right to say with the above context, that, as long as a download has started, priority no longer matters? I realise 5820's downloadFile was called before 5237, so my understanding with this is 5820's download will need to finish, before 5237 starts to download. Thank you.

levlam commented 1 year ago

Files are downloaded in a small chunks. Priority always matters for yet unsent file part request, but ongoing requests aren't canceled if a new download with higher priority is started. So, a part of the animation will be downloaded, then the photo, and then the rest of the animation.

vincentneo commented 1 year ago

In the case of the final log, I observed the following:

(Numbers refers to timestamp) 5820 Start Download: 1692783528.315485000 Complete Download: 1692786663.937427997

5237 Start Download: 1692783572.393924236 Complete Download: 1692785832.999131917

Since 5820 continued to receive updateFile in between the time of 5237's download, can I verify with you that, the updateFile calls are in response to the requests sent before 1692783572?

vincentneo commented 1 year ago

can I verify with you that, the updateFile calls are in response to the requests sent before 1692783572?

I assume this cannot be known from the log? If so I will close this issue.