twitter / communitynotes

Documentation and source code powering Twitter's Community Notes
https://twitter.github.io/communitynotes
Apache License 2.0

Wrong content-length header for datasets #261

Open fdietze opened 2 months ago

fdietze commented 2 months ago

Describe the bug

When downloading the datasets from https://x.com/i/communitynotes/download-data using wget, the download hangs and stops receiving data, because the Content-Length header (566M) is larger than the file actually being served (185M).

To Reproduce

wget https://ton.twimg.com/birdwatch-public-data/2024/09/07/notes/notes-00000.tsv
--2024-09-07 13:43:04--  https://ton.twimg.com/birdwatch-public-data/2024/09/07/notes/notes-00000.tsv
Resolving ton.twimg.com (ton.twimg.com)... 2606:2800:233:7ee2:97c:ab4c:6c70:be36, 152.199.21.140
Connecting to ton.twimg.com (ton.twimg.com)|2606:2800:233:7ee2:97c:ab4c:6c70:be36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593120688 (566M) [text/tab-separated-values]
Saving to: ‘notes-00000.tsv’

notes-00000.tsv         32%[=======>                   ] 185.19M  --.-KB/s    eta 2m 58s

Expected behavior

The content-length header should be set to the file size.
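One way to confirm the mismatch independently of wget is to compare the advertised Content-Length with the number of bytes that actually arrive. The following is a minimal Python sketch, not part of the Community Notes tooling: `summarize_transfer` and `fetch_and_compare` are hypothetical helper names, and the numbers in the usage lines are taken from the wget log in this report (the received byte count is approximated from the 185.19M shown on the progress line).

```python
import http.client
import urllib.request

MIB = 2**20  # wget's "M" suffix in the log corresponds to mebibytes


def summarize_transfer(declared_bytes: int, received_bytes: int) -> dict:
    """Compare an advertised Content-Length with the bytes actually received."""
    return {
        "declared_mib": round(declared_bytes / MIB, 2),
        "received_mib": round(received_bytes / MIB, 2),
        "percent_received": round(100 * received_bytes / declared_bytes, 1),
        "missing_bytes": declared_bytes - received_bytes,
        "complete": received_bytes == declared_bytes,
    }


def fetch_and_compare(url: str, timeout: float = 30.0) -> dict:
    """Stream `url` and count body bytes (hypothetical helper, needs network)."""
    received = 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        header = resp.headers.get("Content-Length")
        if header is None:
            raise ValueError("server sent no Content-Length header")
        try:
            while chunk := resp.read(MIB):
                received += len(chunk)
        except http.client.IncompleteRead as exc:
            # The body ended before Content-Length bytes were delivered.
            received += len(exc.partial)
        except OSError:
            # Stalled or dropped connection: keep the count gathered so far.
            pass
    return summarize_transfer(int(header), received)


if __name__ == "__main__":
    # Declared length from the log; received size approximated from "185.19M".
    print(summarize_transfer(593_120_688, int(185.19 * MIB)))
```

With the figures from the log, the summary reports a transfer that stopped at roughly a third of the declared length, which is consistent with the 32% on wget's progress bar. The `timeout` argument is what keeps `fetch_and_compare` from hanging indefinitely the way the original download did.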

jbaxter commented 2 months ago

Interesting, I can repro this. Thanks

elvey commented 1 month ago

Roughly the same thing happens today, in both Safari and wget, with https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv

% wget https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv
--2024-09-21 19:39:33--  https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv
Resolving ton.twimg.com (ton.twimg.com)... 152.199.24.184
Connecting to ton.twimg.com (ton.twimg.com)|152.199.24.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 611776258 (583M) [text/tab-separated-values]
Saving to: ‘notes-00000.tsv’

notes-00000.tsv 32%[===================================> ] 190.74M --.-KB/s eta 3m 51s

(BUT, at least wget does sort-of-work/fail gracefully, eventually:

2024-09-21 19:43:08 (912 KB/s) - Connection closed at byte 200002828. Retrying.

--2024-09-21 19:43:09--  (try: 2)  https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv
Connecting to ton.twimg.com (ton.twimg.com)|152.199.24.184|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

The file is already fully retrieved; nothing to do.)
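The retry in this log suggests why wget eventually reports success: after the connection closed at byte 200002828, it asked the server for the rest of the file, and the 416 response indicates that no bytes exist at or beyond that offset. Below is a small Python sketch of that interpretation. The names `range_header` and `interpret_resume_status` are illustrative, and the status mapping mirrors what the log shows plus standard HTTP range semantics; it is not taken from wget's source code.

```python
def range_header(offset: int) -> dict:
    """Build the header a client sends to resume a download at `offset`."""
    return {"Range": f"bytes={offset}-"}


def interpret_resume_status(status: int) -> str:
    """Map the status of a resume request to the client's next action."""
    if status == 206:
        return "append"    # server returned the remaining bytes
    if status == 416:
        return "complete"  # nothing at or past the offset: file already whole
    if status == 200:
        return "restart"   # Range was ignored: the body starts again at byte 0
    return "error"         # anything else needs separate handling


if __name__ == "__main__":
    # Offset taken from the "Connection closed at byte 200002828" line.
    print(range_header(200002828))
    print(interpret_resume_status(416))
```

Read this way, the served file is 200002828 bytes (about 190.74 MiB, matching the progress line), while the header declared 611776258 bytes.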