Closed n0099 closed 1 year ago
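Over the past two weeks, Tieba's distributed backend has silently changed the response structure of two of the protobuf-format endpoints that tbm uses, in complete violation of the convention that the _client_version common parameter in the request acts as api versioning: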
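frs/page (forum-to-threads endpoint):
- On the 29th, the user metadata of each thread's author was moved from /.data.userlist (/.user_list in json) into author under each thread (/.data.threadlist[].author, /.thread_list[].author in json), and the value of authorid was removed. This is the same structure that the reply-to-sub-reply endpoint (pb/floor) returns, or that is returned when _client_version is 8.x or earlier: https://github.com/n0099/TiebaMonitor/commit/32168f6abbc772df8c1396239c07e31f0983c4ac
- threadlist[].lastreplyer (the last replier of a thread), as returned for _client_version=6.0.2 (an ancient client from 2014), was brought back: https://github.com/n0099/TiebaMonitor/commit/8c3ea0354e599792ed5915036afccdd4139065ef#diff-0947e0e56103d7012ac23772c8f07097420782c255ae0fd119fced7493af2d2fR56
- The value of threadlist[].firstpostid (the pid of a thread's first floor) was removed, forcing me to additionally request the old json api with _client_version=8.8.8.8 (the exact version number doesn't matter, because the json api's response structure has not introduced these breaking changes, apart from one case mentioned below): https://github.com/n0099/TiebaMonitor/commit/42b3c800cfe3eabca6dbb39ef313e716af7d3034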
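The fallback described in the last item can be sketched roughly as follows. This is a minimal Python illustration, not tbm's actual code (tbm is written in C#): the function name resolve_first_post_id and the fetch_legacy callback are hypothetical, the dict keys firstPostId and tid are borrowed from the log output later in this thread, and treating 0 as "missing" assumes the field comes back as the protobuf default value.

```python
from typing import Callable, Optional


def resolve_first_post_id(
    thread: dict,
    fetch_legacy: Callable[[int], Optional[int]],
) -> Optional[int]:
    """Return the pid of a thread's first floor.

    An unset protobuf integer reads as its default value 0, so 0 (or a
    missing key) is treated as "not provided" and triggers the fallback.
    """
    pid = int(thread.get("firstPostId") or 0)
    if pid != 0:
        return pid
    # Hypothetical second lookup, e.g. backed by the old json api.
    return fetch_legacy(int(thread["tid"]))


# Stub standing in for the extra request to the old json api.
legacy_lookup = {8259338363: 146841271316}
first_pid = resolve_first_post_id(
    {"tid": "8259338363", "firstPostId": 0}, legacy_lookup.get
)
```

Passing the lookup in as a callback keeps the extra request out of the parser, so the second round trip only happens for threads whose value is actually missing.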
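By recording the values of firstpostid and _abstract (the plain text (type=0) and images (type=3) in a thread's first-floor content, essentially a condensed form of PbContent), tbm can try to salvage the first-floor information of threads that are deleted very soon after being posted (within the time gap between requesting frs/page and requesting pb/page for that thread), whether deleted by the author, by moderators, or swallowed by the system: https://github.com/n0099/TiebaMonitor/commit/076fafb0610d8075f18a25aa93a0fe6efbe041f3#diff-770e303757a304daabbcce0ae0707dd709089f526f3f83e07cdb0f0cc46f73edR133
In /.thread.thread_info (/.thread itself is the metadata of the replies' parent thread) there is a phone_type. Its value (android, iphone) is the same as the posting device of each reply's author, which is otherwise only visible on the Tieba web client ("from the Android/iPhone/Windows Phone/Windows 8 UWP client", i.e. the value of the _client_type field of the classic posting endpoint: https://github.com/MoeNetwork/wmzz_post/blob/4752b09ea53064be6a8f9b2cba842fa081ec75cc/wmzz_post_cron.php#L16 https://github.com/MoeNetwork/wmzz_post/blob/80aba25de46f5b2cb1a15aa2a69b527a7374ffa9/wmzz_post_setting.php#L64 , as well as the even more ancient "from Baidu Album"). But because the request is not pb/page (which does not provide such phone_type information either, unless I additionally request the web html and parse the dom structure that Tieba's frontend changes every day), the posting device of each floor cannot be obtained. Then, at an unknown point in time, Tieba removed this redundant /.thread.thread_info in a gradual rollout: https://github.com/n0099/TiebaMonitor/blob/8b6f7a179f030726c9810a372403846ba5562ebf/crawler/src/Tieba/Crawl/ThreadLateCrawlerAndSaver.cs#L77
pb/page (thread-to-replies endpoint):
- On the 30th, userlist and reply.authorid were removed and the user metadata of each floor's author was placed under each reply. Worse, this was a gradual rollout (the modified or the previous structure is returned at random), forcing me to keep handling for both structures and fall back between them: https://github.com/n0099/TiebaMonitor/commit/0e7d15bcb188da781055a573c899acfb97eacd50
Timestamp drift of the ?t= forced-cache-miss querystring param:
- Date unknown. userlist[].portrait is a url string, for example https://gss0.baidu.com/7Ls0a8Sm2Q5IlBGlnYG/sys/portraith/item/tb.1.fe425d3d.zv_7dovVrqzjOUwoT5oZVw?t=1663407014
The trailing ?t= querystring param serves to invalidate whatever the user agent (a browser, or the Tieba client if it implements an http cache layer) has previously cached for this get request. This guarantees one request to the server gss0.baidu.com, after which the Cache-Control value in the response header decides whether and how to cache: https://stackoverflow.com/questions/83990/is-it-the-filename-or-the-whole-url-used-as-a-key-in-browser-caches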
This kind of hash-based cache busting is commonplace to cdn operators. A typical example is css/js assets added through wordpress's wp_enqueue_style/wp_enqueue_script api, which can carry a ?ver= querystring param: https://wordpress.stackexchange.com/questions/183669/prevent-version-url-parameter-ver-x-x-x-on-enqueued-styles-scripts
Another is the webpack hash of modern frontend showbiz, which chooses to put the mutable part directly into the filename rather than the url querystring: https://stackoverflow.com/questions/35176489/what-is-the-purpose-of-webpack-hash-and-chunkhash
https://github.com/n0099/TiebaMonitor/commit/f72fe89e30d06159934a6fd557773cd62bcde88e
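To make the two cache-busting styles concrete, here is a minimal Python sketch that derives a short hash from an asset's content and places it either in the querystring or in the filename. The function names, the 8-character sha256 prefix and the default parameter name v are illustrative assumptions; neither wordpress nor webpack is claimed to compute its value this way.

```python
import hashlib
from pathlib import PurePosixPath


def content_hash(data: bytes, length: int = 8) -> str:
    """Short hex digest that changes whenever the content changes."""
    return hashlib.sha256(data).hexdigest()[:length]


def query_versioned(path: str, data: bytes, param: str = "v") -> str:
    """Querystring style: same path, the hash goes into a query param."""
    return f"{path}?{param}={content_hash(data)}"


def filename_versioned(path: str, data: bytes) -> str:
    """Filename style: the hash is inserted before the file extension."""
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{content_hash(data)}{p.suffix}"))


css = b"body { color: red; }"
url_a = query_versioned("/static/app.css", css)
url_b = filename_versioned("/static/app.css", css)
```

Either way the url changes only when the content does, so a cache keyed on the full url treats every new version as a fresh resource.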
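Tieba sets this ?t= to the unix timestamp of the user's last avatar upload. Whenever a user uploads a new avatar, the url returned by the api changes and the UA is forced to bypass its local cache. Note, however, that historical avatars cannot be obtained through this url endpoint alone (changing ?t= to a past value will not return the avatar that was in use at that time).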
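A crawler that stores portraits may want to separate the stable part of the url from the ?t= timestamp. A minimal Python sketch using only the standard library; the function name split_portrait_url is hypothetical, and the only assumption about the url is the ?t= convention described above.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit, urlunsplit


def split_portrait_url(url: str) -> Tuple[str, Optional[datetime]]:
    """Split a portrait url into (url without querystring, upload time).

    The upload time is taken from the ?t= unix timestamp and is None when
    the parameter is absent.
    """
    parts = urlsplit(url)
    stable = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    t = parse_qs(parts.query).get("t")
    updated = datetime.fromtimestamp(int(t[0]), tz=timezone.utc) if t else None
    return stable, updated


stable_url, updated_at = split_portrait_url(
    "https://gss0.baidu.com/7Ls0a8Sm2Q5IlBGlnYG/sys/portraith/item/"
    "tb.1.fe425d3d.zv_7dovVrqzjOUwoT5oZVw?t=1663407014"
)
```

Keeping the timestamp as its own value lets the stable url serve as the user's portrait key, while the timestamp records when the avatar last changed.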
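However, I noticed that the ?t= value in this url, as returned by requests to the same Tieba cdn ip, may be off by 1~3 seconds. For example, a user has recently updated their avatar only once, at 1672471655, but the ?t= value may be the correct 1672471655, or 1672471656, or 1672471654.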
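One way for a crawler to avoid recording a bogus avatar change on every such wobble is to compare timestamps with a tolerance. A minimal Python sketch; the function name and the 3-second default (taken from the 1~3 second range observed above) are assumptions, and it presumes that genuine avatar updates are never that close together.

```python
def avatar_changed(stored_t: int, observed_t: int, tolerance: int = 3) -> bool:
    """Treat ?t= values within `tolerance` seconds as the same upload."""
    return abs(observed_t - stored_t) > tolerance


# The example values from above all count as the same upload.
same_upload = [
    avatar_changed(1672471655, t)
    for t in (1672471654, 1672471655, 1672471656)
]
```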
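I suspect that some server nodes in Tieba's distributed microservice backend architecture have gone too long without synchronizing their system time via a protocol such as NTP.
Who knows how many Tieba ecosystem tools, long unmaintained and neglected but possibly still used by Tieba holdouts, will be broken by Tieba's recent spree of silent breaking changes (and all I know of is the permutations of 6 or 7 parameters across the 3 endpoints that tbm uses):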
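Among them, people active in the bug, firefox and chrome forums during 2013~17 were very keen on developing these userscripts and browser extensions.
All men must die, as the saying goes. Unfortunately, the author of 贴吧助手 and 助手版贴吧 has shut down 魔法书目录, the official site through which he offered paid value-added services (mainly the ability to set the text of the title box below one's avatar frame, shown to other 贴吧助手 users, equivalent to the personalized nameplate that Tieba membership later sold in 2014): http://book.mofamulu.com
咔咔_嘎嘎的窝 forum: https://tieba.baidu.com/p/7852395201
Meanwhile, as of December 31, 2022, tiebalite, currently the third-party client with the most users, is still catching up with the third breaking change above https://github.com/HuanCheng65/TiebaLite/commit/dbd885fb9c341dbf994b5ae13f54613a4f57d1fe#diff-7c49b0e44b1ca847eb58ceb0e83578584fb6e92c0985463ee6f25e3bc86c92bfR760 https://github.com/HuanCheng65/TiebaLite/commit/a9c6d70af5fb4a9413e30cdcb75c2f420d14b62b https://github.com/HuanCheng65/TiebaLite/commit/8790752988d72fe578d7a54a8e48915ff1a2ff64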
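https://github.com/bakasnow/TiebaDuster 鸡毛毯子 by 鸡血神
https://github.com/bakasnow/TiebaManagerMini 贴吧管理器迷你版 (Tieba manager, mini edition) by 鸡血神
https://github.com/1021263881/TieBaTools
https://github.com/shitianshiwa/Tieba-Cloud-Review Did you delete this repo's fork parent repo yourself? As far as I can see, the commit author is always you
https://github.com/cash2one/pdiff apparently permutations of endpoint parameters
https://github.com/96dl/Tieba-Cloud-Sign-Plugins plugins written for tc before ver4, one of which is 云审查 (cloud review)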
This is probably the Christmas present that Tieba's programmers have given us: https://github.com/Starry-OvO/aiotieba/pull/63#issuecomment-1362433590 cc @BANKA2017 how are the endpoints that tc uses doing these days?
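To use the latest version's features, of course, virtual avatars and the like. People are already using them to post ads, so I can hardly not keep up
How is a virtual avatar used to post ads? Does it fully allow ugc, like users uploading a background image for their profile page/name card?
The virtual avatar's status can be customized, for example set to a porn site url or a group number
Meanwhile: I set the signature on my user page to 四叶重工,相信品牌的力量
and it gets censored all the same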
The user metadata of each thread's author was moved from /.data.userlist (/.user_list in json) into author under each thread (/.data.threadlist[].author, /.thread_list[].author in json), and the value of authorid was removed
The heinous act of removing thread.authorId actually started as early as August https://github.com/n0099/TiebaMonitor/blob/47044cc68b41b957cb0b899a1be0b4780f757cf4/crawler/src/Tieba/Crawl/Parser/ThreadParser.cs#L34 https://github.com/n0099/TiebaMonitor/blob/b8d6f02e4cb8627a2b6b8bc909c01a13ad1c74ae/crawler/src/Tieba/Crawl/Parser/ThreadParser.cs#L47
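A parser that has to accept both layouts quoted above (author embedded under each thread, or a separate user list referenced by an author id) could resolve the author along these lines. This is a minimal Python sketch and not tbm's actual C# parser: resolve_author is a hypothetical name, the keys author and user_list follow the json names quoted above, and the keys author_id and id inside the objects are assumptions.

```python
from typing import Optional


def resolve_author(thread: dict, user_list: list) -> Optional[dict]:
    """Return the author's user object from whichever layout is present."""
    # Newer layout: user metadata embedded under the thread itself.
    embedded = thread.get("author") or {}
    if embedded.get("id"):
        return embedded
    # Older layout: the thread carries only an id that points into user_list.
    author_id = str(thread.get("author_id") or "0")
    if author_id == "0":
        return None  # neither layout provides an author
    by_id = {str(user["id"]): user for user in user_list}
    return by_id.get(author_id)


users = [{"id": "6391274527", "name": "example"}]
old_style = resolve_author({"author_id": "6391274527", "author": {}}, users)
new_style = resolve_author({"author": {"id": "6391274527"}}, [])
```

Trying the embedded object first and falling back to the list means the same code path survives a rollout in which either structure can come back.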
My choice:
UPDATE tbmc_f97650_thread AS T JOIN tbmc_f97650_reply AS R
ON R.tid = T.tid AND R.floor = 1 AND T.authorUid = 0
SET T.authorUid = R.authorUid;
Before and after:
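- The value of threadlist[].firstpostid (the pid of a thread's first floor) was removed
In fact, following the supreme directive of the holy Starry https://github.com/Starry-OvO/aiotieba/issues/67#issuecomment-1376006123
once the common._client_type=2 request protobuf param is added, it comes back:
but with the param removed, it is still the protobuf default value 0: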
2023-02-12 18:30:03.4341|ERROR|T84|BaseCrawlFacade`5.LogException|Exception
page: 1;fid: 898666;forumName: 贴吧意见反馈;parsed: tbm.Crawler.Db.Post.ThreadPost;raw: { "tid": "8259338363", "title": "申请解除封禁屏蔽", "replyNum": -1, "viewNum": 34, "threadTypes": 1024, "author": { }, "Abstract": [ { "text": "亲爱的各位贴吧管理组成员: " } ], "fid": "898666", "firstPostId": "146841271316", "createTime": 1676191790, "authorId": "6391274527", "agree": { } } System.Exception: Thread parse error.
---> System.OverflowException: Arithmetic operation resulted in an overflow.
at tbm.Crawler.Tieba.Crawl.Parser.ThreadParser.Convert(Thread inPost) in /src/Tieba/Crawl/Parser/ThreadParser.cs:line 43
--- End of inner exception stack trace ---
at tbm.Crawler.Tieba.Crawl.Parser.ThreadParser.Convert(Thread inPost) in /src/Tieba/Crawl/Parser/ThreadParser.cs:line 43
at System.Linq.Utilities.<>c__DisplayClass2_0`3.<CombineSelectors>b__0(TSource x)
at System.Linq.Enumerable.SelectIListIterator`2.MoveNext()
at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](IEnumerable`1 source, Func`2 keySelector, Func`2 elementSelector, IEqualityComparer`1 comparer)
at tbm.Crawler.Tieba.Crawl.Parser.BaseParser`2.ParsePosts(CrawlRequestFlag requestFlag, IList`1 inPosts, Dictionary`2& outPosts, List`1& outUsers) in /src/Tieba/Crawl/Parser/BaseParser.cs:line 15
at tbm.Crawler.Tieba.Crawl.Facade.BaseCrawlFacade`5.ValidateThenParse(Response responseTuple) in /src/Tieba/Crawl/Facade/BaseCrawlFacade.cs:line 213
at tbm.Crawler.ExtensionMethods.ForEach[T](IEnumerable`1 source, Action`1 action) in /src/ExtensionMethods.cs:line 61
at tbm.Crawler.Tieba.Crawl.Facade.BaseCrawlFacade`5.<>c__DisplayClass26_0.<<CrawlPageRange>b__0>d.MoveNext() in /src/Tieba/Crawl/Facade/BaseCrawlFacade.cs:line 104
--- End of stack trace from previous location ---
at tbm.Crawler.Tieba.Crawl.Facade.BaseCrawlFacade`5.LogException(Func`1 payload, UInt32 page, UInt16 previousFailureCount, CancellationToken stoppingToken) in /src/Tieba/Crawl/Facade/BaseCrawlFacade.cs:line 192
In the c/f/frs/page endpoint, FrsPage/DataRes.thread_info.reply_num may be -1
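The log above shows replyNum at -1 together with an OverflowException in ThreadParser.Convert, which would be consistent with a checked conversion of -1 into an unsigned integer (an assumption on my part, the log does not say which field overflowed). A minimal Python sketch of a guard that maps out-of-range values to None instead of failing the whole page; the name to_uint32_or_none and the choice of 32 bits are illustrative.

```python
from typing import Optional

UINT32_MAX = 2**32 - 1


def to_uint32_or_none(value: int) -> Optional[int]:
    """Return value if it fits an unsigned 32-bit column, else None."""
    if 0 <= value <= UINT32_MAX:
        return value
    return None


# Values as they appear in the log above: replyNum is -1, viewNum is 34.
reply_num = to_uint32_or_none(-1)
view_num = to_uint32_or_none(34)
```

Storing None (NULL) rather than clamping to 0 keeps "the server sent -1" distinguishable from "the thread has no replies".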
Before 9ff799490039b790e28d8f17a3545a1dcd7ded40, the version numbers were scattered across the individual api wrapper functions in tiebaBrowser/_api.py: