kovleventer / WenkuBaiduDL

Download documents from wenku.baidu.com without registration
MIT License

Broken: JS obfuscation has been added #4

Open valpackett opened 6 months ago

valpackett commented 6 months ago

It seems that instead of JSON, the server now returns a JS snippet that loads another JS file…

❯ python wenku_baidu_dl.py 'https://wenku.baidu.com/view/503c103c25c52cc58bd6be92.html'
http://ai.wenku.baidu.com/play/503c103c25c52cc58bd6be92?pn=1&rn=5
JSON? {
        var random = Math.random();
        var mirrorScript = document.createElement("script");
        mirrorScript.src = "//sofire.bdstatic.com/js/xaf3.js" + '?v=' + random;
        mirrorScript.setAttribute('async', 'async');
        mirrorScript.setAttribute('data-bdms-faccdee21b68', 'eyJhcHBfa2V5IjoiNzQ1NCIsImFwcF92aWV3IjoicHJvbW90ZSIsImJyb3dzZXJfdXJsIjoiaHR0cHM6Ly9zb2ZpcmUuYmFpZHUuY29tL2RhdGEvdWEvYWIuanNvbiIsImZvcm1fZGVzYyI6IiIsInNlbmRfaW50ZXJ2YWwiOjUwLCJzZW5kX21ldGhvZCI6M30=')
        var firstScriptDom = document.getElementsByTagName("script")[0];
        firstScriptDom.parentNode.insertBefore(mirrorScript, firstScriptDom);
    }
Traceback (most recent call last):
  File "/home/val/src/github.com/kovleventer/WenkuBaiduDL/wenku_baidu_dl.py", line 120, in <module>
    download_pdf(args.url, args.output, args.resolution, args.pages_per_query)
  File "/home/val/src/github.com/kovleventer/WenkuBaiduDL/wenku_baidu_dl.py", line 94, in download_pdf
    fromPage, end = download_one_block(doc_id, fromPage, resolution, pages_per_query)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/val/src/github.com/kovleventer/WenkuBaiduDL/wenku_baidu_dl.py", line 47, in download_one_block
    json_obj = json.loads(metadata)
               ^^^^^^^^^^^^^^^^^^^^
Anonymous941 commented 5 days ago

This is part of a function, not JSON:

<script>
    (function() {
        var random = Math.random();
        var mirrorScript = document.createElement("script");
        mirrorScript.src = "//sofire.bdstatic.com/js/xaf3.js" + '?v=' + random;
        mirrorScript.setAttribute('async', 'async');
        mirrorScript.setAttribute('data-bdms-faccdee21b68', '...')
        var firstScriptDom = document.getElementsByTagName("script")[0];
        firstScriptDom.parentNode.insertBefore(mirrorScript, firstScriptDom);
    })();
</script>
valpackett commented 4 days ago

That's exactly what I said. Baidu replaced the JSON that used to be there with this JS code.

Anonymous941 commented 3 days ago

@valpackett What I think is happening is that they changed the AI URLs (ai.wenku.baidu.com) to simply redirect to the homepage. The scraper blindly follows the redirect and then tries to parse the homepage as JSON with a simple string search for { and }, which just happen to be part of a JS function. You can see this with curl (using the URL you provided):

$ curl http://ai.wenku.baidu.com/play/503c103c25c52cc58bd6be92\?pn\=1\&rn\=5
<a href="https://wenku.baidu.com/">Moved Permanently</a>.
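
If that is what's going on, the script could at least fail with a clearer message instead of handing homepage HTML to json.loads. Below is a minimal sketch of such a check, not the project's actual code: parse_metadata and UnexpectedResponse are hypothetical names, and it assumes the caller already knows the HTTP status, the URL it requested, and the URL it ended up at.

```python
import json


class UnexpectedResponse(Exception):
    """Raised when the reply is not the JSON metadata the scraper expects."""


def parse_metadata(status: int, requested_url: str, final_url: str, body: str) -> dict:
    """Hypothetical helper: validate a reply before treating it as JSON.

    Rejects redirects and non-JSON bodies instead of searching the text
    for '{' and '}' and hoping the slice in between is JSON.
    """
    # A 3xx status, or ending up at a different URL, means we were redirected.
    if 300 <= status < 400 or final_url != requested_url:
        raise UnexpectedResponse(
            f"request for {requested_url} was redirected to {final_url} "
            f"(status {status})"
        )

    # Parse the whole body strictly; HTML or JS fails here with a clear error.
    try:
        obj = json.loads(body)
    except json.JSONDecodeError as exc:
        raise UnexpectedResponse(f"body is not valid JSON: {exc}") from exc

    # The metadata is assumed to be a JSON object, not a list or scalar.
    if not isinstance(obj, dict):
        raise UnexpectedResponse(f"expected a JSON object, got {type(obj).__name__}")
    return obj
```

If the script fetches pages with requests (I haven't checked), passing allow_redirects=False to requests.get and feeding response.status_code, the requested URL, response.url and response.text into a check like this should surface the redirect right away. It would not bring the old endpoint back, it would only make the failure obvious.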