jacktuck / unfurl

Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js :zap:
MIT License
474 stars 51 forks source link

Wrong parsing of <title> #75

Closed davidhq closed 3 years ago

davidhq commented 3 years ago

Thank you very much for this great library, seems to work well, I'm currently testing it.

However:

unfurl.js will parse the title of https://hash.ai as "Facebook".

Reason:

This is the correct title and it appears first:

Screenshot 2021-05-25 at 21 55 53

but later on there is more stuff with and what unfurl parses is the last one:</p> <img width="416" alt="Screenshot 2021-05-25 at 21 55 41" src="https://user-images.githubusercontent.com/3899/119560370-25b77b80-bda4-11eb-9b40-e3fd4bcbd4d7.png"> <p>This works correctly:</p> <p><a rel="noreferrer nofollow" target="_blank" href="https://github.com/wzbg/read-title/blob/master/index.js">https://github.com/wzbg/read-title/blob/master/index.js</a></p> <p>would it be possible to do it this way in your library?</p> <p>Can it be done easily / soon ? If not, if you could please explain if you see this as an issue or not ... I'll think more about it and possibly fork + implement a fix myself... if there are any guidelines for me to adhere to for the fix to have a better chance of being accepted upstream, also let me know.. I'm not yet versed in typescript so the code might not be the best...</p> <p>I hope it will be possible to do something in some way, so... thank you for the feedback / thoughts on this. david</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/jacktuck"><img src="https://avatars.githubusercontent.com/u/8209433?v=4" />jacktuck</a> commented <strong> 3 years ago</strong> </div> <div class="markdown-body"> <p>Looks like we don't account for <code><title></code> being present in nested nodes, for example <code><title></code> inside of a <code><svg></code> (which is your test case). I think the fix would be to check we're at the root-level of the document before setting the title in metadata. </p> <p>I'd happily accept a change for this upstream and would like to have some tests for this too.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/jacktuck"><img src="https://avatars.githubusercontent.com/u/8209433?v=4" />jacktuck</a> commented <strong> 3 years ago</strong> </div> <div class="markdown-body"> <p>I'll be able to take a look at this later in the week and get a fix up.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/davidhq"><img src="https://avatars.githubusercontent.com/u/3899?v=4" />davidhq</a> commented <strong> 3 years ago</strong> </div> <div class="markdown-body"> <p>Thank you! It would be great if you can get it fixed. If not in a week or so, I'll take a look and offer a possible fix.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/github-actions[bot]"><img src="https://avatars.githubusercontent.com/in/15368?v=4" />github-actions[bot]</a> commented <strong> 3 years ago</strong> </div> <div class="markdown-body"> <p>:tada: This issue has been resolved in version 5.2.6 :tada:</p> <p>The release is available on:</p> <ul> <li><a href="https://www.npmjs.com/package/unfurl.js/v/5.2.6">npm package (@latest dist-tag)</a></li> <li><a href="https://github.com/jacktuck/unfurl/releases/tag/v5.2.6">GitHub release</a></li> </ul> <p>Your <strong><a href="https://github.com/semantic-release/semantic-release">semantic-release</a></strong> bot :package::rocket:</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/jacktuck"><img src="https://avatars.githubusercontent.com/u/8209433?v=4" />jacktuck</a> commented <strong> 3 years ago</strong> </div> <div class="markdown-body"> <p>Should be fixed now. It's not a very thorough fix so might need looking at again if other edge cases come up around this.</p> <pre><code>❯ yes | npx ts-node -e "require('./src/index.ts').unfurl('https://hash.ai').then(console.log)" npm WARN exec The following package was not found and will be installed: ts-node { title: 'HASH - Complex Systems Simulation', twitter_card: { card: 'summary_large_image', site: '@hashintel', title: 'HASH - Complex Systems Simulation', description: 'Build simulations in minutes not days. HASH is an open-source platform backed and built by the founders of Kaggle, Stack Overflow, Trello and Glitch.', images: [ [Object] ] }, open_graph: { url: 'https://hash.ai/hash.ai', title: 'HASH - Complex Systems Simulation', description: 'Build simulations in minutes not days. HASH is an open-source platform backed and built by the founders of Kaggle, Stack Overflow, Trello and Glitch.', images: [ [Object] ] }, description: 'Build simulations in minutes not days. HASH is an open-source platform backed and built by the founders of Kaggle, Stack Overflow, Trello and Glitch.', favicon: 'https://hash.ai/assets/img/brand/favicon-192.png' }</code></pre> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/davidhq"><img src="https://avatars.githubusercontent.com/u/3899?v=4" />davidhq</a> commented <strong> 3 years ago</strong> </div> <div class="markdown-body"> <p>tnx! it seems to work for this case. Didn't notice other issues for now.</p> </div> </div> <div class="page-bar-simple"> </div> <div class="footer"> <ul class="body"> <li>© <script> document.write(new Date().getFullYear()) </script> Githubissues.</li> <li>Githubissues is a development platform for aggregating issues.</li> </ul> </div> <script src="https://cdn.jsdelivr.net/npm/jquery@3.5.1/dist/jquery.min.js"></script> <script src="/githubissues/assets/js.js"></script> <script src="/githubissues/assets/markdown.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/highlight.min.js"></script> <script src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.4.0/build/languages/go.min.js"></script> <script> hljs.highlightAll(); </script> </body> </html>