jacktuck / unfurl

Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js :zap:
MIT License
475 stars 51 forks source link

Implement Isomorphic fetch so it works for browsers #38

Closed deadcoder0904 closed 5 years ago

deadcoder0904 commented 5 years ago

Hey, thank you for this wonderful package.

I know this is a Node JS package but can it be used from the frontend like React?

I'm currently making a Chrome extension where I need to get metadata of any site specified so I did this -

import unfurl from "unfurl";

_fetchMeta = async () => {
    try {
      let result = await unfurl("https://akshaykadam.me");
      console.log("result", result);
      alert(result);
    } catch (e) {
      console.error("e", e);
    }
};

But I get this error -

Failed to load https://akshaykadam.me/: No 'Access-Control-Allow-Origin' header is present on the 
requested resource. Origin 'http://localhost:3000' is therefore not allowed access. If an opaque 
response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS 
disabled.

e TypeError: Failed to fetch

In Chrome Extension, it can be removed by adding "permissions": ["<all_urls>"] so I did that but still doesn't work.

It should work as written here πŸ‘‡ Source: Find Cross-Domain from Chrome Extensions in CORS tutorial

I also tried running it as a Web App but still same error. So my question is it possible to do this from Frontend?

jacktuck commented 5 years ago

Hi @deadcoder0904 πŸ‘‹

Thanks for your interest in the project!

I've never actually tested how it'd behaves in the browser but the last time i considered supporting browsers I didn't think it'd be useful because of CORS. But i didn't consider the possibility of using it within a chrome extension which can bypass CORS as you've mentioned. A very cool use case!

As for why you get that CORS issue though despite adding permissions to your manifest, I'm not too sure - are you able to share the code or a reproduction?

deadcoder0904 commented 5 years ago

Hey @jacktuck I have created a repro & given the instructions in the README & in the Chrome Extension itself about installing the extension to showing the error. Let me know if you have any questions :)

deadcoder0904 commented 5 years ago

And the most weird part is it gives plain old HTML with just an axios request

axios
      .get("https://framer.com/")
      .then(console.log)
      .catch(console.error);

Which I've also included now in the example repo :)

deadcoder0904 commented 5 years ago

Oops I forgot to link to the repo πŸ€£πŸ‘‰ https://github.com/deadcoder0904/unfurl-chrome-extension-bug

deadcoder0904 commented 5 years ago

Also I forgot to mention that its not a CORS issue as stated in the error because I've read a lot of articles that state that Chrome Extension can request any Cross Origin URL without any issues. Let me know if you need any help though :)

deadcoder0904 commented 5 years ago

I think the issue must be node-fetch but I'm not so sure. Although you can try changing it to cross-fetch or isomorphic-fetch. Use cross-fetch since the author hasn't been committing to isomorphic-fetch for more than 2 years & all dependencies must be outdated.

Sorry Idk TypeScript otherwise I would've sent a PR

deadcoder0904 commented 5 years ago

Sorry for so much spam πŸ™

But I think I found the error here in the first point

Topics such as Cross-Origin, Content Security Policy, Mixed Content, Service Workers are ignored, given our server-side context.

So changing to cross-fetch will do the job AFAICT :)

jacktuck commented 5 years ago

@deadcoder0904 Hey no worries! I'll take a look at cross-fetch soon - at first glance, it looks solid.

deadcoder0904 commented 5 years ago

I tried learning Typescript & replacing with cross-fetch but I soon realised lots of code is node-specific. Like you are only fetching till head tag & if we do client side we need to fetch whole html

So what you need to do for client side thing is like simple-unfurl πŸ‘‡ https://github.com/itaditya/simple-unfurl/blob/master/index.js

The only thing different in simple-unfurl is its incomplete & your schema looks perfect so if you can extract the schema part & combine it like simple-unfurl it would be amazing. Thank you for this project. Hope I get to use it someday🀞

I'll find a workaround for my project though. If you are able to do it like cross-fetch which works for both node & browser, then it would be awesome ❀️

jacktuck commented 5 years ago

If you are able to do it like cross-fetch which works for both node & browser, then it would be awesome

I definitely intend to.

Like you are only fetching till head tag & if we do client side we need to fetch whole html

Can you explain why this bit is node specific though?

deadcoder0904 commented 5 years ago

Can you explain why this bit is node specific though?

I mean to say that you can't fetch only till head tag in client, you have to fetch everything. Also, you are using url from node.

Is there anything else I'm missing you are using from node?

Also do you have a timeframe (maybe estimation) for the browser release?

jacktuck commented 5 years ago

I mean to say that you can't fetch only till head tag in client, you have to fetch everything.

I get what you mean, but unsure as to why that is the case. It still does a GET request rather than HEAD request. It just stops parsing when </head> is reached since no metadata should be found after this point.

Is there anything else I'm missing you are using from node?

I think that would be it, i guess I need to make the module bundle friendly too, if it isn't already.

Also do you have a timeframe (maybe estimation) for the browser release?

Good question. I'll try and have something working this week, time permitting.

deadcoder0904 commented 5 years ago

Thank you I'll monkey-patch something till then :)

deadcoder0904 commented 5 years ago

I get what you mean, but unsure as to why that is the case. It still does a GET request rather than HEAD request. It just stops parsing when is reached since no metadata should be found after this point.

Oops I saw the code again & I thought Parser fetches only till head tag. I completely missed fetch fetching the whole thing & then parsing it till head 🀣

deadcoder0904 commented 5 years ago

Friendly ping :)

Any progress?

lucascorrea97 commented 5 years ago

Any updates on that?

jacktuck commented 5 years ago

Been busy with work i've recently started a new job. I'll look this evening.

jacktuck commented 5 years ago

So i've taken an initial look at how much work is involved.

And so far I figure we need to replace fetch with cross-fetch (i looked at isomophic-fetch but it uses node-fetch@1.0.0 and we need 2.X.X) and iconv-lite (or figure out how to make it isomorphic easily, docs are unclear). One interesting thing is node-fetch implemented some non-standard helpers which we made use of like timeouts and size limits. So we need to figure out if those are important and if we can implement it ontop of cross-fetch πŸ˜“

jacktuck commented 5 years ago

@deadcoder0904 i've had a look at your repo (thanks for making it btw). I have confirmed it works, at least for the domain you're testing. I've done an npm release, which is a major release since it breaks some bits such as timeouts. Checkout unfurl.js@3.0.0 and let me know how it goes!

NB! I've only tested latest version of chrome, we may well have to shim some packages like url - i think thats only work atm because node has a similar implementation to the browser, but not sure how well its supported.

deadcoder0904 commented 5 years ago

No, it's not working - https://codesandbox.io/s/1qx99lo9ml

The error is

> ModuleNotFoundError
Could not find module in path: 'unfurl.js/dist/unexpectedError' relative to 
'/node_modules/unfurl.js/dist/index.js'

I had a look & its this line https://github.com/jacktuck/unfurl/blob/a0cff1efeb6e1da746fe5c16f59015d6bbadf160/dist/index.js#L18

But, the file is UnexpectedError.js & required one is unexpectedError.js.

It's a typo somewhere πŸ˜‚

And thanks for doing it πŸ™Œ

jacktuck commented 5 years ago

Oh i think it's because I'm on a mac and its case-insensitive? I'll fix this up and see!

jacktuck commented 5 years ago

Looks like that was the case, nice find! I've published 3.0.1 to npm. Once you update package.json to the new patched version you should get the CORS error. Which is to be expected outside of a extension context. See https://codesandbox.io/s/vynqj0yqxy

Access to XMLHttpRequest at 'https://akshaykadam.me/' from origin 'https://vynqj0yqxy.codesandbox.io' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
14:19:02.483 
deadcoder0904 commented 5 years ago

Naah still doesn't work for me. Got this error now -

Refused to set unsafe header "user-agent"

Updated code to use Hooks πŸ‘‰https://github.com/deadcoder0904/unfurl-chrome-extension-bug

bug
deadcoder0904 commented 5 years ago

Okay saw your 1st edited comment with code (please don't edit valuable comments πŸ˜‚) & I changed the breaking change from

unfurl({
  uri: "https://framer.com"
});

to

unfurl("https://framer.com");

and it finally works πŸŽ‰

Thank you πŸ™Œ

deadcoder0904 commented 5 years ago

But the error

Refused to set unsafe header "user-agent"

is still there. Other than that works fine.

jacktuck commented 5 years ago

πŸ‘ Will look at that and update the docs at some point. I think that is error is because you cant set UA header in the browser (as it is not possible)?

deadcoder0904 commented 5 years ago

I think that is error is because you cant set UA header in the browser (as it is not possible)?

Yes, according to https://stackoverflow.com/a/33144144/6141587

So I think https://github.com/jacktuck/unfurl/blob/master/src/index.ts#L50 must be changed.

So instead of doing

const res = await fetch(url, {
    headers: {
      Accept: "text/html, application/xhtml+xml",
      "User-Agent": opts.userAgent
    }
  });

I think it shall do

const isBrowser = typeof window !== undefined;
const headers = isBrowser ? {} : {
    Accept: "text/html, application/xhtml+xml",
    "User-Agent": opts.userAgent
};
const res = await fetch(url, {
    headers
  });

WDYT? Is it correct? Or are there other places where header is set?

deadcoder0904 commented 5 years ago

Also, I think the README example can be a little better. It gets confusing to find unfurl on npm since the package name is unfurl.js & there are so many unfurl packages on npm. And ./unfurl.js makes it feel like a relative path.

Instead of

const unfurl = require('./unfurl.js')
const result = unfurl('https://github.com/trending')

You can probably do

const unfurl = require('unfurl.js')
(async function() {
   const result = await unfurl('https://github.com/trending')
})();
jacktuck commented 5 years ago

@deadcoder0904 Both of those changes sound good to me. Want to make a PR?

Checking if we are in node env could be done like so typeof process !== 'undefined' && process.versions && process.versions.node;

deadcoder0904 commented 5 years ago

typeof process !== 'undefined' && process.versions && process.versions.node is hard to read than typeof window !== undefined I guess πŸ€·β€β™‚οΈ

deadcoder0904 commented 5 years ago

I tried doing it but it's failing tests. Maybe it's better for you to make changes & push a version whenever you have time :)

ubershmekel commented 4 years ago

I saw this change https://github.com/jacktuck/unfurl/commit/76cab79ec1fbc02e9e3e2e928d224c4c52b8a26b that states browsers won't be supported. Why is that the case @jacktuck ?

jacktuck commented 4 years ago

@ubershmekel Hi. That was because doing Unfurling in the browser is thwart with issues. Primarily it's unfeasible due to CORS. Which you can get around if you proxy your requests - but at that point you may as well do the unfurl there too (i.e. server side).

deadcoder0904 commented 4 years ago

That was because doing Unfurling in the browser is thwart with issues

What issues? Bummer, this was a great use for Chrome Extensions where we don't need server-side.

jacktuck commented 4 years ago

Relates to #57

deadcoder0904 commented 4 years ago

Cool, any version that works client-side? Because the schema is rad here πŸ˜‚

github-actions[bot] commented 3 years ago

:tada: This issue has been resolved in version 1.0.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: