microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License

Add more integration tests #629

Open Kikobeats opened 1 year ago

Kikobeats commented 1 year ago

Suggestions:

dadpatrol.com
hospitalitynet.org
tophotel.news
visitdetroit.com
costar.com
skift.com
prnewswire.com
thetravel.com
spokanejournal.com
nytimes.com
villagevoice.com
stpeterising.com
deccanherald.com
gsabusiness.com
biztimes.com
lodgingmagazine.com
nashvillepost.com
traveldailynews.com
finance.yahoo.com
hotelsmag.com
hometownsource.com
financialexpress.com
post-journal.com
wane.com
milehighcre.com
marketbeat.com
hotelnewsresource.com
bisnow.com
seekingalpha.com
nativenewsonline.net
papercitymag.com
thecentersquare.com
mobilesyrup.com
hospitalityandcateringnews.com
neworleanscitybusiness.com
business.inquirer.net
asia.nikkei.com
thecharlottepost.com
expressnews.com
premierconstructionnews.com
globaldesignnews.com
sportindustry.biz
bhg.com
dailymail.co.uk
ebony.com
god.dailydot.com
bleacherreport.com
lewishowes.com
app.gaia.gives
bobvila.com
autoexpress.co.uk
bollywoodbubble.com
buzzfeed.com
buzzfeednews.com
forevergeek.com
utsports.com
nybooks.com
popularmechanics.com
bostonglobe.com
venturebeat.com
oola.com
ratemyjob.com
tickld.com
scarymommy.com
edweek.org
bigmarker.com
computerworld.com
simplyrecipes.com
digitalhill.com
podcasts.apple.com
businesstoday.in
ajc.com
open.spotify.com
millionstories.com
bigthink.com
theconversation.com
freethink.com
amazon.com
boredpanda.com
hcltech.com
madamenoire.com
economictimes.indiatimes.com
nature.com
youtube.com
astralcodexten.substack.com
noahpinion.substack.com
slowboring.com
theatlantic.com
veranda.com
literalhumans.com
contentsnare.com
exprealty.com
bollywoodhungama.com
history.howstuffworks.com
people.howstuffworks.com
animals.howstuffworks.com
home.howstuffworks.com
adage.com
missourireview.com
writersrelief.com
artofmanliness.com
awkward.com
awkwardfamilyphotos.com
rolltide.com
al.com
on3.com
izumitelno.com
zsl.org
prdaily.com
smile.amazon.co.uk
landlordtoday.co.uk
theathletic.com
clutchpoints.com
patriotledger.com
capecodtimes.com
enterprisenews.com
skeptic.org.uk
cnbc.com
foxbusiness.com
politico.com
developer.salesforce.com
reg.salesforce.com
salesforceben.com
salesforce.vidyard.com
trendingpoliticsnews.com
thepatriotjournal.com
techdirt.com
thefederalist.com
apnews.com
constitutionparty.com
games.crossfit.com
mirror.co.uk
bearingarms.com
bloomberg.com
coindesk.com
news.sky.com
bitcoin.review
unchained.com
whatisbitcoin.com
wsj.com
foxnews.com
breitbart.com
cotton.senate.gov
nationalreview.com
foreign.senate.gov
docs.google.com
foodnetwork.com
apartmenttherapy.com
engadget.com
thehill.com
bbc.com
bbc.co.uk
theamericanconservative.com
pkftexas.com
whole-dog-journal.com
huffpost.com
marketingweek.com
skysports.com
citinewsroom.com
sonyaz.net
theverge.com
eksiseyler.com
legalinsurrection.com
americanthinker.com
dailycaller.com
tampafp.com
timcast.com
newsbusters.org
dailywire.com
pjmedia.com
thinkcivics.com
en-volve.com
liveaction.org
andmagazine.substack.com
redstate.com
newsweek.com
freebeacon.com
campusreform.org
lifesitenews.com
thepostmillennial.com
issuesinsights.com
the-pipeline.org
rsbnetwork.com
revolver.news
globalnews.ca
insider.com
businessinsider.com
dynamicchiropractic.com
techradar.com
bongino.com
conservativenewsdaily.net
wfla.com
dcenquirer.com
patriottruths.com
wesh.com
icontact-archive.com
gundigest.com
ammoland.com
catholicgentleman.com
looper.com
aol.com
msn.com
designboom.com
nhm.ac.uk
lehighsports.com
colgateathletics.com
espn.com
wielkahistoria.pl
nssf.org
firstpost.com
techcrunch.com
fox17online.com
cyclingweekly.com
nafme.org
act.survivalinternational.org
washingtontimes.com
theguardian.com
news.bloombergtax.com
cnn.com
patrioticmillionaires.org
anothermag.com
wegotthiscovered.com
lonelyplanet.com
houzz.com
libertyorelse.substack.com
dailyhive.com
merionwest.com
mackinac.org
tastingtable.com
communityimpact.com
vista.today
trendhunter.com
nationalgeographic.co.uk
forbes.com
live.imbibe.com
newyorker.com
eatbook.sg
wkbw.com
celebmagazine.com
steelnews.biz
inquirer.com
argonautnews.com
thedailymeal.com
orlandosentinel.com
dailyfreeman.com
perfectdailygrind.com
guiltyeats.com
digitaljournal.com
yahoo.com
nrn.com
simpleflying.com
blockclubchicago.org
torontolife.com
barrietoday.com
hercampus.com
middletownpress.com
miami.eater.com
vvdailypress.com
qsrmagazine.com
gardenandgun.com
northjersey.com
benefitnews.com
journalstar.com
njmonthly.com
eviemagazine.com
poosh.com
megaphone.southwestern.edu
curiocity.com
epicurious.com
walesonline.co.uk
maxim.com
wrat.com
frenchly.us
mtlblog.com
vanillamagazine.it
vinepair.com
franchising.com
baltimoresun.com
broadwayworld.com
bangordailynews.com
ansamed.info
espressonews.gr
bighospitality.co.uk
vice.com
sg.news.yahoo.com
foodgressing.com
faroutmagazine.co.uk
altaonline.com
sootoday.com
taiwannews.com.tw
hgazette.com
thestar.com
wszystkoconajwazniejsze.pl
srf.ch
washingtonpost.com
laist.com
tatlerasia.com
swp.de
elitedaily.com
noen.at
nzz.ch
boulderweekly.com
wdw-magazine.com
espresso-magazin.de
elephantjournal.com
libertarianinstitute.org
blog.tenthamendmentcenter.com
reason.com
frontpagemag.com
wnd.com
lifenews.com
independentsentinel.com
newspointworld.com
townhall.com
babylonbee.com
washingtonexaminer.com
whec.com
notthebee.com
justthenews.com
nypost.com
fivethirtyeight.com
jonathanturley.org
ourgeneration.news
libertywire.net
theblaze.com
thespectator.com
cnsnews.com
ijr.com
hotair.com
thestockdork.com
nymag.com
westernjournal.com
aa.com.tr
fox7austin.com
mommysbliss.com
avma.org
target.com
abcnews.go.com
chicago.suntimes.com
usatoday.com
verywellhealth.com
cbs8.com
wxow.com
sdbj.com
sprudge.com
barchart.com
penbaypilot.com
mysanantonio.com
dailycoffeenews.com

JaneJeon commented 1 year ago

On top of this, could you also update the existing tests (last run in 2016)? I understand that bypassing scraping protection is an issue, but fetching the pages with browserless (rather than good ol' got, see https://github.com/microlinkhq/metascraper/blob/master/bench/index.js#L105) or with puppeteer + puppeteer-extra-plugin-stealth should make that less of a problem, I presume?

Thanks!

Kikobeats commented 1 year ago

@JaneJeon will do. metascraper only receives the HTML, so how that HTML is fetched is not metascraper's responsibility; there's no problem on that front 🙂
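
To illustrate the point above, here is a minimal sketch of an integration-test harness that keeps fetching and scraping separate. The names `runCase`, `fromFixture`, and `titleOnly` are hypothetical and not part of metascraper; `titleOnly` is only a stand-in for a configured metascraper instance, which is called the same way, with `{ html, url }`.

```javascript
// Hypothetical harness: runCase, fromFixture and titleOnly are illustrative
// names, not part of metascraper.
async function runCase ({ url, fetchHtml, scrape }) {
  // How the HTML is obtained is the caller's choice
  // (got, browserless, puppeteer, or a saved fixture).
  const html = await fetchHtml(url)
  // The scraper only ever sees the markup and the URL it came from.
  return scrape({ html, url })
}

// Stand-in fetcher that serves a saved fixture instead of hitting the network.
const fixtures = new Map([
  ['https://example.com/post', '<html><head><title>Example post</title></head></html>']
])

const fromFixture = async url => {
  if (!fixtures.has(url)) throw new Error(`no fixture for ${url}`)
  return fixtures.get(url)
}

// Stand-in scraper. In a real test this would be a configured metascraper
// instance, invoked as `metascraper({ html, url })`.
const titleOnly = async ({ html, url }) => {
  const match = html.match(/<title>([^<]*)<\/title>/i)
  return { url, title: match ? match[1].trim() : null }
}
```

Calling `runCase({ url: 'https://example.com/post', fetchHtml: fromFixture, scrape: titleOnly })` resolves to an object with the fixture's URL and title. Swapping `fromFixture` for a fetcher built on browserless or puppeteer would change how the HTML is obtained without touching the scraping step.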