I had to update your regexp so that the metadata from the html was recognized.
The 'dash' in the aria-label is not a normal hyphen (it is an en dash, "–"), so the regexp was not matching it.
Some videos and photos also have the Landscape or Portrait text before the date (e.g. aria-label="Video – Portrait – 4 Dec 2013, 06:30:26").
I also added a log message so that I am notified when dates are found in the html.
Sorry I didn't create a pull request, but thought it was just as easy to post here.
if (year === 1970 && month === 1) {
  // if metadata is not available, we try to get the date from the html
  console.log('Metadata not found, trying to get date from html')
  const data = await page.request.get(page.url())
  const html = await data.text()
  // "." stands in for the dash, which is not a plain hyphen in the page source
  const regex = /aria-label="(Photo . Landscape|Photo . Portrait|Video . Landscape|Video . Portrait|Video|Photo) . ([^"]+)"/
  const match = regex.exec(html)
  if (match) {
    const dateString = match[2]
    const date = new Date(dateString)
    // only overwrite year/month if the string actually parsed as a date
    if (!Number.isNaN(date.getTime())) {
      year = date.getFullYear()
      month = date.getMonth() + 1
      console.log(`Found date in html - Year = ${year}, Month = ${month}`)
    }
  }
}
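
In case it is useful, here is a stricter variant of the same match as a rough sketch. Instead of "." it names the dash explicitly (en dash U+2013 or a plain hyphen) and makes the Landscape/Portrait part optional. The helper name findDateLabel and the sample html string are made up for illustration; they are not part of the script.

```javascript
// Hypothetical helper (not in the original script): returns the date text
// from the first matching aria-label, or null if nothing matches.
function findDateLabel(html) {
  // [\u2013-] accepts an en dash or a plain hyphen;
  // the Landscape/Portrait segment is optional.
  const regex = /aria-label="(?:Photo|Video)(?: [\u2013-] (?:Landscape|Portrait))? [\u2013-] ([^"]+)"/
  const match = regex.exec(html)
  return match ? match[1] : null
}

// Made-up sample modelled on the aria-label quoted above.
const sample = '<div aria-label="Video \u2013 Portrait \u2013 4 Dec 2013, 06:30:26"></div>'
console.log(findDateLabel(sample)) // prints "4 Dec 2013, 06:30:26"
```

The returned string still has to go through new Date(...), and parsing of non-ISO date strings is implementation-dependent, so the result is worth checking before year and month are overwritten.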