Open ian-pvd opened 2 years ago
Depending on your needs, you could just pull it from the JSON part of the page, for example: .albumRelease[0].musicReleaseFormat
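A minimal sketch of that suggestion, assuming the page embeds its metadata in a script tag of type application/ld+json and that the parsed object has an albumRelease array, as the path above implies. The helper name getReleaseFormat is hypothetical and not part of bandcamp-scraper, and the regex only stands in for a proper HTML parser.

```javascript
// Hypothetical helper (not part of bandcamp-scraper): read the first release's
// format out of the JSON-LD block embedded in an album page.
function getReleaseFormat(html) {
  // Assumption: the metadata lives in <script type="application/ld+json">...</script>.
  const match = html.match(
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return null;

  let data;
  try {
    data = JSON.parse(match[1]);
  } catch (err) {
    return null; // malformed JSON-LD
  }

  // Assumption: the shape matches the .albumRelease[0].musicReleaseFormat path above.
  const releases = data && Array.isArray(data.albumRelease) ? data.albumRelease : [];
  if (releases.length === 0) return null;
  return releases[0].musicReleaseFormat ?? null;
}
```

This works on HTML that has already been fetched, so it does not by itself say whether the scraper exposes the same field in its own results.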
I'm seeing this in the application/ld+json tag in the page markup, but where do I find it in the scraper results? I'm not seeing it in the AlbumInfo response. If there's a way to avoid making multiple scraper requests for the Album Info and then also the digital product price, that'd be really helpful.
Plus, it's trivial to use startsWith instead of strict equality to still get a positive match on "Digital AlbumDigital Album", but I figured this response from the scraper deserved a bug report at least.
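As a rough client-side workaround along those lines, here is a sketch of two hypothetical helpers (neither is part of bandcamp-scraper): one mirrors the startsWith check, the other collapses a value that is exactly the same text repeated twice.

```javascript
// Loose match in the spirit of the startsWith workaround described above.
function isDigitalAlbum(format) {
  return typeof format === 'string' && format.startsWith('Digital Album');
}

// If the string is exactly one piece of text repeated twice, return one copy;
// otherwise return the string unchanged.
function collapseDoubled(text) {
  const half = text.length / 2;
  const isDoubled =
    text.length > 0 &&
    Number.isInteger(half) &&
    text.slice(0, half) === text.slice(half);
  return isDoubled ? text.slice(0, half) : text;
}
```

Note that collapseDoubled would also shorten a legitimate title that happens to repeat itself, so it only hides the symptom and is no substitute for a fix in the parser.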
Further debugging seems to show that the releases where this is occurring actually do have two .buyItemPackageTitle spans inside the release list item.
Markup for a result without the issue:
<li class="buyItem digital">
<h3 class="hd">
<button class='download-link buy-link' type="button">
<span class="buyItemPackageTitle primaryText">Digital Album</span>
</button>
<div class="digitaldescription secondaryText"> Streaming + Download </div>
</h3>
...
</li>
Markup returned for a result with the duplicate text issue:
<li class="buyItem digital">
<h3 class="hd">
<button class='download-link buy-link' type="button">
<span class="buyItemPackageTitle primaryText">Digital Album</span>
</button>
<span class="buyItemPackageTitle primaryText you-own-this">Digital Album</span>
<div class="digitaldescription secondaryText"> Streaming + Download </div>
</h3>
...
</li>
This is from a dump of the html variable returned by the get function and passed into the parser function here: https://github.com/masterT/bandcamp-scraper/blob/master/lib/index.js#L58
First, I don't own this. Second, how would the scraper know that if the request is being made from node? Seems like a weird edge case, but I am seeing this behavior consistently on specific URLs.
Either way, I assume this is the cause of the duplicated text. I'm going to try to debug this further, but I wanted to post this as a correction to my initial report, which said the element contained the text only once.
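That assumption can be checked with a small sketch. If the parser reads the title through a selector that matches both spans, the text of every match ends up joined together (cheerio's .text() returns the combined text of all matched elements). The code below uses a plain regex as a stand-in for the HTML parser so that it runs without dependencies; packageTitles is a hypothetical helper, and the markup is a trimmed copy of the sample above.

```javascript
// Collect every span carrying the buyItemPackageTitle class.
// Regex stand-in for a real parser; only suitable for markup shaped like the sample.
function packageTitles(html) {
  const pattern =
    /<span\s+class="([^"]*\bbuyItemPackageTitle\b[^"]*)"[^>]*>([^<]*)<\/span>/g;
  return [...html.matchAll(pattern)].map(([, classes, text]) => ({
    classes: classes.split(/\s+/),
    text: text.trim(),
  }));
}

const markup = `
<h3 class="hd">
<button class='download-link buy-link' type="button">
<span class="buyItemPackageTitle primaryText">Digital Album</span>
</button>
<span class="buyItemPackageTitle primaryText you-own-this">Digital Album</span>
</h3>`;

const titles = packageTitles(markup);

// Joining the text of every match reproduces the doubled value.
const joined = titles.map((t) => t.text).join('');

// Skipping the you-own-this span leaves a single title.
const filtered = titles
  .filter((t) => !t.classes.includes('you-own-this'))
  .map((t) => t.text)
  .join('');

console.log({ joined, filtered });
```

In the scraper itself, the equivalent change would presumably be to narrow the selection, for example with cheerio's .first() or .not('.you-own-this'), though that depends on how the parser currently selects the element.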
Also, I'm not sure what's happening with this line, const $ = cheerio.load(html), but by the time I dump the data variable defined here, the duplicate text is present:
{
products: [
{
imageUrls: [],
name: 'Digital AlbumDigital Album',
nameFallback: '',
format: 'Digital AlbumDigital Album',
formatFallback: '',
priceInCents: 350,
currency: 'EUR',
offerMore: true,
soldOut: false,
nameYourPrice: false,
description: 'Includes unlimited streaming via the free Bandcamp app, plus high-quality download in MP3, FLAC and more.'
}
]
}
When using getAlbumProducts, some URLs return duplicated strings for the format prop.
For example:
This consistently returns "Digital AlbumDigital Album" as the format. I'm not sure how this is happening, since the .buyItemPackageTitle element only contains this text once.
This seems to happen to certain URLs consistently, ex:
I'm using a random URL out of a set of 1000 for debugging in my app, and I'm seeing this ~5% of the time.
It also seems to happen to the name prop for some URLs, and I'm also seeing the string "Full Digital Discography" doubled.