masterT / bandcamp-scraper

A scraper for https://bandcamp.com
MIT License
194 stars, 34 forks

Duplicate Product Format String #61

Open ian-pvd opened 2 years ago

ian-pvd commented 2 years ago

When using getAlbumProducts, some URLs return duplicated strings for the format prop.

For example:

bandcamp.getAlbumProducts('https://bandcamp.prspct.nl/album/the-hardcore-party-ep', function (error, albumProducts) {
    console.log(albumProducts);
});

This consistently returns "Digital AlbumDigital Album" as the format. I'm not sure how this is happening, since the .buyItemPackageTitle element only contains this text once.

This seems to happen to certain URLs consistently, ex:

I'm using a random URL out of a set of 1000 for debugging in my app, and I'm seeing this ~5% of the time.

It also seems to happen to the name prop for some URLs, and I'm also seeing the string "Full Digital Discography" doubled.

ian-pvd commented 2 years ago

Depending on your needs, you could just pull it from the JSON part of the page, example:

.albumRelease[0].musicReleaseFormat

I'm seeing this in the application/ld+json tag in the page markup, but where do I find it in the scraper results? I'm not seeing it in the AlbumInfo response. If there's a way to avoid making multiple scraper requests for the Album Info and then also the digital product price, that'd be really helpful.
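A minimal sketch of reading that field straight from the page markup, assuming you already have the album page HTML as a string. getReleaseFormat is a hypothetical helper, not part of bandcamp-scraper, and the sample page and its "ExampleFormat" value are made up for illustration. Only the albumRelease[0].musicReleaseFormat path comes from the suggestion above.

```javascript
// Hypothetical helper (not part of bandcamp-scraper): read
// albumRelease[0].musicReleaseFormat from the page's JSON-LD script tag.
function getReleaseFormat(html) {
  const match = html.match(
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return null;

  let data;
  try {
    data = JSON.parse(match[1]);
  } catch (err) {
    return null; // the tag was present but did not contain valid JSON
  }
  if (!data || typeof data !== 'object') return null;

  const releases = Array.isArray(data.albumRelease) ? data.albumRelease : [];
  const first = releases[0];
  return first && typeof first.musicReleaseFormat === 'string'
    ? first.musicReleaseFormat
    : null;
}

// Made-up sample page; the real value on Bandcamp pages may differ.
const samplePage = [
  '<html><head>',
  '<script type="application/ld+json">',
  JSON.stringify({ albumRelease: [{ musicReleaseFormat: 'ExampleFormat' }] }),
  '</script>',
  '</head><body></body></html>',
].join('\n');

console.log(getReleaseFormat(samplePage)); // ExampleFormat
```

This still needs the raw HTML, so it does not by itself remove the extra request unless the HTML fetched for the product lookup is reused.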

Also, it's trivial to work around this by matching with startsWith instead of strict equality, which still gives a positive match on "Digital AlbumDigital Album", but I figured this response from the scraper deserved a bug report at least.
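For anyone hitting the same thing, a rough sketch of that workaround on the consumer side. Both isDigitalAlbum and collapseDoubled are hypothetical helper names, not part of the scraper's API. The second one goes a step further than startsWith and tries to undo the doubling.

```javascript
// Hypothetical helper: tolerant match that still succeeds on the doubled string.
function isDigitalAlbum(product) {
  return typeof product.format === 'string' &&
    product.format.startsWith('Digital Album');
}

// Hypothetical helper: if a string is exactly the same text repeated twice,
// return a single copy; otherwise return the input unchanged.
function collapseDoubled(text) {
  const half = text.length / 2;
  if (
    text.length > 0 &&
    text.length % 2 === 0 &&
    text.slice(0, half) === text.slice(half)
  ) {
    return text.slice(0, half);
  }
  return text;
}

const product = { format: 'Digital AlbumDigital Album' };
console.log(isDigitalAlbum(product));         // true
console.log(collapseDoubled(product.format)); // Digital Album
```

Neither is a real fix. startsWith also matches any other format that merely begins with "Digital Album", and collapseDoubled would mangle a name that is legitimately one word repeated twice.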

ian-pvd commented 2 years ago

Further debugging seems to show that the releases where this is occurring actually do have two .buyItemPackageTitle spans inside the release list item.

Markup for a result without the issue:

<li class="buyItem digital">
    <h3 class="hd">
        <button class='download-link buy-link' type="button">
            <span class="buyItemPackageTitle primaryText">Digital Album</span>
        </button>
        <div class="digitaldescription secondaryText">  Streaming + Download </div>
    </h3>
    ...
</li>

Markup returned for a result with the duplicate text issue:

<li class="buyItem digital">
    <h3 class="hd">
        <button class='download-link buy-link' type="button">
            <span class="buyItemPackageTitle primaryText">Digital Album</span>
        </button>
        <span class="buyItemPackageTitle primaryText you-own-this">Digital Album</span>
        <div class="digitaldescription secondaryText">  Streaming + Download </div>
    </h3>
    ...
</li>

This is from a dump of the html variable returned by the get function and passed into the parser function here: https://github.com/masterT/bandcamp-scraper/blob/master/lib/index.js#L58

First, I don't own this release. Second, how would the scraper know that if the request is being made from Node? Seems like a weird edge case, but I am seeing this behavior consistently on specific URLs.

Either way, I assume this is the cause of the duplicated text. I'm going to try to debug this further, but I wanted to post this as a correction to my initial report, which said the text only appears once in the markup.
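To illustrate that assumption, here is a small standalone sketch. It assumes the parser ends up joining the text of every element that matches .buyItemPackageTitle, which is how cheerio's .text() behaves on a multi-element selection. I have not confirmed that this is what the scraper does internally. findPackageTitles is a hypothetical regex-based stand-in for the real parser, just so the example runs without dependencies.

```javascript
// Markup copied from the affected result (trimmed to the relevant part).
const ownedMarkup = `
<li class="buyItem digital">
    <h3 class="hd">
        <button class='download-link buy-link' type="button">
            <span class="buyItemPackageTitle primaryText">Digital Album</span>
        </button>
        <span class="buyItemPackageTitle primaryText you-own-this">Digital Album</span>
        <div class="digitaldescription secondaryText">  Streaming + Download </div>
    </h3>
</li>`;

// Hypothetical stand-in for the parser: collect every span whose class
// attribute starts with buyItemPackageTitle, with its classes and text.
function findPackageTitles(html) {
  const pattern = /<span class="(buyItemPackageTitle[^"]*)">([^<]*)<\/span>/g;
  return Array.from(html.matchAll(pattern), (m) => ({
    classes: m[1].split(/\s+/),
    text: m[2].trim(),
  }));
}

const titles = findPackageTitles(ownedMarkup);

// Joining all matches reproduces the bug.
const concatenated = titles.map((t) => t.text).join('');

// Two possible ways around it: take only the first match,
// or skip the span carrying the you-own-this class.
const firstOnly = titles[0].text;
const notOwned = titles
  .filter((t) => !t.classes.includes('you-own-this'))
  .map((t) => t.text)
  .join('');

console.log(concatenated); // Digital AlbumDigital Album
console.log(firstOnly);    // Digital Album
console.log(notOwned);     // Digital Album
```

If the assumption holds, narrowing the selector to the first match or excluding .you-own-this in the parser would probably be enough, but that needs checking against the actual parser code.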

Also, I'm not sure what's happening with this line const $ = cheerio.load(html), but by the time I dump the data variable defined here, the duplicate text is present:

{
  products: [
    {
      imageUrls: [],
      name: 'Digital AlbumDigital Album',
      nameFallback: '',
      format: 'Digital AlbumDigital Album',
      formatFallback: '',
      priceInCents: 350,
      currency: 'EUR',
      offerMore: true,
      soldOut: false,
      nameYourPrice: false,
      description: 'Includes unlimited streaming via the free Bandcamp app, plus high-quality download in MP3, FLAC and more.'
    }
  ]
}