Fix search on alternative domain names

snejus / beetcamp

Bandcamp autotagger source for beets (http://beets.io)

GNU General Public License v2.0

64 stars 11 forks source link

Fix search on alternative domain names #40

Closed shagr4th closed 1 year ago

shagr4th commented 1 year ago

Additionnal fix to #37

marty-oehme commented 1 year ago

Haven't had a lot of time to look deeply into it, but from a practical standpoint just tried your branch against the files beetcamp errored out on for me previously and it works perfectly!

snejus commented 1 year ago

Thanks for this, looks good. Could you by any chance add a single test case with an example URL to test_search.py?

I'll configure the test/lint build to run for every new PR in the future.

snejus commented 1 year ago

That's great, thanks a lot. Will release this as 0.16.2 in a second.

snejus commented 1 year ago

Hah, I've re-run the search against the URLs that were failing and found a couple of them still failed (fan URLs):

Had a look at the regular expressions and found the following:

[\w/.-]+ needed to be replaced by [^?]+ in every case (matches everything until the question mark) - ensuring the match stops with the first query param
Ordering is indeed important, although the alternative URLs pattern needs to be the last, otherwise it gets in the way of parsing fan URLs that look like https://bandcamp.com/some_fan_nickname?from=search.

My bad, should have raised this earlier but was slightly too excited of this going out 🤦🏽 have fixed it in the main branch now; thanks loads for your support! Will try adding some contributor guidelines soon, hope it's not your last PR 😉

shagr4th commented 1 year ago

Nice catch, I didn't try every url ! But are you sure about the ordering ? pytest fails if the alternative domain regex is not the first to run, because url and label will be overriden with "label.bandcamp.com" (the whole domain), and not just "label", which is actually expected by the "expected_label" variable in test_search.py

snejus commented 1 year ago

Indeed, it actually didn't end up being as simple as I described above. Ultimately I ended up with

Keeping the first found match to address the issue you mentioned above

    for pat in RELEASE_PATTERNS:
        m = pat.search(text)
        if m:
-            result.update(m.groupdict())
+            result = {**m.groupdict(), **result}

Parsing the latter URL (the one inside the \<a> tag) from the html

<a href="https://taro.bandcamp.com/track/ii-22-remix?from=search&amp...">https://taro.bandcamp.com/track/ii-22-remix</a>

Using patterns below to obtain the label and URL separately (otherwise artist URLs and labels are mishandled)

    re.compile(r">https://bandcamp\.(?P<label>[^.<]+)\.[^<]+<"),
    re.compile(r">https://(?P<label>[^.]+)\.bandcamp\.[^<]+<"),
    re.compile(r">https://(?P<label>(?!bandcamp)[^/]+)\.[^<]+<"),
    re.compile(r">(?P<url>https://[^<]+)<"),