Closed shagr4th closed 1 year ago
Haven't had a lot of time to look deeply into it, but from a practical standpoint just tried your branch against the files beetcamp errored out on for me previously and it works perfectly!
Thanks for this, looks good. Could you by any chance add a single test case with an example URL to test_search.py?
I'll configure the test/lint build to run for every new PR in the future.
That's great, thanks a lot. Will release this as 0.16.2 in a second.
Hah, I've re-run the search against the URLs that were failing and found a couple of them still failed (fan URLs):
Had a look at the regular expressions and found that [\w/.-]+ needed to be replaced by [^?]+ in every case. The latter matches everything up to the question mark, ensuring the match stops at the first query parameter, as in https://bandcamp.com/some_fan_nickname?from=search. My bad, I should have raised this earlier but was slightly too excited about this going out 🤦🏽 I have fixed it in the main branch now; thanks loads for your support! Will try adding some contributor guidelines soon, hope it's not your last PR 😉
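For anyone following along, here is a minimal sketch of how the two character classes behave. The `https://` prefix and the second URL (with a `~` in the path) are assumptions made for illustration only; the actual patterns in the plugin are longer and are not reproduced here.

```python
import re

# Hypothetical, simplified patterns: only the character class comes from the thread.
OLD = re.compile(r"https://[\w/.-]+")
NEW = re.compile(r"https://[^?]+")

# URL quoted in the comment above.
fan_url = "https://bandcamp.com/some_fan_nickname?from=search"
# Made-up URL whose path contains a character outside [\w/.-].
odd_url = "https://bandcamp.com/some~fan?from=search"


def first_match(pattern, text):
    """Return the matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None


for url in (fan_url, odd_url):
    print(first_match(OLD, url), "|", first_match(NEW, url))
```

In this sketch both classes stop before `?from=search`, but only `[^?]+` keeps the whole path when it contains a character such as `~`, which the old class truncates.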
Nice catch, I didn't try every URL! But are you sure about the ordering? pytest fails if the alternative domain regex is not the first to run, because url and label will be overridden with "label.bandcamp.com" (the whole domain), and not just "label", which is what the "expected_label" variable in test_search.py expects.
Indeed, it actually didn't end up being as simple as I described above. Ultimately I ended up with
Keeping the first found match to address the issue you mentioned above
  for pat in RELEASE_PATTERNS:
      m = pat.search(text)
      if m:
-         result.update(m.groupdict())
+         result = {**m.groupdict(), **result}
Parsing the latter URL (the one inside the <a> tag) from the HTML
<a href="https://taro.bandcamp.com/track/ii-22-remix?from=search&...">https://taro.bandcamp.com/track/ii-22-remix</a>
Using patterns below to obtain the label and URL separately (otherwise artist URLs and labels are mishandled)
re.compile(r">https://bandcamp\.(?P<label>[^.<]+)\.[^<]+<"),
re.compile(r">https://(?P<label>[^.]+)\.bandcamp\.[^<]+<"),
re.compile(r">https://(?P<label>(?!bandcamp)[^/]+)\.[^<]+<"),
re.compile(r">(?P<url>https://[^<]+)<"),
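To see how the three points above fit together, here is a small self-contained sketch that runs the four patterns against the HTML snippet quoted earlier. The function name `parse_release` and its `keep_first` flag are hypothetical and exist only for this comparison; the patterns, the `RELEASE_PATTERNS` name, and the merge expression are taken from the comment as written.

```python
import re

# Patterns copied from the comment above; their order matters.
RELEASE_PATTERNS = [
    re.compile(r">https://bandcamp\.(?P<label>[^.<]+)\.[^<]+<"),
    re.compile(r">https://(?P<label>[^.]+)\.bandcamp\.[^<]+<"),
    re.compile(r">https://(?P<label>(?!bandcamp)[^/]+)\.[^<]+<"),
    re.compile(r">(?P<url>https://[^<]+)<"),
]

# HTML snippet from the comment above; the "..." in the href is elided there as well.
HTML = (
    '<a href="https://taro.bandcamp.com/track/ii-22-remix?from=search&...">'
    "https://taro.bandcamp.com/track/ii-22-remix</a>"
)


def parse_release(text, keep_first=True):
    """Hypothetical helper: collect named groups from every matching pattern."""
    result = {}
    for pat in RELEASE_PATTERNS:
        m = pat.search(text)
        if m:
            if keep_first:
                # Existing keys win, so the first pattern to match sets the value.
                result = {**m.groupdict(), **result}
            else:
                # Later matches overwrite earlier ones (the previous behaviour).
                result.update(m.groupdict())
    return result


print(parse_release(HTML, keep_first=True))
print(parse_release(HTML, keep_first=False))
```

With `keep_first=True` the label stays `taro`, set by the second pattern. With `update()` the third, more general pattern overwrites it with `taro.bandcamp`, which is the kind of override raised in the review comment.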
Additional fix to #37