WIP: Start hosts file with reg-ex-ed domains from

katrinleinweber commented 6 years ago

Hi! Following the suggestions in #9 and https://github.com/StevenBlack/hosts/issues/720, I wanted to test whether adblockers can subscribe to such a hosts file. Yes, they can:

[Image: grafik]

So, this PR proposes adding a hosts file to /_data, which I merged manually from domains I extracted from the journals & publishers lists. It probably still contains a few false positives, so please don't merge until we have discussed how to auto-generate such a file, if one is desired.

For now, though, it can be tested by adding this URL to adblockers like uBlock: https://raw.githubusercontent.com/stop-predatory-journals/stop-predatory-journals.github.io/0e64ce25fa147df7d0a79660ec91526c6436eb88/_data/hosts. If some adblockers do not support this hosts file format, we can find a more widely supported one and update this PR.

lucboruta commented 6 years ago

Hi, great idea!

I'm interested in flagging URLs/URIs from predatory journals on Cobaltmetrics.com. I was about to generate a list of hosts too, good thing I checked the PRs.

I reviewed your list, and I have one suggestion regarding false positives. Some hosts host multiple journals, and I think there are cases where we don't want to block all URLs from a given host (does one bad apple spoil the whole bunch?).

For example, journals.csv includes http://journals.sfu.ca/africanem/index.php/ajtcam/index (cf. line 20), but I don't think we want to include journals.sfu.ca in the list of hosts (cf. line 1581).

What about focusing first on URLs whose path is empty or / (HTTP defines an empty path to be equivalent to / anyway), plus a few obvious root-like paths, e.g. anything that matches ^/(default|index|home)\.(aspx|html?|php)$ in a case-insensitive way?

Maybe also add the constraint that a URL must have no query component to be included in the list? That would avoid false positives when an acceptable host hosts multiple journals and the name of the journal is given in the query string, e.g. https://goodhost/index.php?journal=badjournal.
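
To make the two constraints above concrete, here is a minimal Python sketch. The function name `blockable_host` is made up for illustration and is not part of this repository; the regular expression is the one quoted above. It returns a host only when the URL has no query component and its path is empty, /, or root-like.

```python
import re
from urllib.parse import urlsplit

# Root-like paths as proposed above, matched case-insensitively.
ROOT_LIKE = re.compile(r"^/(default|index|home)\.(aspx|html?|php)$", re.IGNORECASE)


def blockable_host(url):
    """Return the host if the URL looks like a site root, else None.

    Hypothetical helper: a URL qualifies only if it has no query
    component and its path is empty, "/", or matches ROOT_LIKE.
    """
    parts = urlsplit(url)
    if parts.query:
        # The journal may be named in the query string; do not block the host.
        return None
    if parts.path in ("", "/") or ROOT_LIKE.match(parts.path):
        return parts.hostname  # urlsplit lowercases the hostname
    return None
```

With this rule, http://journals.sfu.ca/africanem/index.php/ajtcam/index and https://goodhost/index.php?journal=badjournal both yield no host, so neither journals.sfu.ca nor goodhost would end up in the list.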

katrinleinweber commented 6 years ago

Good point! Using domains for blocking is rather coarse, and probably too broad. Since there has been no reaction from @stoppredatoryjournals, I guess this can be closed as out of scope.

Maybe a better approach would be to PR a conversion pipeline from _data/*.csv to an adblocker-compatible file format.

Could such a pipeline then handle the don't-block-after-all features you mention?
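
As a rough idea of what such a pipeline could look like, here is a Python sketch. Everything in it is an assumption: the column name `url` is a placeholder (I have not checked the actual headers in _data/*.csv), `csv_to_hosts` is a made-up name, and the `0.0.0.0 host` line format is the common hosts-file convention rather than something this repo prescribes. The `allow` set is one simple way to express "don't block after all" for shared hosts.

```python
import csv
import io
from urllib.parse import urlsplit


def csv_to_hosts(csv_text, url_column="url", allow=frozenset()):
    """Convert CSV text with a URL column into hosts-file lines.

    Hypothetical helper: `url_column` is an assumed column name, and
    `allow` lists hosts that must never be blocked (e.g. shared
    platforms that also host legitimate journals).
    """
    hosts = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Missing or empty cells yield no hostname and are skipped.
        host = urlsplit(row.get(url_column) or "").hostname
        if host and host not in allow:
            hosts.add(host)
    # One "0.0.0.0 host" line per unique host, sorted for stable diffs.
    return "".join(f"0.0.0.0 {h}\n" for h in sorted(hosts))
```

Sorting the output keeps regenerated files diff-friendly, which would matter if the file were rebuilt on every change to the CSVs.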

lucboruta commented 6 years ago

I don't know much about the internals of adblockers, e.g. whether the biggest adblockers use the same syntax for their filters, but I eyeballed a few lists from https://filterlists.com/, and including paths (rather than just domains and hosts) seems possible.

In any case, yes, if the list you want to build is derived from the "main" lists, I think the code would be more valuable than the result.

lucboruta commented 6 years ago

Oh, and I extracted the set of paths from all URLs in _data/*.csv, lowercased everything and filtered out paths that contain acronyms or what looked like site- or journal-specific information. Here's the Gist: https://gist.github.com/lucboruta/0ea6ab3adac42f8eba6237ee9847c308

The list isn't very long, but there is more variation than I expected. We can't know for sure whether we can block the whole domain without some kind of manual validation.
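
For reference, the automatic part of that extraction can be sketched in a few lines of Python. This is not the exact script behind the Gist: `path_counts` is a made-up name, and the filtering of acronyms and site- or journal-specific paths was done by eye, so it is not reproduced here.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_counts(urls):
    """Count lowercased URL paths, treating an empty path as "/".

    Hypothetical helper: it only covers the mechanical step (extract
    and lowercase); judging which paths are generic stays manual.
    """
    return Counter(urlsplit(u).path.lower() or "/" for u in urls)
```

Calling `path_counts(urls).most_common()` on the URLs from the CSVs gives a quick view of which paths are frequent enough to be worth reviewing first.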

stop-predatory-journals / stop-predatory-journals.github.io

WIP: Start hosts file with reg-ex-ed domains from #11