tarampampam / mikrotik-hosts-parser

✂ Mikrotik hosts parser

Merge records with similar "n-level" domain name and save them as "regexp" rule + algorithm explained + sample script #49

Closed: jimmycr closed this issue 4 years ago

jimmycr commented 4 years ago

Is your feature request related to a problem?

There are a lot of records that differ only at the lowest domain level. For example, these differ at the 4th level only:

```
10148.engine.mobileapptracking.com
10402.engine.mobileapptracking.com
10896.engine.mobileapptracking.com
13146.engine.mobileapptracking.com
13248.engine.mobileapptracking.com
14012.engine.mobileapptracking.com
15486.engine.mobileapptracking.com
...
```

Describe the solution you'd like

Is it possible to add a global "checkbox" (use / don't use) and a global "numeric up/down" selector ("strip the n-th level and up") that chooses from which domain-name level to merge these records? The result for level 4 and up would then be just one record (using a REGEXP rule):

```
add address=127.0.0.1 comment="ADBlock" disabled=no regexp="engine.mobileapptracking.com"
```

Additional context

I hope this can significantly reduce the rule count on the MikroTik.

By the way: how about translating the WEB interface to English?

Thanks

Jan

TerAnYu commented 4 years ago

#14: a similar question was asked there, but no optimization algorithm came out of it.

jimmycr commented 4 years ago

I found the same "closed" question in Russian, with the conclusion that there is no algorithm to reduce the list. But I think it's not that hard to do for a "level"-based setting:

1. Strip all domain-name levels equal to or higher than the configured level and save the stripped names to a separate file.
2. Remove duplicates in that file; you get a list of unique domain names (stripped of the "n-th+" levels).
3. Read this file line by line and check how many times the stripped domain appears in the original file.
4. If you find more than one record, write a "regexp" rule; if you find only one, write a "name" rule from the original (unstripped) list.

For example, the original list:

```
10148.engine.mobileapptracking.com
10402.engine.mobileapptracking.com
10896.engine.mobileapptracking.com
13146.engine.mobileapptracking.com
13248.engine.mobileapptracking.com
14012.engine.mobileapptracking.com
15486.engine.mobileapptracking.com
ads.2mdnsys.com
cfa.2mdnsys.com
static.twinpine.adatrix.com
```

Stripped list (level 4 selected):

```
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
ads.2mdnsys.com
cfa.2mdnsys.com
twinpine.adatrix.com
```

Stripped list without duplicates (level 4 selected), using the command `awk '!seen[$0]++' filename`:

```
engine.mobileapptracking.com
ads.2mdnsys.com
cfa.2mdnsys.com
twinpine.adatrix.com
```
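As a quick illustration of that awk idiom (nothing project-specific, just the command quoted above): it keeps only the first occurrence of every line, preserving order.

```sh
$ printf 'a\nb\na\nb\nc\n' | awk '!seen[$0]++'
a
b
c
```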

Final list:

```
/ip dns static
add address=127.0.0.1 comment="ADBlock" disabled=no regexp="engine.mobileapptracking.com"
add address=127.0.0.1 comment="ADBlock" disabled=no name="ads.2mdnsys.com"
add address=127.0.0.1 comment="ADBlock" disabled=no name="cfa.2mdnsys.com"
add address=127.0.0.1 comment="ADBlock" disabled=no name="static.twinpine.adatrix.com"
```

I'm using regexp rules at home, and there is no need to write a full regular expression: just write the "common" part.
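One caveat (my note, not part of the original comment): an unescaped dot in a regexp matches any character, so the "common" part can also match unrelated names. A quick demonstration with `grep -E`, which uses comparable POSIX regex syntax:

```sh
# The unescaped dots let through a name that only resembles the target domain:
$ echo "engineXmobileapptracking.com" | grep -E "engine.mobileapptracking.com"
engineXmobileapptracking.com
```

For ad-block lists this is usually harmless, but escaping the dots (engine\.mobileapptracking\.com) would make the rule strict.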

Edit: algorithm explained

jimmycr commented 4 years ago

Another algorithm (maybe easier; I will try to prepare a shell script): read the original file line by line, strip the selected domain levels, and try to find duplicates of the stripped string in the original file. If any exist, write a "regexp" rule (the stripped domain); otherwise write a "name" rule with the whole domain. Finally, deduplicate the written (final) file with `awk '!seen[$0]++' filename` so there is always exactly one regexp rule per stripped domain.

jimmycr commented 4 years ago

So here is my shell script. It was tested with the https://cdn.jsdelivr.net/gh/tarampampam/mikrotik-hosts-parser@master/.hosts/basic.txt and https://adaway.org/hosts.txt lists; with parameter 4 it produced about 20% fewer DNS rules from the adaway hosts.txt file. You can save it, for example, as parse.sh and run it with 2 parameters (validation of the input parameters is not implemented!), like this:

```sh
./parse.sh ./hosts.txt 4
```

Where:

- `./hosts.txt` is the file you want to parse
- `4` is the domain level to check: 4 means cut the 4th-and-higher levels of the domain name, counting from the end

Example for parameter 4:

```
10148.engine.mobileapptracking.com              => engine.mobileapptracking.com
abcdef.10148.engine.mobileapptracking.com       => engine.mobileapptracking.com
zxcvb.abcdef.10148.engine.mobileapptracking.com => engine.mobileapptracking.com
```
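This stripping step can be checked on its own, one domain at a time (a standalone sketch of the same rev/sed/rev trick the script uses; the `4g` occurrence flag assumes GNU sed):

```sh
# Reverse the name, drop the 4th-and-higher dot-separated labels, reverse back.
$ echo "zxcvb.abcdef.10148.engine.mobileapptracking.com" | rev | sed 's/.[^.]*//4g' | rev
engine.mobileapptracking.com
```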

The script might not be perfect; I did it for fun and tried my best ;)

```sh
#!/bin/sh

sed -i -E "s/^(127\.0\.0\.1|::1)[[:space:]]+//" $1   # strip the "127.0.0.1 " / "::1 " prefix from the lines
sed -i "/^$/d;/^#/d;/^ /d" $1                        # remove comments, empty lines and lines starting with a space
rev $1 > ./reverse.tmp                               # reverse every line, so domain levels count from the end
sed -i "s/.[^.]*//$2g" ./reverse.tmp                 # drop the n-th+ domain labels (parameter 2 of the script)
rev ./reverse.tmp > ./filtered.tmp                   # reverse the trimmed domain names back to normal
awk '!seen[$0]++' ./filtered.tmp > ./unique.tmp      # remove duplicate lines

printf "/ip dns static\n" > ./ADblock.rsc            # let's start writing our script to the file ADblock.rsc :)
while IFS="" read -r multi || [ -n "$multi" ]        # read every line of unique addresses in unique.tmp
do
    line=$(grep -m1 --only-matching "$multi" ./$1)        # if a match is found in the original file, save 1 occurrence
    count=$(grep --only-matching "$multi" ./$1 | wc -l)   # also count the occurrences
    domain=$(echo "${multi}" | awk -F"." '{print NF-1}')  # count the "." in the unique address
    domain=$(($domain+2))                                 # add 2 so we get the "domain level" of the address
    if [ $count -gt "1" ] && [ $domain -eq "$2" ]; then   # more occurrences found and we are checking the desired level
        printf "add address=127.0.0.1 comment=\"ADBlock\" disabled=no regexp=\"$multi\"\n" >> ./ADblock.rsc   # write a regexp rule
    else
        printf "add address=127.0.0.1 comment=\"ADBlock\" disabled=no name=\"$line\"\n" >> ./ADblock.rsc      # otherwise write a name rule
    fi
done < unique.tmp
rm -f ./*.tmp   # clean up the temp files
rm -f $1        # clean up the original file
```